Version: 6.3.3

Podspec

When you add a podspec to an action deployment, you can more directly manage the resources available to the action. The podspec feature works by patching the Kubernetes PodTemplateSpec generated by Fabric.

Apply podspec patch

Cortex Fabric provides a minimum PodSpec when deploying actions to Kubernetes. To add or patch additional values, such as resource limits, resource requests, and GPUs, the podspec attribute was added to the actions deploy API.

Podspec patches may only add values to the pod template generated by Fabric; you cannot remove keys. See RFC 6902 (JSON Patch), section 4.1 "add", for a complete description of how the JSON Patch add operation works.

The podspec is an array of JSON objects, each containing a path and a value:

[
  { "path": "<JSON Pointer string>", "value": <JSON value> }
]

path is a JSON Pointer string; it is always relative to the Kubernetes PodSpec and must start with '/'. See JSON Pointer (RFC 6901) for a description of the string format. Paths in Fabric should normally start with /containers/0, because Fabric actions deploy only a single container in each pod.

value is any valid JSON value: an object, an array, a "string", a number, or a boolean.
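To make the add semantics concrete, here is a minimal Python sketch of how an RFC 6902 add operation resolves a JSON Pointer path and inserts a value. This is illustrative only, not Fabric's implementation, and it omits JSON Pointer escaping (~0/~1):

```python
def json_patch_add(doc, path, value):
    """Apply a single RFC 6902 "add" operation to a dict/list document."""
    parts = path.lstrip("/").split("/")
    target = doc
    for key in parts[:-1]:  # walk to the parent of the target location
        target = target[int(key)] if isinstance(target, list) else target[key]
    last = parts[-1]
    if isinstance(target, list):
        if last == "-":
            target.append(value)             # "-" appends to an array
        else:
            target.insert(int(last), value)  # a numeric index inserts before it
    else:
        target[last] = value                 # objects: add (or replace) the member
    return doc

# Applying a resources patch to a bare PodSpec-like dict:
pod_spec = {"containers": [{"name": "action"}]}
json_patch_add(pod_spec, "/containers/0/resources",
               {"requests": {"memory": "4G"}, "limits": {"memory": "6G"}})
```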

There are three steps to applying a podspec patch to actions or Skills.

  1. You must define a podspec patch and save it locally as a JSON or YAML file (with the --yaml option).

    In the podspec patch examples below, a memory resource allocation is applied to container 0. The requests values reserve the amount of memory and ephemeral storage the container requires. The limits value sets the maximum memory the container may consume before the process is killed.

    JSON example:

    [
      {
        "path": "/containers/0/resources",
        "value": {
          "requests": {
            "memory": "4G",
            "ephemeral-storage": "100G"
          },
          "limits": {
            "memory": "6G"
          }
        }
      }
    ]

    YAML example:

    - path: /containers/0/resources
      value:
        requests:
          memory: "4G"
          ephemeral-storage: "100G"
        limits:
          memory: "6G"
  2. In either the cortex actions deploy or the cortex skills save command, use the --podspec parameter, which takes the local file path of the podspec patch as its value. This ensures that the patch is pulled in at runtime.

  3. At runtime the podspec patch is merged into the actions or Skills that are deployed to Kubernetes.

    • For daemons: the patch is applied to the ReplicaSet at /spec/template/spec
    • For jobs: the patch is applied to the Job at /spec/template/spec
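The runtime merge can be sketched as follows. This is assumed behavior for illustration, not Fabric source: because patch paths are relative to the PodSpec, they effectively resolve under /spec/template/spec of the generated Job or ReplicaSet manifest:

```python
def merge_podspec(workload, patches):
    """Sketch: apply podspec "add" patches to a workload manifest.

    Patch paths are relative to the PodSpec, which sits at
    /spec/template/spec in both ReplicaSets (daemons) and Jobs.
    """
    pod_spec = workload["spec"]["template"]["spec"]
    for patch in patches:
        parts = patch["path"].lstrip("/").split("/")
        target = pod_spec
        for key in parts[:-1]:  # walk to the parent of the target location
            target = target[int(key)] if isinstance(target, list) else target[key]
        last = parts[-1]
        if isinstance(target, list) and last == "-":
            target.append(patch["value"])           # append to an array
        elif isinstance(target, list):
            target.insert(int(last), patch["value"])  # insert at a numeric index
        else:
            target[last] = patch["value"]           # add/replace an object member
    return workload

# A minimal Job-shaped manifest patched with a memory limit:
job = {"spec": {"template": {"spec": {"containers": [{"name": "action"}]}}}}
merge_podspec(job, [{"path": "/containers/0/resources",
                     "value": {"limits": {"memory": "6G"}}}])
```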

Example working with the container env vars

In the example below, you can see the range of customization that can be built into a podspec patch. Three environment-variable patches are applied to container 0 in different ways: the first inserts env1 at index 0 of the env array, the second appends env2 to the end of the array, and the third overwrites the value of the entry at index 0 (which, after the first patch, is env1).

[
  {
    "path": "/containers/0/env/0",
    "value": {
      "name": "env1",
      "value": "new env var at index 0"
    }
  },
  {
    "path": "/containers/0/env/-",
    "value": {
      "name": "env2",
      "value": "new env var appended to end of array"
    }
  },
  {
    "path": "/containers/0/env/0/value",
    "value": "updated env var value at index 0"
  }
]
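Because the patches apply in order, their combined effect can be traced with plain Python list operations that mirror the JSON Patch add semantics. The starting env entry here is hypothetical:

```python
# Hypothetical starting env array with one existing variable.
env = [{"name": "EXISTING", "value": "x"}]

# Patch 1: add at /containers/0/env/0 inserts before index 0.
env.insert(0, {"name": "env1", "value": "new env var at index 0"})

# Patch 2: add at /containers/0/env/- appends to the end of the array.
env.append({"name": "env2", "value": "new env var appended to end of array"})

# Patch 3: add at /containers/0/env/0/value replaces an existing object
# member, so it overwrites env1's value.
env[0]["value"] = "updated env var value at index 0"
```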

GPU resource allocation

Podspec support was added to Cortex Fabric to enable another feature: GPU (graphics processing unit) allocation. GPUs improve processing speed for resource-expensive workloads like training models.

Podspec patches allow nodes to be reserved for GPU-only jobs. The GPU nodes are tainted, and a podspec patch is applied to allow Fabric actions to run on them. Taints are normally defined by the GPU vendor (for example, NVIDIA) and applied when GPU nodes are added to a Kubernetes cluster.

For a pod to use GPUs it MUST do two things:

  • It must request a GPU resource under /containers/0/resources.
  • It must tolerate the node's taint under /tolerations so it can be scheduled on a GPU node.

Example GPU pod's podspec.json:

[
  {
    "path": "/containers/0/resources",
    "value": {
      "limits": {
        "nvidia.com/gpu": "1"
      }
    }
  },
  {
    "path": "/tolerations",
    "value": [
      {
        "key": "nvidia.com/gpu",
        "operator": "Equal",
        "value": "true",
        "effect": "NoSchedule"
      }
    ]
  }
]

Example spec.yaml with GPU resources and tolerations

resources:
  limits:
    nvidia.com/gpu: "1" # requesting 1 GPU
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

Logging with podspec

Logging and metrics are delegated to aggregators defined on the cluster and are NOT provided in the Fabric Console.

To access logs you must use a third-party aggregator such as Grafana, CloudWatch, or Azure Dashboards.

Metrics are delegated to tools such as Prometheus and CloudWatch.

Further reading