With the LLM era upon us, I’ve been wanting to play around with some of the open source, self-hosted toys available. I’m using an old workstation as a homelab, which conveniently has an old NVIDIA GPU installed. Seeing as I’m running a Kubernetes cluster I want to expose the GPU to the workloads to utilise the existing infrastructure for easy hosting, scheduling, and deployment of GPU assisted applications.
This article is mainly intended to serve as reference material for myself when I get started with the actual applications, but I hope it will be of use to others as well.
I’m currently running a Kubernetes 1.28 “cluster” on a bare-metal, one-node Debian 11 machine using containerd, so this article will assume a similar setup, though I’ve tried to link to relevant resources for other setups.
In the future I’m planning to switch over to using Proxmox or something similar for virtualisation. When that time comes I’ll probably also update this article with a new configuration.
Configuration#
The prerequisites for the NVIDIA k8s-device-plugin are a functioning NVIDIA CUDA driver and the NVIDIA Container Toolkit installed on every node that will run GPU workloads.
CUDA Driver#
Before starting, remove any existing NVIDIA and Nouveau drivers by running
sudo apt-get autoremove cuda* nvidia* nouveau* --purge
and rebooting your computer.
Before installing the GPU driver, we need the appropriate kernel headers, which can be fetched by running
sudo apt-get install linux-headers-$(uname -r)
Next we add the keyring and repository for the CUDA driver
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
such that we can easily install the driver using apt-get
sudo apt-get update
sudo apt-get install cuda-drivers
Reboot and make sure that the driver is working by running
nvidia-smi
You should then be greeted with information about your connected GPU and driver version
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080        On  | 00000000:02:00.0 Off |                  N/A |
| 30%   37C    P8              12W / 180W |      1MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Container Toolkit#
Following the NVIDIA Container Toolkit installation guide for apt, we start by configuring the Container Toolkit package repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && \
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
before installing nvidia-container-toolkit
sudo apt-get update
sudo apt-get install nvidia-container-toolkit
containerd runtime#
Take a backup of your existing containerd configuration in case something goes wrong in the following steps
sudo cp /etc/containerd/config.toml /etc/containerd/config.toml.bak
We can then either configure containerd manually according to the k8s-device-plugin readme, or run
sudo nvidia-ctk runtime configure --runtime=containerd
to set up nvidia-container-runtime as the default low-level runtime for containerd.
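Remember to restart containerd afterwards so the new runtime configuration takes effect. If the nvidia runtime doesn’t end up as the default on your version of the toolkit, re-running the command with its --set-as-default flag should take care of that.
sudo systemctl restart containerd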
NVIDIA Device Plugin#
Having cleared all the prerequisites of installing a working CUDA driver, setting up the NVIDIA Container Toolkit, and configuring containerd to use the NVIDIA runtime, we can now apply the NVIDIA device plugin using its Helm chart.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--namespace nvidia-device-plugin \
--include-crds \
--create-namespace \
--version 0.14.3
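To verify that the device plugin came up correctly, check that its daemonset pod is running on the GPU node and skim its logs. The label selector below is an assumption based on the chart’s standard Helm labels, so fall back to the pod name from the first command if it doesn’t match.
kubectl get pods -n nvidia-device-plugin
kubectl logs -n nvidia-device-plugin -l app.kubernetes.io/name=nvidia-device-plugin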
Time-slicing (Optional)#
The default behaviour of the NVIDIA device plugin is to allocate the entire GPU to a single pod, meaning that if you have multiple pods requesting GPU-time, only one will be scheduled at a time.
To overcome this we can configure time-slicing of the GPU, meaning that the GPU will be shared between pods.
Configure time-slicing by first creating a ConfigMap with the following configuration, which allows a maximum of 10 replicas per GPU.
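# cm-time-slicing.yaml (the same ConfigMap is listed in full in the Summary below)
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-time-slicing
  namespace: nvidia-device-plugin
data:
  time-slicing: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 10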
We then apply the ConfigMap and configure nvidia-device-plugin to use it by referencing the ConfigMap name and the supplied default configuration key
kubectl apply -f cm-time-slicing.yaml
helm upgrade nvdp nvdp/nvidia-device-plugin \
--reuse-values \
--set config.name=cm-time-slicing \
--set config.default=time-slicing
You should now see a capacity of 10 nvidia.com/gpu on each node per GPU by running
kubectl get node -o 'jsonpath={.items[*].status.capacity}' | jq
{
...
"nvidia.com/gpu": "10",
...
}
Note that the workloads are granted replicas from the same GPU, and that each workload has access to the same GPU memory and runs in the same fault-domain, meaning that if one workload crashes, they all will.
More details about configuring the device plugin can be found in the readme on GitHub.
Running a workload#
Assuming the configuration went well, we can now try to run a test workload that uses the GPU by starting a pod which requests a GPU resource, as in the manifest sketched below.
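This is a minimal sketch of such a manifest; the sample image tag is an assumption, so substitute any CUDA vectorAdd sample image you have access to.
# cuda-vectoradd.yaml (sketch; the image tag is an assumption)
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
  namespace: cuda-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
      resources:
        limits:
          nvidia.com/gpu: "1"   # request one (time-sliced) GPU replica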
kubectl create ns cuda-test
kubectl apply -f cuda-vectoradd.yaml
If everything went well, the logs of the workload should read
kubectl logs -n cuda-test cuda-vectoradd
[Vector addition of 50000 elements]
...
Test PASSED
If everything works, it’s as easy as adding a resource limit for nvidia.com/gpu on each workload that you want to give access to GPU resources.
resources:
  limits:
    nvidia.com/gpu: "1"
Peeking inside a pod requesting GPU-resources we’ll also find two NVIDIA-related environment variables,
kubectl exec -it <pod> -- env | grep NVIDIA
NVIDIA_DRIVER_CAPABILITIES=compute,video,utility
NVIDIA_VISIBLE_DEVICES=GPU-<UUID>
which reveal that GPU-accelerated compute and video encoding/decoding are available in the pod.
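Since the runtime reads these variables, a workload that only needs compute can narrow what gets exposed by setting NVIDIA_DRIVER_CAPABILITIES explicitly in its container spec. A minimal sketch, with a hypothetical container name and image:
# Sketch: restrict a container to compute-only GPU access by overriding
# NVIDIA_DRIVER_CAPABILITIES (read by the NVIDIA container runtime).
containers:
  - name: my-gpu-app          # hypothetical name
    image: my-gpu-app:latest  # hypothetical image
    env:
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: "compute,utility"
    resources:
      limits:
        nvidia.com/gpu: "1"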
Troubleshooting#
If you get a pod startup error similar to
0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod..
it could mean you don’t have enough GPU resources available. Try increasing the number of time-slicing replicas from the Time-slicing section, or buy another GPU, whichever is more cost-efficient for you.
I’ve also experienced the error when multiple long-running workloads try to start after rebooting the node.
Restarting the nvidia-device-plugin pod and the workloads requesting GPU resources seems to fix the issue.
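A quick way to do that without deleting pods by hand is to roll the device plugin daemonset and then the affected workloads; the resource names below are placeholders for your own GPU workloads.
kubectl -n nvidia-device-plugin rollout restart daemonset
kubectl -n <workload-namespace> rollout restart deployment <gpu-workload>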
Using Argo CD, I’ve added a negative sync-wave annotation to make sure the nvidia-device-plugin is started before the workloads to avoid this issue
annotations:
  argocd.argoproj.io/sync-wave: "-1"
Addendum#
I first tried to use the NVIDIA GPU Operator, which I understood to be a catch-all solution that installs the device plugin as well as the drivers and container toolkit. However, I couldn’t get it to work, so I opted for the unfortunately more manual approach of installing the device plugin, drivers, and container toolkit as separate components.
It could be something with my setup, or I’ve misunderstood something in the documentation. If you have a solution I’d love to hear from you!
Summary#
I’m running Argo CD with Kustomize + Helm in an attempt to follow GitOps best-practices. My full homelab configuration as of the writing of this article can be found on GitHub as a reference.
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
commonAnnotations:
  argocd.argoproj.io/sync-wave: "-1"
resources:
  - namespace.yaml
  - cm-time-slicing.yaml
helmCharts:
  - name: nvidia-device-plugin
    repo: https://nvidia.github.io/k8s-device-plugin
    version: 0.14.2
    releaseName: "nvidia-device-plugin"
    namespace: nvidia-device-plugin
    includeCRDs: true
    valuesFile: values.yaml
# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nvidia-device-plugin
# values.yaml
config:
  name: cm-time-slicing
  default: time-slicing
# cm-time-slicing.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-time-slicing
  namespace: nvidia-device-plugin
data:
  time-slicing: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 10