To run GPU workloads on the EWC Kubernetes Service, two prerequisites must be met:




> kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-tpwr4                                       2/2     Running     0             107s
gpu-operator-745ccb5b94-dzxvk                                     1/1     Running     0             3m19s
gpu-operator-gpu-operator-node-feature-discovery-master-6fpj76g   1/1     Running     0             3m19s
gpu-operator-gpu-operator-node-feature-discovery-worker-6hk95     1/1     Running     0             3m19s
gpu-operator-gpu-operator-node-feature-discovery-worker-jb2v8     1/1     Running     0             3m18s
nvidia-container-toolkit-daemonset-7gsz7                          1/1     Running     2 (86s ago)   111s
nvidia-cuda-validator-pqt4b                                       0/1     Completed   0             46s
nvidia-dcgm-exporter-hmxx8                                        1/1     Running     0             108s
nvidia-device-plugin-daemonset-2kxfq                              2/2     Running     0             110s
nvidia-device-plugin-validator-ss74n                              0/1     Completed   0             29s
nvidia-operator-validator-6tglx                                   1/1     Running     0             111s
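Once all operator pods are Running or Completed, the device plugin should have registered the GPU with the scheduler. As a quick sanity check (a sketch; node names and GPU counts will differ per cluster), you can list the allocatable `nvidia.com/gpu` resource per node:

```shell
# Show each node and how many NVIDIA GPUs it advertises as allocatable.
# The nvidia.com/gpu resource name is the one registered by the NVIDIA device plugin;
# the dot in the key must be escaped in the custom-columns JSONPath.
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```

A node showing `<none>` in the GPUS column has no GPU advertised yet, so pods requesting `nvidia.com/gpu` would stay Pending.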
> cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: vector-add
spec:
  restartPolicy: OnFailure
  containers:
  - name: vector-add
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
> kubectl logs pod/vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
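`Test PASSED` confirms the pod was scheduled onto a GPU node and could execute a CUDA kernel. Real workloads request GPUs the same way, via a `nvidia.com/gpu` resource limit. A minimal sketch of the same CUDA sample wrapped in a Job (the Job name is illustrative; the image is the one used above):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: vector-add-job   # illustrative name
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: vector-add
        image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
        resources:
          limits:
            nvidia.com/gpu: 1   # one whole GPU; GPUs are not fractionally shared by default
```

When you are done testing, remove the validation pod with `kubectl delete pod vector-add` so it does not hold the GPU reservation.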