Using node groups with GPUs and no pre-installed drivers
You can use Managed Service for Kubernetes node groups for workloads on GPUs without pre-installed drivers. Use the GPU Operator to install a suitable driver version.
To prepare your cluster and Managed Service for Kubernetes node group without pre-installed drivers for running workloads:
- Install the GPU Operator.
- Check that drivers are installed correctly.
If you no longer need the resources you created, delete them.
Getting started
- If you do not have the Yandex Cloud command line interface yet, install and initialize it.
  The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
- Create security groups for the Managed Service for Kubernetes cluster and its node groups.
  Warning
  The configuration of security groups determines the performance and availability of the cluster and the services and applications running in it.
- Create a Managed Service for Kubernetes cluster with any suitable configuration. When creating it, specify the security groups prepared in advance.
- Create a node group on a platform with a GPU, and enable Do not install GPU drivers. Specify the security groups prepared in advance.
- Install kubectl and configure it to work with the created cluster.
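Once kubectl is configured, it can help to confirm that the cluster responds before installing anything. The following is a minimal sketch, not part of the official procedure: has_ready_node is a hypothetical helper name, and it assumes the default column layout of kubectl get nodes, where the second column holds the node status.

```shell
# Hypothetical helper: succeeds if at least one node reports the Ready status.
# Reads the output of `kubectl get nodes --no-headers` from stdin.
has_ready_node() {
  awk '$2 == "Ready" { found = 1 } END { exit !found }'
}

# Usage (assumes kubectl is already configured for the cluster):
#   kubectl get nodes --no-headers | has_ready_node && echo "Cluster is reachable"
```

The helper only parses text, so it can be tried on saved command output as well as on a live cluster.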
Install the GPU Operator
- Install Helm v3.8.0 or higher.
- Install the GPU Operator:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && \
helm repo update && \
helm install \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.version=<driver_version> \
  gpu-operator nvidia/gpu-operator
Where driver.version is the NVIDIA® driver version. You can omit this parameter, in which case the default version is used.
Note
For the Managed Service for Kubernetes AMD EPYC™ with NVIDIA® Ampere® A100 (gpu-standard-v3) node group platform, use driver version 515.48.07.
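Because the installation requires Helm v3.8.0 or higher, it may be worth checking the local version before running the commands above. The sketch below is one possible way to do that. helm_version_ok is a hypothetical helper name, and the comparison relies on sort -V from GNU coreutils, which is an assumption about the local environment.

```shell
# Hypothetical helper: succeeds if the given Helm version is 3.8.0 or higher.
# Accepts strings such as "v3.12.3+g3a31588", the format printed by `helm version --short`.
helm_version_ok() {
  required="3.8.0"
  current="${1#v}"          # drop the leading "v"
  current="${current%%+*}"  # drop build metadata after "+"
  # sort -V orders version strings; the required version must sort first (or be equal).
  lowest="$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n 1)"
  [ "$lowest" = "$required" ]
}

# Usage (assumes helm is on PATH):
#   helm_version_ok "$(helm version --short)" || echo "Upgrade Helm to v3.8.0 or higher"
```

A plain string comparison would rank 3.12 below 3.8, which is why the sketch uses a version-aware sort.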
Check that drivers are installed correctly
Get the nvidia-driver-daemonset pod logs:
DRIVERS_POD_NAME="$(kubectl get pods --namespace gpu-operator | grep nvidia-driver-daemonset | awk '{print $1}')" && \
kubectl --namespace gpu-operator logs "${DRIVERS_POD_NAME}"
They should contain a message saying that the driver has been installed successfully, e.g.:
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-535.54.03
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 535.54.03
...
Loading NVIDIA driver kernel modules...
+ modprobe nvidia
+ modprobe nvidia-uvm
+ modprobe nvidia-modeset
...
Done, now waiting for signal
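Instead of reading the log by eye, the check can be scripted. The following is a rough sketch that assumes the completion line matches the sample output above; driver_install_done is a hypothetical helper name, and the exact log wording may differ between driver versions.

```shell
# Hypothetical helper: succeeds if the driver log on stdin contains the
# completion message shown in the sample output.
driver_install_done() {
  grep -q "Done, now waiting for signal"
}

# Usage: check every driver pod, since a node group with several nodes runs one pod per node.
#   for pod in $(kubectl get pods --namespace gpu-operator --output name | grep nvidia-driver-daemonset); do
#     kubectl --namespace gpu-operator logs "$pod" | driver_install_done && echo "$pod: driver installed"
#   done
```

Looping over the pods avoids passing several pod names to a single kubectl logs call when the node group has more than one node.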
Now, you can run GPU-based workloads by following the Running workloads with GPUs guide.
Delete the resources you created
Some resources are not free of charge. Delete the resources you no longer need to avoid paying for them:
- Delete the Kubernetes cluster.
- If you created any service accounts, delete them.