Using node groups with GPUs and no pre-installed drivers
You can use Managed Service for Kubernetes node groups for workloads on GPUs without pre-installed drivers. In this case, use the GPU Operator to install a suitable driver version.
To prepare your cluster and a Managed Service for Kubernetes node group without pre-installed drivers for running workloads:
- Install the GPU Operator.
- Check that the drivers are installed correctly.
If you no longer need the resources you created, delete them.
Required paid resources
The support cost includes:
- Fee for using the master and outgoing traffic in a Managed Service for Kubernetes cluster (see Managed Service for Kubernetes pricing).
- Fee for using computing resources, OS, and storage in cluster nodes (VMs) (see Compute Cloud pricing).
- Fee for a public IP address assigned to cluster nodes (see Virtual Private Cloud pricing).
Getting started
- If you do not have the Yandex Cloud CLI installed yet, install and initialize it.

  By default, the CLI uses the folder specified when creating the profile. To change the default folder, use the `yc config set folder-id <folder_ID>` command. You can also set a different folder for any specific command using the `--folder-name` or `--folder-id` parameter.

- Create security groups for the Managed Service for Kubernetes cluster and its node groups.
Warning
The configuration of security groups determines the performance and availability of the cluster and the services and applications running in it.
- Create a Managed Service for Kubernetes cluster with any suitable configuration. When creating the cluster, specify the security groups you prepared earlier.
- Create a Managed Service for Kubernetes node group with the following settings:
- Computing resources: Navigate to the GPU tab and select a suitable platform.
- Do not install GPU drivers: Enable this option.
- Security groups: Select the security groups you created earlier.
- Node taints: Specify the `nvidia.com/gpu=true:NoSchedule` taint policy.
- Install kubectl and configure it to work with the new cluster.
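Once kubectl is configured, you can run a quick sanity check. This is a minimal sketch (the `custom-columns` expression only prints taint keys) that confirms the cluster is reachable and the nodes of the GPU group carry the taint you specified:

```
# List the nodes together with the keys of any taints set on them.
# Nodes from the GPU node group should show the nvidia.com/gpu key.
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
```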
Component version requirements
Starting with Kubernetes 1.30, for a node group with GPUs and no pre-installed drivers to work correctly, you will need the following:
- GPU Operator `24.9.0` or higher.
- NVIDIA driver `550.144.03` or higher.
Older component versions may cause driver compilation errors.
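Once the GPU Operator is installed (see the next section), you can check both requirements. This is a minimal sketch that assumes the `gpu-operator` release and namespace names used in this guide:

```
# Kubernetes version of the nodes (the requirements above apply starting with 1.30)
kubectl get nodes

# Chart version of the installed GPU Operator release
helm list --namespace gpu-operator
```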
Install the GPU Operator
- Install Helm v3.8.0 or higher.
- Install the GPU Operator:
```
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && \
helm repo update && \
helm install \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.version=<driver_version> \
  gpu-operator nvidia/gpu-operator
```

Where `driver.version` is the NVIDIA® driver version. You can omit the driver version parameter. In this case, the default version will be used.

Note
Recommended driver versions:
- For node groups with Kubernetes 1.30 or higher: `550.144.03` or higher.
- For node groups on AMD EPYC™ with NVIDIA® Ampere® A100 (`gpu-standard-v3`): `515.48.07`.
The GPU Operator will be installed with the default parameters. Learn more about the parameters in the official documentation.

Tip
You can view parameter values in the Helm chart's `values.yaml` configuration file. To do this, download the Helm chart archive using the `helm pull --untar nvidia/gpu-operator` command.
- Check that the drivers are installed correctly. Get the `nvidia-driver-daemonset` pod logs:
```
DRIVERS_POD_NAME="$(kubectl get pods --namespace gpu-operator | grep nvidia-driver-daemonset | awk '{print $1}')" && \
kubectl --namespace gpu-operator logs "${DRIVERS_POD_NAME}"
```
They should contain a message saying that the driver has been installed successfully, e.g.:
```
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-<driver_version>
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 <driver_version>
...
Loading NVIDIA driver kernel modules...
+ modprobe nvidia
+ modprobe nvidia-uvm
+ modprobe nvidia-modeset
...
Done, now waiting for signal
```
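To double-check the driver version and confirm the GPU is exposed as a schedulable resource, you can query the driver container directly. This is a minimal sketch that reuses the `DRIVERS_POD_NAME` variable from the previous step and the `nvidia-driver-ctr` container name shown in the log above:

```
# Print the loaded driver version from inside the driver container
kubectl --namespace gpu-operator exec "${DRIVERS_POD_NAME}" \
  --container nvidia-driver-ctr -- nvidia-smi

# Check that the nvidia.com/gpu resource and taint appear on the nodes
kubectl describe nodes | grep 'nvidia.com/gpu'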
Now, you can run GPU-based workloads by following the Running workloads with GPUs guide.
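As an illustrative smoke test (not part of the linked guide; the pod name and image tag below are examples only), you can run a pod that tolerates the `nvidia.com/gpu=true:NoSchedule` taint set on the node group and requests one GPU:

```
# Example smoke-test pod: tolerates the node group taint and requests one GPU.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: cuda
      # Example image; any image with nvidia-smi on board will do.
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF

# Once the pod completes, its logs should show the nvidia-smi output
kubectl logs gpu-smoke-test
```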
Troubleshooting
Driver compilation errors
If you get compilation errors when installing drivers:
- Make sure you are running GPU Operator 24.9.0 or higher:

```
helm list -n gpu-operator
```
- Use precompiled drivers:

```
helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set driver.usePrecompiled=true \
  --set driver.version=550.144.03
```
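After switching to precompiled drivers, you can watch the driver pods being recreated; a minimal check, assuming the same `gpu-operator` namespace:

```
kubectl --namespace gpu-operator get pods --watch
```

Once the pods are running again, repeat the log check from the installation section.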
Delete the resources you created
Some resources are not free of charge. Delete the resources you no longer need to avoid paying for them:
- Delete the Kubernetes cluster.
- If you created service accounts, delete them.
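For example, with the Yandex Cloud CLI (the resource names are placeholders; check `yc managed-kubernetes cluster delete --help` and `yc iam service-account delete --help` for the exact syntax):

```
# Delete the Managed Service for Kubernetes cluster
yc managed-kubernetes cluster delete <cluster_name>

# Delete a service account you created for this setup, if any
yc iam service-account delete <service_account_name>
```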