NVIDIA driver update guide
Warning
This guide covers the gpu-standard-v3 (AMD EPYC™ with NVIDIA® Ampere® A100) and gpu-standard-v3i platforms.
For gpu-standard-v3i, you can only use an image with the NVIDIA 535 driver and Secure Boot support. The GPU driver cannot be updated on this platform; you can only update the CUDA libraries.
Supported drivers and recommendations
In Yandex Cloud, the gpu-standard-v3 (AMD EPYC™ with NVIDIA® Ampere® A100) and gpu-standard-v3i VMs are preconfigured with the NVIDIA 535 driver.
We recommend using this specific driver version; driver updates to other versions are not supported and may lead to unstable GPU performance.
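To confirm which driver version your VM is currently running, and to keep routine apt upgrades from replacing it, you can use a check like the sketch below. The `apt-mark hold` package names are examples and depend on how the driver was installed in your image, so verify them with `dpkg -l` first.

```bash
# Show the driver version the GPU is currently running
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# List the installed NVIDIA driver packages
dpkg -l | grep -E 'nvidia-driver|cuda-drivers'

# Optionally hold them so apt upgrade does not pull in another driver branch
# (package names below are examples; adjust them to the dpkg -l output)
sudo apt-mark hold nvidia-driver-535 cuda-drivers-535
```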
Shared NVSwitch virtualization model
We use the Shared NVSwitch virtualization model described in the NVIDIA Fabric Manager User Guide.
NVSwitch devices are passed through to a separate service VM and managed by the NVIDIA 535 driver. When a guest VM starts, its GPUs come preconfigured for NVLink; to preserve this configuration, software GPU resets from user VMs are not allowed in Yandex Cloud.
If you update the driver on the user VM to another version, e.g., 570, the driver may fail to recognize the current GPU state. This is an NVIDIA driver limitation, which is why we do not recommend changing the driver version on the user VM.
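To inspect the NVLink preconfiguration from inside a user VM without resetting anything, you can use read-only `nvidia-smi` queries. A minimal illustrative check:

```bash
# Show NVLink status for every GPU
nvidia-smi nvlink --status

# Show the GPU interconnect topology (NVLink connections appear as NV# entries)
nvidia-smi topo -m
```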
Why use driver version 535?
NVIDIA publishes multiple driver branches (see the NVIDIA Data Center Drivers Overview):
- LTSB (Long-Term Support Branch): Long-term support, security updates and fixes for 3 years.
- PB (Production Branch): Main branch for data centers.
- NFB (New Feature Branch): Drivers with new features.
Version 535 belongs to the LTSB; it has been validated and is supported in the Yandex Cloud infrastructure. Drivers from other branches have not passed this compatibility validation and may work incorrectly.
CUDA update
Often it is not a new driver that you need but a CUDA Toolkit update. In most cases, you do not need to update the driver: it is enough to install the required CUDA version together with the cuda-compat package, which provides compatibility with the 535 driver (see CUDA Forward Compatibility).
Ubuntu installation example
- Connect the NVIDIA CUDA repository:

  ```bash
  sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu$(lsb_release -rs | sed -e 's/\.//')/x86_64/3bf863cc.pub
  sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /"
  sudo apt update
  ```

- Install `cuda-compat` (example for CUDA 12.5):

  ```bash
  sudo apt install -y cuda-compat-12-5
  echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.5/compat:$LD_LIBRARY_PATH' >> ~/.bashrc && source ~/.bashrc
  ```

- Check the current configuration:

  ```bash
  nvidia-smi
  ```
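To confirm that the forward-compatibility libraries are in place, you can check that the compat directory exists and is on the library path. This is a minimal sanity check, assuming the CUDA 12.5 paths from the example above; it only verifies installation and the search path, not that a particular application loads the libraries:

```bash
# The compat directory should contain libcuda.so.* from the newer CUDA release
ls -l /usr/local/cuda-12.5/compat/

# The directory should be on the library search path set in ~/.bashrc
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep compat
```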
Issue with `sudo reboot` after updating the driver to a version higher than 535 and the recommended workaround
When you reinstall the driver and then run `sudo reboot`, the driver does not get unloaded correctly before the reboot. Since Yandex Cloud prohibits software GPU resets, the card is left in an invalid state. This does not cause any hardware issues, but the VM will operate incorrectly. Use the `yc compute instance restart` command instead of `sudo reboot`.
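For reference, restarting the VM through the platform looks roughly like this, run from a machine with the YC CLI configured (the instance name is a placeholder):

```bash
# Restart the VM via the Yandex Cloud CLI instead of rebooting from inside the guest
yc compute instance restart <instance-name-or-ID>
```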
This is why we do not recommend updating the driver to a version higher than 535. If you need to install a driver version higher than 535 and reboot the user VM, use the following workaround scenario:
- Install the driver.

  Script for Ubuntu:

  ```bash
  #!/bin/bash
  set -e

  # Fixing the architecture
  arch="x86_64"

  # Figuring out the Ubuntu version (20.04 -> ubuntu2004, 22.04 -> ubuntu2204, etc.)
  . /etc/os-release
  if [[ "$ID" != "ubuntu" ]]; then
    echo "This script is for Ubuntu only!"
    exit 1
  fi
  distro="ubuntu${VERSION_ID//./}"
  echo "Using the repository: $distro/$arch"

  # 1. Downloading the package with keys
  wget https://developer.download.nvidia.com/compute/cuda/repos/${distro}/${arch}/cuda-keyring_1.1-1_all.deb

  # 2. Installing the keys
  sudo dpkg -i cuda-keyring_1.1-1_all.deb || {
    echo "Failed to install cuda-keyring, performing alternative steps..."

    # 2a. Downloading the GPG key manually
    wget https://developer.download.nvidia.com/compute/cuda/repos/${distro}/${arch}/cuda-archive-keyring.gpg

    # 2b. Putting the key in the correct location
    sudo mv cuda-archive-keyring.gpg /usr/share/keyrings/cuda-archive-keyring.gpg

    # 2c. Connecting the CUDA repository manually
    echo "deb [signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg] \
    https://developer.download.nvidia.com/compute/cuda/repos/${distro}/${arch}/ /" \
      | sudo tee /etc/apt/sources.list.d/cuda-${distro}-${arch}.list
  }

  # 3. Updating the list of packages
  sudo apt update

  # 4. Installing NVIDIA drivers
  sudo apt install -y nvidia-open

  # 5. Installing the CUDA driver metapackage
  sudo apt install -y cuda-drivers
  ```
- Go through the next steps before you reboot the system via `sudo reboot`.

  Create a script named `/usr/libexec/manage-nvidia`:

  ```bash
  #!/bin/bash
  # Load or unload the NVIDIA kernel modules so the driver shuts down cleanly
  set -eu

  usage() {
    echo "usage: manage-nvidia (load|unload)"
    exit 1
  }

  [ $# -eq 1 ] || usage

  case "$1" in
    load)
      modprobe nvidia
      ;;
    unload)
      modprobe -r nvidia_uvm nvidia_drm nvidia_modeset nvidia
      ;;
    *)
      usage
      ;;
  esac
  ```
- Make the script executable:

  ```bash
  sudo chmod +x /usr/libexec/manage-nvidia
  ```
- Create a systemd unit named `/etc/systemd/system/manage-nvidia.service`:

  ```ini
  [Unit]
  Description=Manage NVIDIA driver
  Before=nvidia-persistenced.service

  [Service]
  Type=oneshot
  ExecStart=/usr/libexec/manage-nvidia load
  RemainAfterExit=true
  ExecStop=/usr/libexec/manage-nvidia unload
  StandardOutput=journal

  [Install]
  WantedBy=multi-user.target
  RequiredBy=nvidia-persistenced.service
  ```
- Reload the `systemd` configuration, configure `manage-nvidia` to autorun on boot, and start the service itself:

  ```bash
  sudo systemctl daemon-reload
  sudo systemctl enable --now manage-nvidia
  ```

  Expected output if the execution is successful:

  ```text
  Created symlink /etc/systemd/system/multi-user.target.wants/manage-nvidia.service → /etc/systemd/system/manage-nvidia.service.
  Created symlink /etc/systemd/system/nvidia-persistenced.service.requires/manage-nvidia.service → /etc/systemd/system/manage-nvidia.service.
  ```

  Check `nvidia-persistenced.service` for a dependency on `manage-nvidia.service`:

  ```bash
  sudo systemctl list-dependencies nvidia-persistenced | grep manage-nvidia
  ```

  Result:

  ```text
  ● ├─manage-nvidia.service
  ```

  Check the service status:

  ```bash
  sudo systemctl status manage-nvidia
  ```
With that done, during `sudo reboot` systemd will call ExecStop for manage-nvidia, the driver will be unloaded correctly, and rebooting will not invalidate the GPU.
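As an optional sanity check before relying on this at reboot time, you can verify the ordering that lets the unload happen cleanly. A small read-only sketch, using the `manage-nvidia` and `nvidia-persistenced` names configured above:

```bash
# The service should be active, so its ExecStop will run on shutdown
systemctl is-active manage-nvidia

# nvidia-persistenced should be ordered after manage-nvidia, so it stops first
# and releases the GPU before the kernel modules are unloaded
systemctl show nvidia-persistenced -p After | tr ' ' '\n' | grep manage-nvidia

# The NVIDIA kernel modules that the unload step will remove
lsmod | grep '^nvidia'
```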