NVIDIA® GPU support
Stackland enables you to provision NVIDIA® GPUs in a Stackland cluster using NVIDIA® GPU support, a component that automates the management of GPU resources and ensures their availability for workloads. NVIDIA® GPU support is an implementation of the NVIDIA® GPU Operator.
NVIDIA® GPU support use cases include:
- Auto-detection of GPUs on cluster nodes
- Provisioning GPUs as Kubernetes resources for pods
- Support for GPU virtualization technologies (multi-instance GPU or MIG)
- Support for NVLink to create GPU clusters
- GPU health monitoring and metric collection
NVIDIA® GPU support requires NVIDIA® GPU nodes to operate.
Main components
NVIDIA® driver
Version: 580.126
The NVIDIA® driver provides a low-level interface between the OS and GPU. The driver exposes the GPU hardware capabilities, manages device memory, and handles commands from applications.
NVIDIA® Container Toolkit
Version: 580.126
NVIDIA® Container Toolkit enables running GPU-accelerated containers. The toolkit integrates with the container runtime and provides GPU access to containers via the Container Device Interface (CDI). This component automatically configures the container environment, mounts the required libraries and devices, and manages GPU resource isolation across containers.
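For illustration, a CDI specification (the toolkit typically generates these under /etc/cdi/) describes each GPU as a named device whose container edits mount the required device nodes. The sketch below is representative rather than exact generated output; the device node paths shown are assumptions:

```json
{
  "cdiVersion": "0.5.0",
  "kind": "nvidia.com/gpu",
  "devices": [
    {
      "name": "0",
      "containerEdits": {
        "deviceNodes": [
          { "path": "/dev/nvidia0" }
        ]
      }
    }
  ],
  "containerEdits": {
    "deviceNodes": [
      { "path": "/dev/nvidiactl" }
    ]
  }
}
```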
NVIDIA® Fabric Manager
Version: 580.126
NVIDIA® Fabric Manager manages NVLink and NVSwitch in multi-GPU systems. This component ensures high-speed GPU interconnection, optimizes communication topology, and manages distributed memory in multi-GPU configurations.
NVIDIA® GPU Operator
Version: 25.10
The NVIDIA® GPU Operator automates GPU management in a Kubernetes cluster. It creates, configures, and manages the components required for GPU provisioning, including drivers, libraries, device plugins, and monitoring systems. The NVIDIA® GPU Operator uses custom resource definitions (CRDs) to manage the lifecycle of GPU components.
DCGM
NVIDIA® Data Center GPU Manager (DCGM) is a tool for monitoring and managing data center GPUs. DCGM collects performance, temperature, memory usage, and other GPU metrics.
DCGM Exporter
DCGM Exporter exports GPU metrics in Prometheus format. The monitoring component automatically collects metrics and exposes them for visualization in Grafana.
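The metrics are exposed in the plain Prometheus text format. As a rough sketch of what consuming that format looks like, the snippet below parses a hardcoded sample payload; DCGM_FI_DEV_GPU_UTIL is a real DCGM field name, but the label values and readings are made up:

```python
# Minimal sketch: parse Prometheus text-format lines such as those
# emitted by DCGM Exporter. The sample payload is illustrative;
# the label values and readings are invented for the example.
sample = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa"} 42
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb"} 17
"""

def parse_metrics(text):
    """Return {metric_with_labels: float} for non-comment lines."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        name_labels, value = line.rsplit(" ", 1)
        out[name_labels] = float(value)
    return out

metrics = parse_metrics(sample)
print(metrics['DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa"}'])  # 42.0
```

In practice you would not parse this by hand: Prometheus scrapes the exporter endpoint and Grafana queries Prometheus, as described below.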
GPU monitoring
DCGM Exporter automatically collects GPU metrics and makes them available in Grafana. Stackland provides prebuilt dashboards for GPU monitoring:
- NVIDIA® DCGM Dashboard: Overview dashboard with metrics of all cluster GPUs.
- NVIDIA® DCGM Dashboard with MIG metrics: Dashboard for MIG GPU monitoring.
- NVIDIA® DCGM Dashboard w/o MIG metrics: Dashboard for non-MIG GPU monitoring.
Using GPUs in pods
To use a GPU in a pod, specify the nvidia.com/gpu resource in the container specification:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:12.0-base
      resources:
        limits:
          nvidia.com/gpu: 1
Kubernetes will automatically place the pod on a node with an available GPU.
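Conceptually, the scheduler filters out nodes whose free nvidia.com/gpu count cannot satisfy the pod's limit. The sketch below illustrates only that filtering step; the node names and GPU counts are made up:

```python
# Illustrative sketch of GPU-aware node filtering: a pod requesting
# nvidia.com/gpu can only be placed on a node with enough unallocated
# GPUs. Node names and capacities are invented for the example.
def schedulable_nodes(nodes, gpu_request):
    """nodes: {name: {"allocatable": int, "allocated": int}}
    Return the names of nodes that can fit the GPU request."""
    return [
        name
        for name, res in nodes.items()
        if res["allocatable"] - res["allocated"] >= gpu_request
    ]

nodes = {
    "cpu-node":   {"allocatable": 0, "allocated": 0},  # no GPUs at all
    "gpu-node-1": {"allocatable": 4, "allocated": 4},  # fully occupied
    "gpu-node-2": {"allocatable": 4, "allocated": 1},  # 3 GPUs free
}
print(schedulable_nodes(nodes, 1))  # ['gpu-node-2']
```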
Configuration
MIG Manager settings
migManager:
  enabled: false
  strategy: "single"
  config:
    default: "all-disabled"
- enabled: Enables multi-instance GPU support.
- strategy: MIG strategy. The possible values are single to apply the same MIG configuration to all GPUs on the node or mixed to use different MIG configurations on different GPUs.
- config.default: Default MIG configuration.
To enable MIG support, set enabled to true and configure the GPU node:
kubectl label nodes my-node nvidia.com/mig.config=all-1g.5gb --overwrite
This command applies a MIG profile to my-node that partitions each of the node's GPUs into multiple independent GPU instances, each with one compute slice and 5 GB of video memory.
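As a back-of-the-envelope check, assuming an A100 40 GB (which exposes seven compute slices; other GPU models differ), the all-1g.5gb profile yields:

```python
# Rough arithmetic sketch for the all-1g.5gb profile. Assumption:
# an A100 40 GB with 7 compute slices, where each 1g.5gb MIG
# instance consumes 1 compute slice and 5 GB of memory.
COMPUTE_SLICES = 7
MEMORY_GB = 40
INSTANCE_SLICES, INSTANCE_MEM_GB = 1, 5

# Instances per physical GPU are bounded by both slices and memory.
instances = min(COMPUTE_SLICES // INSTANCE_SLICES,
                MEMORY_GB // INSTANCE_MEM_GB)
print(instances)  # 7
```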
To view all available MIG profiles, run the following command:
kubectl -n stackland-nvidia-gpu get cm default-mig-parted-config -o jsonpath='{.data.config\.yaml}'