Setting up Time-Slicing GPUs
The Time-Slicing GPUs plugin in Kubernetes lets several pods share a single physical GPU: the device plugin advertises each GPU as multiple replicas, and the workloads scheduled on them take turns using the GPU.

To install the Time-Slicing GPUs plugin in Managed Service for Kubernetes:

- Configure Time-Slicing GPUs.
- Test Time-Slicing GPUs functionality.

If you no longer need the resources you created, delete them.
Getting started
- If you do not have the Yandex Cloud command line interface yet, install and initialize it.

  The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.

- Create security groups for the Managed Service for Kubernetes cluster and its node groups.

  Warning

  The configuration of security groups determines the performance and availability of the cluster, as well as the services and applications running in it.

- Create a Managed Service for Kubernetes cluster. When creating it, specify the security groups prepared in advance.

- Create a Managed Service for Kubernetes node group with an NVIDIA® Tesla® T4 GPU and the security groups prepared earlier.

- Install kubectl and configure it to work with the created cluster.
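For example, one minimal way to point kubectl at the new cluster is to fetch its credentials with the yc CLI and then check connectivity. This is a sketch; gpu-cluster is a placeholder cluster name, not a name from this tutorial:

```bash
# Add the cluster credentials to the local kubeconfig; --external uses
# the cluster's public endpoint ("gpu-cluster" is a placeholder name).
yc managed-kubernetes cluster get-credentials gpu-cluster --external

# Check that kubectl can reach the cluster.
kubectl cluster-info
```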
Configure Time-Slicing GPUs
- Create a time-slicing configuration:

  - Prepare the time-slicing-config.yaml file with the following content:

    ```yaml
    ---
    kind: Namespace
    apiVersion: v1
    metadata:
      name: gpu-operator
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: time-slicing-config
      namespace: gpu-operator
    data:
      a100-80gb: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 5
      tesla-t4: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 5
    ```

  - Run this command:

    ```bash
    kubectl create -f time-slicing-config.yaml
    ```

    Result:

    ```text
    namespace/gpu-operator created
    configmap/time-slicing-config created
    ```
- Install the GPU Operator:

  ```bash
  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && \
  helm repo update && \
  helm install \
    --namespace gpu-operator \
    --create-namespace \
    --set devicePlugin.config.name=time-slicing-config \
    gpu-operator nvidia/gpu-operator
  ```
- Apply the time-slicing configuration to the Managed Service for Kubernetes cluster or node group:

  For the Managed Service for Kubernetes cluster:

  ```bash
  kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    --namespace gpu-operator \
    --type merge \
    --patch='{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "tesla-t4"}}}}'
  ```

  For a Managed Service for Kubernetes node group:

  ```bash
  yc managed-kubernetes node-group add-labels <node_group_name_or_ID> \
    --labels nvidia.com/device-plugin.config=tesla-t4
  ```

  You can get the ID and name of the Managed Service for Kubernetes node group from the list of node groups in your cluster.
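To check that the configuration took effect, you can, for example, list your node groups and then inspect the allocatable nvidia.com/gpu resources on a GPU node. This is a sketch; <node_name> is a placeholder. With the tesla-t4 profile above (replicas: 5), a node with one physical Tesla® T4 should report 5 allocatable GPUs.

```bash
# List the node groups in the folder to find the name or ID used above.
yc managed-kubernetes node-group list

# Inspect a GPU node; with time-slicing applied, Capacity/Allocatable
# should show nvidia.com/gpu: 5 instead of 1 (<node_name> is a placeholder).
kubectl describe node <node_name> | grep "nvidia.com/gpu"
```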
Test Time-Slicing GPUs functionality
- Create a test app:

  - Save the following app creation specification to a YAML file named nvidia-plugin-test.yml.

    Deployment is the Kubernetes API object that manages the replicated application.

    ```yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nvidia-plugin-test
      labels:
        app: nvidia-plugin-test
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: nvidia-plugin-test
      template:
        metadata:
          labels:
            app: nvidia-plugin-test
        spec:
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
          containers:
            - name: dcgmproftester11
              image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
              command: ["/bin/sh", "-c"]
              args:
                - while true; do /usr/bin/dcgmproftester11 --no-dcgm-validation -t 1004 -d 300; sleep 30; done
              resources:
                limits:
                  nvidia.com/gpu: 1
              securityContext:
                capabilities:
                  add: ["SYS_ADMIN"]
    ```

  - Run this command:

    ```bash
    kubectl apply -f nvidia-plugin-test.yml
    ```

    Result:

    ```text
    deployment.apps/nvidia-plugin-test created
    ```
- Make sure that all five of the app's Managed Service for Kubernetes pods have the Running status:

  ```bash
  kubectl get pods | grep nvidia-plugin-test
  ```
- Run the nvidia-smi command in the running Managed Service for Kubernetes nvidia-container-toolkit pod:

  ```bash
  kubectl exec <nvidia-container-toolkit_pod_name> \
    --namespace gpu-operator -- nvidia-smi
  ```

  Result:

  ```text
  Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
  Thu Jan 26 09:42:51 2023
  +-----------------------------------------------------------------------------+
  | NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: N/A      |
  |-------------------------------+----------------------+----------------------+
  | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  |                               |                      |               MIG M. |
  |===============================+======================+======================|
  |   0  Tesla T4            Off  | 00000000:8B:00.0 Off |                    0 |
  | N/A   72C    P0    70W /  70W |   1579MiB / 15360MiB |    100%      Default |
  |                               |                      |                  N/A |
  +-------------------------------+----------------------+----------------------+

  +-----------------------------------------------------------------------------+
  | Processes:                                                                  |
  |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
  |        ID   ID                                                   Usage      |
  |=============================================================================|
  |    0   N/A  N/A     43108      C   /usr/bin/dcgmproftester11         315MiB |
  |    0   N/A  N/A     43211      C   /usr/bin/dcgmproftester11         315MiB |
  |    0   N/A  N/A     44583      C   /usr/bin/dcgmproftester11         315MiB |
  |    0   N/A  N/A     44589      C   /usr/bin/dcgmproftester11         315MiB |
  |    0   N/A  N/A     44595      C   /usr/bin/dcgmproftester11         315MiB |
  +-----------------------------------------------------------------------------+
  ```
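If you need a value for <nvidia-container-toolkit_pod_name> in the command above, one option is to list the GPU Operator pods and pick the container toolkit one; a minimal sketch:

```bash
# List GPU Operator pods; pass the pod whose name starts with
# "nvidia-container-toolkit" to the kubectl exec command above.
kubectl get pods --namespace gpu-operator | grep nvidia-container-toolkit
```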
Delete the resources you created
Some resources are not free of charge. Delete the resources you no longer need to avoid paying for them:
- Delete the Managed Service for Kubernetes cluster.
- If you created any service accounts, delete them.
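For example, if you manage the cluster with the yc CLI, you can also delete it from the command line; this is a sketch, and gpu-cluster is a placeholder cluster name:

```bash
# Delete the Managed Service for Kubernetes cluster by name or ID
# ("gpu-cluster" is a placeholder).
yc managed-kubernetes cluster delete gpu-cluster
```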