Setting up Time-Slicing GPUs in Yandex Managed Service for Kubernetes
Time-slicing GPUs in Kubernetes lets several pods share a single physical GPU: the device plugin advertises each GPU as multiple replicas, and workloads take turns using the GPU in time slices.

To set up time-slicing GPUs in Managed Service for Kubernetes:

- Configure time-slicing GPUs.
- Test time-slicing GPUs.

If you no longer need the resources you created, delete them.
Required paid resources
The support cost for this solution includes:
- Fee for using the master and outgoing traffic in a Managed Service for Kubernetes cluster (see Managed Service for Kubernetes pricing).
- Fee for using computing resources, OS, and storage in cluster nodes (VMs) (see Compute Cloud pricing).
- Fee for a public IP address assigned to cluster nodes (see Virtual Private Cloud pricing).
Getting started
- If you do not have the Yandex Cloud CLI installed yet, install and initialize it.

  By default, the CLI uses the folder specified when you created the profile. To change the default folder, use the `yc config set folder-id <folder_ID>` command. You can also set a different folder for any specific command using the `--folder-name` or `--folder-id` parameter.

- Create security groups for the Managed Service for Kubernetes cluster and its node groups.
  Warning

  The configuration of security groups determines the performance and availability of the cluster and the services and applications running in it.
- Create a Managed Service for Kubernetes cluster. When creating it, specify the preconfigured security groups.

- Create a Managed Service for Kubernetes node group with the NVIDIA® Tesla® T4 GPU and the preconfigured security groups.

- Install kubectl and configure it to work with the new cluster. A minimal command sketch for this step follows the list.
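As a quick reference, here is a minimal sketch of pointing kubectl at the new cluster with the Yandex Cloud CLI. The cluster name `k8s-gpu-cluster` is a placeholder; substitute the name of your own cluster.

```
# Fetch the cluster credentials and add them to the local kubeconfig.
# --external uses the cluster's public endpoint; use --internal for access
# from inside the cloud network.
yc managed-kubernetes cluster get-credentials k8s-gpu-cluster --external

# Verify that kubectl talks to the right cluster and that the nodes are Ready.
kubectl cluster-info
kubectl get nodes
```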
Configure time-slicing GPUs
- Create a time-slicing configuration:

  - Create the `time-slicing-config.yaml` file with the following content:

    ```
    ---
    kind: Namespace
    apiVersion: v1
    metadata:
      name: gpu-operator
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: time-slicing-config
      namespace: gpu-operator
    data:
      a100-80gb: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 5
      tesla-t4: |-
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 5
    ```

    With `replicas: 5`, each physical GPU is advertised to Kubernetes as five `nvidia.com/gpu` resources, so up to five pods can share one GPU.
  - Run this command:

    ```
    kubectl create -f time-slicing-config.yaml
    ```

    Result:

    ```
    namespace/gpu-operator created
    configmap/time-slicing-config created
    ```
- Install the GPU Operator:

  ```
  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && \
  helm repo update && \
  helm install \
    --namespace gpu-operator \
    --create-namespace \
    --set devicePlugin.config.name=time-slicing-config \
    gpu-operator nvidia/gpu-operator
  ```
- Apply the time-slicing configuration to your Managed Service for Kubernetes cluster or node group:

  Managed Service for Kubernetes cluster

  ```
  kubectl patch clusterpolicies.nvidia.com/cluster-policy \
    --namespace gpu-operator \
    --type merge \
    --patch='{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "tesla-t4"}}}}'
  ```

  Managed Service for Kubernetes node group

  ```
  yc managed-kubernetes node-group add-labels <node_group_ID_or_name> \
    --labels nvidia.com/device-plugin.config=tesla-t4
  ```

  You can get the Managed Service for Kubernetes node group ID and name with the list of node groups in the folder.
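Optionally, you can check that the GPU Operator and the time-slicing configuration took effect. This is a minimal sketch under the assumption that the configuration above (five replicas per GPU) was applied; `<GPU_node_name>` is a placeholder for one of your GPU nodes.

```
# The GPU Operator components should be running in the gpu-operator namespace.
kubectl get pods --namespace gpu-operator

# The time-slicing ConfigMap created earlier should be present.
kubectl get configmap time-slicing-config --namespace gpu-operator

# On a GPU node, the allocatable nvidia.com/gpu count is expected to show 5
# (five time-sliced replicas of the single physical Tesla T4).
kubectl describe node <GPU_node_name> | grep nvidia.com/gpu
```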
Test time-slicing GPUs
- Create a test app:

  - Save the following app specification to a YAML file named `nvidia-plugin-test.yml`.

    Deployment is the Kubernetes API object that manages a replicated application.

    ```
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nvidia-plugin-test
      labels:
        app: nvidia-plugin-test
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: nvidia-plugin-test
      template:
        metadata:
          labels:
            app: nvidia-plugin-test
        spec:
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
          containers:
            - name: dcgmproftester11
              image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
              command: ["/bin/sh", "-c"]
              args:
                - while true; do /usr/bin/dcgmproftester11 --no-dcgm-validation -t 1004 -d 300; sleep 30; done
              resources:
                limits:
                  nvidia.com/gpu: 1
              securityContext:
                capabilities:
                  add: ["SYS_ADMIN"]
    ```
  - Run this command:

    ```
    kubectl apply -f nvidia-plugin-test.yml
    ```

    Result:

    ```
    deployment.apps/nvidia-plugin-test created
    ```
- Make sure all five of the app's Managed Service for Kubernetes pods are in the `Running` state:

  ```
  kubectl get pods | grep nvidia-plugin-test
  ```
- Run the `nvidia-smi` command in the running `nvidia-container-toolkit` Managed Service for Kubernetes pod:

  ```
  kubectl exec <nvidia-container-toolkit_pod_name> \
    --namespace gpu-operator -- nvidia-smi
  ```

  Result:

  ```
  Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
  Thu Jan 26 09:42:51 2023
  +-----------------------------------------------------------------------------+
  | NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: N/A      |
  |-------------------------------+----------------------+----------------------+
  | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  |                               |                      |               MIG M. |
  |===============================+======================+======================|
  |   0  Tesla T4            Off  | 00000000:8B:00.0 Off |                    0 |
  | N/A   72C    P0    70W /  70W |   1579MiB / 15360MiB |    100%      Default |
  |                               |                      |                  N/A |
  +-------------------------------+----------------------+----------------------+

  +-----------------------------------------------------------------------------+
  | Processes:                                                                  |
  |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
  |        ID   ID                                                   Usage      |
  |=============================================================================|
  |    0   N/A  N/A     43108      C   /usr/bin/dcgmproftester11        315MiB  |
  |    0   N/A  N/A     43211      C   /usr/bin/dcgmproftester11        315MiB  |
  |    0   N/A  N/A     44583      C   /usr/bin/dcgmproftester11        315MiB  |
  |    0   N/A  N/A     44589      C   /usr/bin/dcgmproftester11        315MiB  |
  |    0   N/A  N/A     44595      C   /usr/bin/dcgmproftester11        315MiB  |
  +-----------------------------------------------------------------------------+
  ```
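The five `dcgmproftester11` processes in the output all run on the same physical Tesla T4, which is the expected result of time-slicing. When you finish testing, you can remove the test app; a minimal sketch, assuming the deployment was created from `nvidia-plugin-test.yml` as shown above:

```
# Delete the test deployment and its five pods.
kubectl delete -f nvidia-plugin-test.yml

# Confirm the pods are gone.
kubectl get pods | grep nvidia-plugin-test
```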
Delete the resources you created
Some resources are not free of charge. Delete the resources you no longer need to avoid paying for them:
- Delete the Managed Service for Kubernetes cluster.
- If you created any service accounts, delete them.
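If you prefer the CLI, the cluster can be deleted with a single command. A minimal sketch, where `k8s-gpu-cluster` is the placeholder cluster name used earlier:

```
# Deleting the cluster also removes its node groups and the VMs behind them.
yc managed-kubernetes cluster delete k8s-gpu-cluster
```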