Adding a VM to a GPU cluster
A GPU cluster only accepts VMs on the AMD EPYC™ with NVIDIA® Ampere® A100 platform (gpu-standard-v3) configured with 8 GPUs. Before creating the VM, prepare a disk image with drivers by following this guide, then specify that image when you create the VM.
Note
Currently, GPU clusters are only available in the ru-central1-a availability zone. You can only add a VM to a GPU cluster from the same availability zone.
# Get the GPU cluster ID (assumes the folder contains exactly one GPU cluster).
export YC_GPU_CLUSTER=$(yc compute gpu-cluster list --format=json | jq -r '.[].id')
export YC_ZONE="ru-central1-a"
export SUBNET_NAME="my-subnet-name"
# Look up the subnet ID by its name.
export SUBNET_ID=$(yc vpc subnet get --name="$SUBNET_NAME" --format=json | jq -r '.id')
# Create an 8-GPU VM in the cluster from the image with drivers.
yc compute instance create --name node-gpu-test \
  --create-boot-disk size=64G,image-id=<ID_of_image_with_drivers>,type=network-ssd \
  --ssh-key="$HOME/.ssh/id_rsa.pub" \
  --gpus 8 --cores 224 --memory=952G \
  --zone "$YC_ZONE" \
  --network-interface subnet-id="$SUBNET_ID",nat-ip-version=ipv4 \
  --platform gpu-standard-v3 \
  --gpu-cluster-id="$YC_GPU_CLUSTER"
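The jq filter used for YC_GPU_CLUSTER above prints one ID per line, so if the folder holds several GPU clusters, or none, the variable will not contain a single usable ID. As a minimal sketch, a pre-flight check could catch this before the VM is created. The require_single_id helper is hypothetical and not part of the yc CLI.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the yc CLI: succeeds only when VALUE
# is exactly one non-empty line, i.e. looks like a single resource ID.
require_single_id() {
  local name="$1" value="$2" count
  if [ -z "$value" ]; then
    echo "error: $name is empty" >&2
    return 1
  fi
  # Each ID sits on its own line, so the line count is the number of IDs.
  count=$(( $(printf '%s\n' "$value" | wc -l) ))
  if [ "$count" -ne 1 ]; then
    echo "error: $name holds $count values, expected exactly one" >&2
    return 1
  fi
}

# Example use before `yc compute instance create`:
#   require_single_id YC_GPU_CLUSTER "$YC_GPU_CLUSTER" || exit 1
#   require_single_id SUBNET_ID "$SUBNET_ID" || exit 1
```

If the check fails because several clusters exist, one option is to set YC_GPU_CLUSTER by hand to the ID of the cluster you want.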
If you don't have Terraform, install it and configure the Yandex Cloud provider.
- In the Terraform configuration file, describe the parameters of the resources to create:
provider "yandex" {
  zone = "ru-central1-a"
}

resource "yandex_compute_disk" "boot-disk" {
  name     = "<disk_name>"
  type     = "<disk_type>"
  zone     = "ru-central1-a"
  size     = "<disk_size>"
  image_id = "<ID_of_image_with_drivers>"
}

resource "yandex_compute_instance" "default" {
  name           = "vm-gpu"
  platform_id    = "gpu-standard-v3"
  zone           = "ru-central1-a"
  gpu_cluster_id = "<GPU_cluster_ID>"

  resources {
    cores  = "224"
    memory = "952"
    gpus   = "8"
  }

  boot_disk {
    disk_id = yandex_compute_disk.boot-disk.id
  }

  network_interface {
    subnet_id = yandex_vpc_subnet.subnet-1.id
    nat       = true
  }

  metadata = {
    user-data = "#cloud-config\nusers:\n  - name: <username>\n    groups: sudo\n    shell: /bin/bash\n    sudo: 'ALL=(ALL) NOPASSWD:ALL'\n    ssh_authorized_keys:\n      - ${file("<path_to_public_SSH_key>")}"
  }
}

resource "yandex_vpc_network" "network-1" {
  name = "network1"
}

resource "yandex_vpc_subnet" "subnet-1" {
  name           = "subnet1"
  zone           = "<availability_zone>"
  v4_cidr_blocks = ["192.168.10.0/24"]
  network_id     = yandex_vpc_network.network-1.id
}
Where:
- yandex_compute_disk: Boot disk description, where image_id is the ID of the image with the drivers.
- gpu_cluster_id: GPU cluster ID. This is a required parameter.
- yandex_vpc_network: Cloud network description.
- yandex_vpc_subnet: Description of the subnet to create your VM in.
Note
If you already have suitable resources, such as a cloud network and subnet, you do not need to describe them again. Use their names and IDs in the appropriate parameters.
For more information about the yandex_compute_instance resource parameters, see the Terraform provider documentation.
- Under metadata, specify your username and the path to your public SSH key. For more information, see VM metadata.
- Create resources:
- In the terminal, change to the folder where you edited the configuration file.
- Make sure the configuration file is correct using the command:
terraform validate
If the configuration is correct, the following message is returned:
Success! The configuration is valid.
- Run the command:
terraform plan
The terminal will display a list of resources with parameters. No changes are made at this step. If the configuration contains errors, Terraform will point them out.
- Apply the configuration changes:
terraform apply
- Confirm the changes: type yes in the terminal and press Enter.
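The validate, plan and apply steps above can also be chained in a small script. The following is a sketch under stated assumptions, not the documented procedure: the apply_gpu_config function and the gpu-vm.tfplan file name are illustrative, and it assumes the provider is already configured in the current folder. Note that applying a saved plan file skips the interactive yes prompt, so review the plan output before relying on it.

```shell
#!/usr/bin/env bash
# Sketch: validate, save a plan, then apply exactly that plan.
# The function stops at the first step that fails.
apply_gpu_config() {
  local plan_file="${1:-gpu-vm.tfplan}"
  terraform validate || return 1
  # -out saves the plan so the apply step uses exactly what was planned.
  terraform plan -out="$plan_file" || return 1
  # Applying a saved plan file does not prompt for confirmation.
  terraform apply "$plan_file"
}

# Example use, from the folder with the configuration file:
#   apply_gpu_config
```

Saving the plan to a file is a design choice: it guarantees that the apply step performs the same changes that the plan step displayed.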
Once Terraform applies the configuration, the VM is created in the specified GPU cluster. You can check the new VM and its configuration in the management console or with this CLI command:
yc compute instance get <VM_name>
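Rather than reading the command output by eye, the check can be scripted. The sketch below is hypothetical: vm_field and check_gpu_vm are illustrative names, and it assumes that the JSON output exposes the .status and .gpu_settings.gpu_cluster_id fields. Inspect the output of the command above with --format=json and adjust the paths if your output differs.

```shell
#!/usr/bin/env bash
# Print one field of the VM description.
# Assumption: the JSON printed with --format=json contains `.status`
# and `.gpu_settings.gpu_cluster_id`; adjust the paths if it does not.
vm_field() {
  yc compute instance get "$1" --format=json | jq -r "$2"
}

# Hypothetical check: the VM is running and attached to the expected cluster.
check_gpu_vm() {
  local vm_name="$1" expected_cluster="$2" status cluster
  status=$(vm_field "$vm_name" '.status')
  cluster=$(vm_field "$vm_name" '.gpu_settings.gpu_cluster_id')
  if [ "$status" != "RUNNING" ]; then
    echo "VM $vm_name has status '$status', expected RUNNING" >&2
    return 1
  fi
  if [ "$cluster" != "$expected_cluster" ]; then
    echo "VM $vm_name is in cluster '$cluster', expected '$expected_cluster'" >&2
    return 1
  fi
  echo "VM $vm_name is RUNNING in GPU cluster $cluster"
}

# Example use:
#   check_gpu_vm <VM_name> <GPU_cluster_ID>
```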