Adding a VM to a GPU cluster

Updated at April 22, 2025

In GPU clusters, you can only create VMs on the AMD EPYC™ with NVIDIA® Ampere® A100 platform (gpu-standard-v3) in the 8-GPU configuration. Before creating the VM, prepare a disk image with the drivers by following this guide, then specify that image when creating the VM.

Note

Currently, GPU clusters are only available in the ru-central1-a and ru-central1-d availability zones. A VM can only be added to a GPU cluster located in the same availability zone as the VM.
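
The zone constraint can be checked before any resources are created. The following is a minimal sketch and not part of the Yandex Cloud CLI: the helper names `zone_supports_gpu_cluster` and `can_join_gpu_cluster` are hypothetical, and the zone list is copied from the note above, so it may need updating.

```shell
# Zones copied from the note above (an assumption that may go out of date).
GPU_CLUSTER_ZONES="ru-central1-a ru-central1-d"

# Hypothetical helper: succeeds if the given zone is in the list above.
zone_supports_gpu_cluster() {
  local zone="$1" z
  for z in $GPU_CLUSTER_ZONES; do
    [ "$z" = "$zone" ] && return 0
  done
  return 1
}

# Hypothetical helper: succeeds only if the VM zone is supported
# and is the same as the zone of the GPU cluster.
can_join_gpu_cluster() {
  local vm_zone="$1" cluster_zone="$2"
  zone_supports_gpu_cluster "$vm_zone" && [ "$vm_zone" = "$cluster_zone" ]
}

# Example: both the VM and the cluster are in ru-central1-a.
if can_join_gpu_cluster "ru-central1-a" "ru-central1-a"; then
  echo "zone OK"
fi
```

The cluster zone is passed in as a plain argument here; how you look it up is left open.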

CLI

# Assumes the folder contains exactly one GPU cluster; otherwise, set the ID explicitly.
export YC_GPU_CLUSTER=$(yc compute gpu-cluster list --format=json | jq -r '.[].id')
# The zone must be the same as the availability zone of the GPU cluster.
export YC_ZONE="ru-central1-a"
export SUBNET_NAME="my-subnet-name"
export SUBNET_ID=$(yc vpc subnet get --name="$SUBNET_NAME" --format=json | jq -r '.id')
# Replace <image_ID_with_drivers> with the ID of the image you prepared.
yc compute instance create --name node-gpu-test \
  --create-boot-disk size=64G,image-id=<image_ID_with_drivers>,type=network-ssd \
  --ssh-key="$HOME/.ssh/id_rsa.pub" \
  --gpus 8 --cores 224 --memory=952G \
  --zone "$YC_ZONE" \
  --network-interface subnet-id="$SUBNET_ID",nat-ip-version=ipv4 \
  --platform gpu-standard-v3 \
  --gpu-cluster-id="$YC_GPU_CLUSTER"
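
The first command above stores every ID returned by `yc compute gpu-cluster list`, so the variable is empty when the folder has no GPU clusters and holds several lines when it has more than one. Below is a minimal sketch of a guard, assuming you want to stop before creating the VM in either case. `is_single_id` is a hypothetical helper, not a `yc` feature.

```shell
# Hypothetical helper: succeeds only if its argument is exactly one
# non-empty line, that is, a single ID.
is_single_id() {
  local value="$1"
  # Reject an empty value (no GPU cluster found).
  [ -n "$value" ] || return 1
  # Reject a value that contains a line break (several IDs).
  case "$value" in
    *"
"*) return 1 ;;
  esac
  return 0
}
```

For example, run `is_single_id "$YC_GPU_CLUSTER" || echo "Set YC_GPU_CLUSTER to a single cluster ID" >&2` before `yc compute instance create`.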

Terraform

If you do not have Terraform yet, install it and configure the Yandex Cloud provider.

In the Terraform configuration file, define the parameters of the resource you want to create:

provider "yandex" {
  zone = "ru-central1-a"
}

resource "yandex_compute_disk" "boot-disk" {
  name     = "<disk_name>"
  type     = "<disk_type>"
  zone     = "ru-central1-a"
  size     = "<disk_size>"
  image_id = "<image_ID_with_drivers>"
}

resource "yandex_compute_instance" "default" {
  name           = "vm-gpu"
  platform_id    = "gpu-standard-v3"
  zone           = "ru-central1-a"
  gpu_cluster_id = "<GPU_cluster_ID>"

  resources {
    cores  = 224
    memory = 952
    gpus   = 8
  }

  boot_disk {
    disk_id = yandex_compute_disk.boot-disk.id
  }

  network_interface {
    subnet_id = yandex_vpc_subnet.subnet-1.id
    nat       = true
  }

  metadata = {
    user-data = "#cloud-config\nusers:\n  - name: <username>\n    groups: sudo\n    shell: /bin/bash\n    sudo: 'ALL=(ALL) NOPASSWD:ALL'\n    ssh_authorized_keys:\n      - ${file("<path_to_public_SSH_key>")}"
  }
}

resource "yandex_vpc_network" "network-1" {
  name = "network1"
}

resource "yandex_vpc_subnet" "subnet-1" {
  name           = "subnet1"
  zone           = "<availability_zone>" # Must be the same zone as the VM
  v4_cidr_blocks = ["192.168.10.0/24"]
  network_id     = yandex_vpc_network.network-1.id
}

Where:

yandex_compute_disk: Boot disk description, where image_id is the ID of the image with the drivers.
gpu_cluster_id: GPU cluster ID. This is a required parameter.
yandex_vpc_network: Cloud network description.
yandex_vpc_subnet: Description of the subnet to create your VM in.

Note

If you already have suitable resources, such as a cloud network and subnet, you do not need to redefine them. Specify their names and IDs in the appropriate parameters.

For more information about the yandex_compute_instance resource properties, see the Terraform provider documentation.

Under metadata, specify your username and the path to your public SSH key. For more information, see VM metadata.

Create the resources:

In the terminal, change to the folder where you edited the configuration file.
Make sure the configuration file is correct using the command:
```
terraform validate
```
If the configuration is correct, the following message is returned:
```
Success! The configuration is valid.
```
Preview the changes Terraform will make:
```
terraform plan
```
The terminal will display a list of resources with parameters. No changes are made at this step. If the configuration contains errors, Terraform will point them out.
Apply the configuration changes:
```
terraform apply
```
Confirm the changes: type yes in the terminal and press Enter.

This will create a VM in the specified GPU cluster. You can check the new VM and its configuration using the management console or this CLI command:

yc compute instance get <VM_name>
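
A new VM may take some time to start. As a rough sketch, the check above can be wrapped in a polling loop. It assumes that `jq` is installed and that the JSON output of `yc compute instance get` includes a top-level `status` field that reads `RUNNING` once the VM is up. `wait_for_status` and `vm_status` are hypothetical helper names.

```shell
# Hypothetical helper: runs a status-printing command up to <attempts> times,
# sleeping <delay> seconds after each miss, until it prints <wanted>.
# Usage: wait_for_status <wanted> <attempts> <delay> <command> [args...]
wait_for_status() {
  local wanted="$1" attempts="$2" delay="$3"
  shift 3
  local i=0 status
  while [ "$i" -lt "$attempts" ]; do
    status="$("$@" 2>/dev/null)"
    [ "$status" = "$wanted" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Prints the status of the VM with the given name.
# Assumption: the JSON output has a top-level "status" field.
vm_status() {
  yc compute instance get "$1" --format=json | jq -r '.status'
}
```

For example, `wait_for_status RUNNING 30 10 vm_status node-gpu-test` polls every 10 seconds, up to 30 times, and returns a non-zero exit code if the VM never reports `RUNNING`.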
