Running the DeepSeek-R1 language model in a Yandex Compute Cloud GPU cluster

Updated on October 3, 2025

Note

Currently, GPU clusters are only available in the ru-central1-a and ru-central1-d availability zones. You can only add a VM to a GPU cluster from the same availability zone.

In this tutorial, you will create a GPU cluster with two VMs to run the DeepSeek-R1 language model.

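To run a language model in a GPU cluster:

1. Get your cloud ready.
2. Create a GPU cluster with two VMs.
3. Test cluster state.
4. Run the language model.
5. Test the language model performance.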

If you no longer need the resources you created, delete them.

Get your cloud ready

Navigate to the management console and log in to Yandex Cloud or create a new account.
On the Yandex Cloud Billing page, make sure you have a billing account linked and it has the ACTIVE or TRIAL_ACTIVE status. If you do not have a billing account, create one and link a cloud to it.

If you have an active billing account, you can navigate to the cloud page to create or select a folder for your infrastructure.

Learn more about clouds and folders here.

Make sure the cloud has enough quotas for the total number of GPU clusters, total number of Gen2 GPUs, amount of RAM, number of vCPUs, and SSD size to create the VMs. To do this, use Yandex Cloud Quota Manager.
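
As a rough sanity check before requesting quotas, the totals for this tutorial can be tallied from the VM settings it uses (two VMs, each with 8 GPUs and an 800 GB SSD). The sketch below is only illustrative: `required_totals` and `missing_quota` are hypothetical helpers, and the `available` figures are made-up placeholders, not values read from Quota Manager. vCPU and RAM amounts depend on the Gen2 configuration you select, so they are left out.

```python
# Per-VM settings taken from this tutorial; vCPU and RAM are omitted because
# they depend on the Gen2 configuration you select.
VM_COUNT = 2
GPUS_PER_VM = 8
SSD_GB_PER_VM = 800


def required_totals(vm_count=VM_COUNT):
    """Return the totals this tutorial needs across all VMs."""
    return {
        "gpu_clusters": 1,
        "gpus": vm_count * GPUS_PER_VM,
        "ssd_gb": vm_count * SSD_GB_PER_VM,
    }


def missing_quota(required, available):
    """Return how much of each resource is still missing (hypothetical helper)."""
    return {
        name: need - available.get(name, 0)
        for name, need in required.items()
        if need > available.get(name, 0)
    }


# Placeholder numbers for illustration only; check the real values in Quota Manager.
available = {"gpu_clusters": 1, "gpus": 8, "ssd_gb": 2000}
print(required_totals())
print(missing_quota(required_totals(), available))
```

An empty result from `missing_quota` would mean the placeholder quotas cover the GPU and disk totals; any remaining entries show the shortfall to request.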

Required paid resources

The infrastructure support costs include:

Fee for continuously running VMs and disks (see Yandex Compute Cloud pricing).

Create a GPU cluster with two VMs

Create a GPU cluster

Management console

In the management console, select the folder where you want to create a cluster.
In the list of services, select Compute Cloud.
In the left-hand panel, select GPU clusters.
Click Create a cluster.
In the Name field, enter cluster name: test-gpu-cluster.
In the Availability zone field, select the ru-central1-d availability zone.
Click Save.

Add two VMs to the cluster

Create the first VM:
Management console
1. In the left-hand panel, select Virtual machines.
2. Click Create virtual machine.
3. Under Boot disk image, select the Ubuntu 20.04 LTS Secure Boot CUDA 12.2 public image.
4. In the Availability zone field, select the ru-central1-d availability zone.
5. Under Disks and file storages, select the SSD disk type and specify its size: 800 GB.
6. Under Computing resources, navigate to the Custom tab and specify the platform, number of GPUs, and cluster:
  - Platform: Gen2.
  - GPU: 8.
  - GPU cluster: Select the test-gpu-cluster cluster you created earlier.
7. Under Access, select SSH key and specify the VM access credentials:
  - In the Login field, enter a username, e.g., ubuntu. Do not use root or other names reserved by the OS. To perform operations requiring root privileges, use the sudo command.
  - In the SSH key field, select the SSH key saved in your organization user profile.
    
    If there are no SSH keys in your profile or you want to add a new key:
    
    Click Add key.
    
    Enter a name for the SSH key.
    
    Select one of the following:
    
    Enter manually: Paste the contents of the public SSH key. You need to create an SSH key pair on your own.
    
    Load from file: Upload the public part of the SSH key. You need to create an SSH key pair on your own.
    
    Generate key: Automatically create an SSH key pair.
    
    When adding a new SSH key, an archive containing the key pair will be created and downloaded. In Linux or macOS-based operating systems, unpack the archive to the /home/<user_name>/.ssh directory. In Windows, unpack the archive to the C:\Users\<user_name>\.ssh directory. You do not need to additionally enter the public key in the management console.
    
    Click Add.
    
    The system will add the SSH key to your organization user profile. If the organization has disabled the ability for users to add SSH keys to their profiles, the added public SSH key will only be saved in the user profile inside the newly created resource.
8. Click Create VM.
Similarly, create the second VM.

Test cluster state

Optionally, you can:

Run the language model

Connect to both VMs over SSH.

Add the current user to the docker group:

```
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
```

Pull the SGLang image to both VMs:
```
docker pull lmsysorg/sglang:latest
```

Run this command on the first VM:

```
docker run --gpus all --device=/dev/infiniband \
  --ulimit memlock=-1 --ulimit stack=67108864 --shm-size 32g \
  --network=host -v ~/.cache/huggingface:/root/.cache/huggingface \
  --name sglang_multinode1 -e GLOO_SOCKET_IFNAME=eth0 \
  -it --rm --ipc=host lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 --tp 16 \
  --nccl-init-addr <IP_address_1>:30000 --nnodes 2 --node-rank 0 \
  --trust-remote-code --host 0.0.0.0 --port 30001 \
  --disable-radix --max-prefill-tokens 126000
```

Where IP_address_1 is the first VM internal IP address.

Run this command on the second VM:

```
docker run --gpus all --device=/dev/infiniband \
  --ulimit memlock=-1 --ulimit stack=67108864 --shm-size 32g \
  --network=host -v ~/.cache/huggingface:/root/.cache/huggingface \
  --name sglang_multinode2 -e GLOO_SOCKET_IFNAME=eth0 \
  -it --rm --ipc=host lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1 --tp 16 \
  --nccl-init-addr <IP_address_1>:30000 --nnodes 2 --node-rank 1 \
  --trust-remote-code --host 0.0.0.0 --port 30001 \
  --disable-radix --max-prefill-tokens 126000
```

Where IP_address_1 is the first VM internal IP address.
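
The two commands above differ only in the container name and the `--node-rank` value, and `--tp 16` equals the total GPU count (2 VMs with 8 GPUs each). A minimal sketch that generates both commands from those inputs might look as follows; `sglang_command` is a hypothetical helper, and the IP address `10.128.0.10` is a placeholder to replace with the internal IP address of the first VM.

```python
GPUS_PER_NODE = 8  # each VM in this tutorial has 8 GPUs


def sglang_command(node_rank, head_ip, nnodes=2):
    """Build the docker command for one node (hypothetical helper)."""
    tp = nnodes * GPUS_PER_NODE  # 2 x 8 = 16, matching --tp 16 in the tutorial
    parts = [
        "docker run --gpus all --device=/dev/infiniband",
        "--ulimit memlock=-1 --ulimit stack=67108864 --shm-size 32g",
        "--network=host -v ~/.cache/huggingface:/root/.cache/huggingface",
        f"--name sglang_multinode{node_rank + 1}",
        "-e GLOO_SOCKET_IFNAME=eth0 -it --rm --ipc=host lmsysorg/sglang:latest",
        "python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1",
        f"--tp {tp} --nccl-init-addr {head_ip}:30000",
        f"--nnodes {nnodes} --node-rank {node_rank}",
        "--trust-remote-code --host 0.0.0.0 --port 30001",
        "--disable-radix --max-prefill-tokens 126000",
    ]
    # No argument contains spaces, so a plain join keeps the shell syntax intact.
    return " ".join(parts)


HEAD_IP = "10.128.0.10"  # placeholder: internal IP address of the first VM
for rank in range(2):
    print(sglang_command(rank, HEAD_IP))
```

Both nodes point `--nccl-init-addr` at the first VM, so only the rank and the container name change between them.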

Wait for the server to start. Once it is ready, the following line appears in the container output:

```
The server is fired up and ready to roll!
```
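
If you prefer not to watch the logs, one option is to poll the server port from the first VM until it accepts connections. The sketch below uses only the Python standard library; `wait_for_port` is a hypothetical helper, and the default timeout and polling interval are arbitrary assumptions. An open port is only an approximation of readiness, so the log line remains the authoritative signal.

```python
import socket
import time


def wait_for_port(host, port, timeout=1800.0, interval=5.0):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait and try again.
            time.sleep(interval)
    return False
```

For example, `wait_for_port("127.0.0.1", 30001)` on the first VM would return `True` once the port given in `--port` is open.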

Test the language model performance

In a new session, connect to the first VM over SSH.

Install the openai package:

```
sudo apt update
sudo apt install python3-pip
pip install openai
```

Create a test_model.py script with the following contents:

```
import openai

# Connect to the OpenAI-compatible endpoint served on the first VM.
client = openai.Client(
    base_url="http://127.0.0.1:30001/v1", api_key="EMPTY")

# Send a single chat request to the model.
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0.3,
    max_tokens=1024,
)

# Print the text of the first returned choice.
print(response.choices[0].message.content)
```

Run the script:

```
python3 test_model.py
```

Model response example:

```
Here are three countries and their capitals:

1. **France** - Paris
2. **Japan** - Tokyo
3. **Brazil** - Brasília

Let me know if you'd like more examples! 😊
```
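
The script above only checks that the model answers. To get a rough throughput figure as well, one possible approach is to time the request and divide the number of generated tokens by the elapsed time. The sketch below is an illustration, not part of the tutorial: `tokens_per_second` and `timed_completion` are hypothetical helpers, and it assumes the server fills in the `usage.completion_tokens` field of the response.

```python
import time


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Generated tokens divided by wall-clock time for one request."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def timed_completion(client, prompt, max_tokens=1024):
    """Send one chat request and return (reply_text, tokens_per_second)."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    # Assumption: the endpoint reports token usage in the response.
    rate = tokens_per_second(response.usage.completion_tokens, elapsed)
    return response.choices[0].message.content, rate
```

With the `client` object from `test_model.py`, calling `timed_completion(client, "List 3 countries and their capitals.")` would return the reply together with a single-request tokens-per-second estimate. One request is a noisy measurement, so averaging several runs gives a steadier number.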

How to delete the resources you created

To stop paying for the resources you created:

Delete the VM instances in Compute Cloud.
Delete the GPU cluster in Compute Cloud.
