Launching the DeepSeek-R1 language model in a Yandex Compute Cloud GPU cluster
Note
Currently, GPU clusters are only available in the `ru-central1-a` and `ru-central1-d` availability zones. You can only add a VM to a GPU cluster from the same availability zone.
In this tutorial, you will create a GPU cluster with two VMs to run the DeepSeek-R1 language model.
To run a language model in a GPU cluster:
- Get your cloud ready.
- Create a GPU cluster with two VMs.
- Test cluster state.
- Run the language model.
- Test the language model performance.
If you no longer need the resources you created, delete them.
Get your cloud ready
Sign up in Yandex Cloud and create a billing account:
- Navigate to the management console and log in to Yandex Cloud or register a new account.
- On the Yandex Cloud Billing page, make sure you have a billing account linked and that it has the `ACTIVE` or `TRIAL_ACTIVE` status. If you do not have a billing account, create one and link a cloud to it.

If you have an active billing account, you can navigate to the cloud page.

Learn more about clouds and folders.
Required paid resources
The infrastructure support costs include:
- Fee for continuously running VMs and disks (see Yandex Compute Cloud pricing).
Create a GPU cluster with two VMs
Create a GPU cluster
- In the management console, select the folder where you want to create a cluster.
- From the list of services, select Compute Cloud.
- In the left-hand panel, select GPU clusters.
- Click Create a cluster.
- In the Name field, enter the cluster name: `test-gpu-cluster`.
- In the Availability zone field, select the `ru-central1-d` availability zone.
- Click Save.
Add two VMs to the cluster
- Create the first VM:

  Management console

  - In the left-hand panel, select Virtual machines.
  - Click Create virtual machine.
  - Under Boot disk image, select the Ubuntu 20.04 LTS Secure Boot CUDA 12.2 public image.
  - In the Availability zone field, select the `ru-central1-d` availability zone.
  - Under Disks and file storages, select the `SSD` disk type and specify its size: `800 GB`.
  - Under Computing resources, navigate to the Custom tab and specify the platform, number of GPUs, and cluster:
    - Platform: `AMD Epyc 9474F with Gen2`.
    - GPU: `8`.
    - GPU cluster: select the previously created `test-gpu-cluster` cluster.
  - Under Access, select SSH key and specify the VM access credentials:
    - In the Login field, enter a username, e.g., `ubuntu`. Do not use `root` or other names reserved for OS purposes. To perform operations requiring root privileges, use the `sudo` command.
  - Click Create VM.

- Similarly, create the second VM.
Test cluster state
Optionally, you can test the state and performance of the new cluster before running the model.
Run the language model
- Add the current user to the `docker` group:

  ```bash
  sudo groupadd docker
  sudo usermod -aG docker $USER
  newgrp docker
  ```

- Pull the SGLang image to both VMs:

  ```bash
  docker pull lmsysorg/sglang:latest
  ```
- Run this command on the first VM:

  ```bash
  docker run --gpus all --device=/dev/infiniband --ulimit memlock=-1 \
    --ulimit stack=67108864 --shm-size 32g --network=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --name sglang_multinode1 -e GLOO_SOCKET_IFNAME=eth0 \
    -it --rm --ipc=host lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 \
      --tp 16 --nccl-init-addr <IP_address_1>:30000 --nnodes 2 --node-rank 0 \
      --trust-remote-code --host 0.0.0.0 --port 30001 --disable-radix \
      --max-prefill-tokens 126000
  ```

  Where `<IP_address_1>` is the internal IP address of the first VM.

- Run this command on the second VM:

  ```bash
  docker run --gpus all --device=/dev/infiniband --ulimit memlock=-1 \
    --ulimit stack=67108864 --shm-size 32g --network=host \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --name sglang_multinode2 -e GLOO_SOCKET_IFNAME=eth0 \
    -it --rm --ipc=host lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1 \
      --tp 16 --nccl-init-addr <IP_address_1>:30000 --nnodes 2 --node-rank 1 \
      --trust-remote-code --host 0.0.0.0 --port 30001 --disable-radix \
      --max-prefill-tokens 126000
  ```

  Where `<IP_address_1>` is still the internal IP address of the first VM.

- Wait for the server to start:

  ```text
  The server is fired up and ready to roll!
  ```
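The two launch commands differ only in `--node-rank`. As a toy sketch (our own illustration, not SGLang code), this is how `--tp 16` spreads tensor-parallel ranks across the two 8-GPU VMs, with `--node-rank` selecting each VM's contiguous block of ranks:

```python
# Illustrative sketch (not SGLang internals): with --nnodes 2 and 8 GPUs
# per VM, --tp 16 shards the model across all 16 GPUs; each VM hosts a
# contiguous block of tensor-parallel ranks selected by --node-rank.
NNODES = 2
GPUS_PER_NODE = 8
TP_SIZE = NNODES * GPUS_PER_NODE  # must match the --tp value (16)

def ranks_for_node(node_rank: int) -> list[int]:
    """Tensor-parallel ranks hosted by the VM with this --node-rank."""
    start = node_rank * GPUS_PER_NODE
    return list(range(start, start + GPUS_PER_NODE))

print(ranks_for_node(0))  # first VM:  [0, 1, 2, 3, 4, 5, 6, 7]
print(ranks_for_node(1))  # second VM: [8, 9, 10, 11, 12, 13, 14, 15]
```

This is why the `--tp` value must equal the total GPU count across both VMs: with any other value the two processes could not form a single tensor-parallel group.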
Test the language model performance
- In a new session, connect to the first VM over SSH.

- Install the `openai` package:

  ```bash
  sudo apt update
  sudo apt install python3-pip
  pip install openai
  ```

- Create a `test_model.py` script with the following contents:

  ```python
  import openai

  client = openai.Client(
      base_url="http://127.0.0.1:30001/v1",
      api_key="EMPTY",
  )

  response = client.chat.completions.create(
      model="default",
      messages=[
          {"role": "system", "content": "You are a helpful AI assistant"},
          {"role": "user", "content": "List 3 countries and their capitals."},
      ],
      temperature=0.3,
      max_tokens=1024,
  )

  print(response.choices[0].message.content)
  ```

- Run the script:

  ```bash
  python3 test_model.py
  ```

  Model response example:

  ```text
  Here are three countries and their capitals:

  1. **France** - Paris
  2. **Japan** - Tokyo
  3. **Brazil** - Brasília

  Let me know if you'd like more examples! 😊
  ```
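If you prefer not to install the `openai` package, you can send the same request with the Python standard library alone. This is a sketch under the assumption that the server exposes the usual OpenAI-compatible `/v1/chat/completions` route on port 30001; the helper names (`build_chat_request`, `ask`) are ours:

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       system: str = "You are a helpful AI assistant",
                       model: str = "default",
                       temperature: float = 0.3,
                       max_tokens: int = 1024):
    """Return the endpoint path and JSON-encoded body for a chat completion."""
    body = {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return "/v1/chat/completions", json.dumps(body).encode()

def ask(prompt: str, base_url: str = "http://127.0.0.1:30001") -> str:
    """POST the request to the running server and return the reply text."""
    path, data = build_chat_request(prompt)
    req = urllib.request.Request(
        base_url + path,
        data=data,
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer EMPTY"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("List 3 countries and their capitals."))
```

Run it on the first VM, where the server's port is reachable on localhost; from elsewhere, replace `base_url` with the VM's internal IP address.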
How to delete the resources you created
To stop paying for the resources you created: