Running the vLLM library with the Gemma 3 language model on a Yandex Compute Cloud VM instance with a GPU

Written by
Yandex Cloud
Updated at May 13, 2025
  • Get your cloud ready
    • Required paid resources
  • Get access to the Gemma 3 model
  • Create a VM with a GPU
  • Run the language model
  • Test the language model performance
  • How to delete the resources you created

In this tutorial, you will create a VM instance with a single GPU to run Gemma 3, a lightweight multimodal language model.

To run a language model:

  1. Get your cloud ready.
  2. Get access to the Gemma 3 model.
  3. Create a VM with a GPU.
  4. Run the language model.
  5. Test the language model performance.

If you no longer need the resources you created, delete them.

Get your cloud ready

Make sure your cloud has sufficient quotas to create the VM: the number of GPUs on the AMD EPYC™ 9474F with Gen2 platform, the amount of RAM, the number of vCPUs, and the SSD size. To check and increase quotas, use Yandex Cloud Quota Manager.

Required paid resources

The infrastructure support cost includes a fee for continuously running VMs and disks (see Yandex Compute Cloud pricing).

Get access to the Gemma 3 model

  1. Sign up for Hugging Face.

  2. Create an access token:

    1. After logging into your account, click your avatar → Settings → Access Tokens.
    2. Click + Create new token.
    3. Select the Read token type.
    4. Enter a token name.
    5. Click Create token.
    6. Copy the token value.
  3. Request access to the Gemma-3-27b-it model:

    1. Go to the model page.
    2. Click Request access.
    3. Accept the license terms.
    4. Wait for access confirmation.
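
Optionally, you can check that the token works and that access to the gated model has been granted before creating the VM. Below is a minimal sketch (not part of the main steps) that assumes the huggingface_hub Python package is installed on your local machine:

    # pip install huggingface_hub
    from huggingface_hub import HfApi

    api = HfApi(token="<HF_token>")  # the Read token you created above

    # Confirms the token is valid by printing your Hugging Face username
    print(api.whoami()["name"])

    # Fails with an access error if your Gemma 3 request is still pending
    print(api.model_info("google/gemma-3-27b-it").id)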

Create a VM with a GPU

Management console
  1. In the management console, select the folder where you want to create your VM.

  2. In the list of services, select Compute Cloud.

  3. In the left-hand panel, select Virtual machines.

  4. Click Create virtual machine.

  5. Under Boot disk image, select the Ubuntu 20.04 LTS Secure Boot CUDA 12.2 public image.

  6. In the Availability zone field, select the ru-central1-d availability zone.

  7. Under Disks and file storages, select the SSD disk type and set the size to at least 500 GB.

  8. Under Computing resources, navigate to the Custom tab and specify the platform and number of GPUs:

    • Platform: AMD EPYC™ 9474F with Gen2.
    • GPU: 1.
  9. Under Access, select SSH key and specify the VM access credentials:

    • In the Login field, enter a username, e.g., ubuntu. Do not use root or other usernames reserved by the OS. To perform operations that require root privileges, use the sudo command.
    • In the SSH key field, select the SSH key saved in your organization user profile.

      If there are no saved SSH keys in your profile, or you want to add a new key:

      • Click Add key.
      • Enter a name for the SSH key.
      • Upload or paste the contents of the public key file. You need to create the SSH key pair yourself.
      • Click Add.

      The SSH key will be added to your organization user profile.

      If adding SSH keys to user profiles is disabled in your organization, the public key you add will only be saved to the VM being created.

  10. Click Create VM.

Run the language model

  1. Connect to the VM over SSH.

  2. Add the current user to the docker group:

    sudo groupadd docker
    sudo usermod -aG docker $USER
    newgrp docker
    
  3. Set the following variables:

    TOKEN=<HF_token>
    MODEL=google/gemma-3-27b-it
    MODEL_OPTS="--max-num-seqs 256 --max-model-len 16384 --gpu-memory-utilization 0.98 --max-num-batched-tokens 2048"
    

    Where <HF_token> is the Hugging Face access token you created earlier. The MODEL_OPTS values are standard vLLM server options: --max-num-seqs caps the number of concurrently processed sequences, --max-model-len sets the maximum context length, --gpu-memory-utilization sets the fraction of GPU memory vLLM may allocate, and --max-num-batched-tokens limits how many tokens are processed per scheduler batch.

  4. Run this command:

    docker run --runtime nvidia --gpus '"device=0"' \
    --name vllm-gemma3-0 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$TOKEN" \
    --env "HF_HUB_ENABLE_HF_TRANSFER=0" \
    --env "PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True" \
    -p 8000:8000 \
    --ipc=host \
    --shm-size=32g \
    vllm/vllm-openai:latest \
    --model $MODEL $MODEL_OPTS
    
  5. Wait for the server to start:

    INFO:     Started server process [1]
    INFO:     Waiting for application startup.
    INFO:     Application startup complete.
    
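
Optionally, before running the full test below, you can confirm that the OpenAI-compatible API responds. This is a minimal sketch (not part of the original steps) that uses only the Python standard library; run it on the VM in a separate SSH session:

    import json
    import urllib.request

    # vLLM's OpenAI-compatible server lists the models it serves at /v1/models
    with urllib.request.urlopen("http://127.0.0.1:8000/v1/models") as resp:
        data = json.load(resp)

    for model in data["data"]:
        print(model["id"])  # should print google/gemma-3-27b-it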

Test the language model performance

  1. Connect to the VM over SSH in a new session.

  2. Install the openai package:

    sudo apt update
    sudo apt install python3-pip
    pip install openai
    
  3. Create a test_model.py script with the following contents:

    import openai

    # The vLLM server exposes an OpenAI-compatible API, so the regular
    # openai client works; the api_key value is not checked by default.
    client = openai.Client(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

    # Send a multimodal request: an image URL plus a text instruction
    response = client.chat.completions.create(
        model="google/gemma-3-27b-it",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                        },
                    },
                    {"type": "text", "text": "Describe this image in one sentence."},
                ],
            }
        ],
        temperature=0.3,
        max_tokens=128,
    )
    print(response.choices[0].message.content)
    
  4. Run the script:

    python3 test_model.py
    

    Model response example:

    Here's a one-sentence description of the image:
    
    The Statue of Liberty stands prominently on Liberty Island with the Manhattan skyline, including the Empire State Building, visible in the background across the water on a clear, sunny day.
    
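
The script above verifies multimodal output. For a rough throughput estimate, you can additionally time a text-only request; the sketch below is an optional addition (not from the original tutorial) that relies on the token usage counts the OpenAI-compatible API returns:

    import time

    import openai

    client = openai.Client(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")

    start = time.perf_counter()
    response = client.chat.completions.create(
        model="google/gemma-3-27b-it",
        messages=[{"role": "user", "content": "Describe the Statue of Liberty in one sentence."}],
        temperature=0.3,
        max_tokens=128,
    )
    elapsed = time.perf_counter() - start

    # The server reports how many tokens it generated for this completion
    tokens = response.usage.completion_tokens
    print(f"{tokens} tokens in {elapsed:.2f} s ({tokens / elapsed:.1f} tokens/s)")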

How to delete the resources you created

To stop paying for the resources, delete the VM you created in Compute Cloud.
