© 2025 Direct Cursus Technology L.L.C.
Yandex Compute Cloud

Running parallel tasks in a GPU cluster

Written by
Yandex Cloud
Updated on April 18, 2025
  1. Connect to each VM in the cluster over SSH.

  2. Install Open MPI and NCCL on each VM:

    sudo apt-get update
    sudo apt-get install openmpi-bin libnccl2 libnccl-dev
    
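Before moving on, it can help to confirm the toolchain is actually in place. A quick, optional check (not part of the official steps; it only inspects the tools and library installed above):

```shell
# Optional sanity check: verify the Open MPI tools are on PATH and NCCL
# is visible to the dynamic linker before building the tests.
missing=""
for tool in mpirun mpicc; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
if [ -z "$missing" ]; then
    echo "Open MPI toolchain: OK"
else
    echo "Open MPI toolchain: missing$missing"
fi
if ldconfig -p 2>/dev/null | grep -q libnccl; then
    echo "NCCL runtime: OK"
else
    echo "NCCL runtime: not found"
fi
```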
  3. On the main VM:

    1. Clone the test repository:

      git clone https://github.com/NVIDIA/nccl-tests
      
    2. Build the tests:

      cd nccl-tests
      MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi MPI=1 make
      
    3. Copy the all_reduce_perf binary from the build folder to all cluster VMs. The file must reside in the same directory on every VM as on the main one (e.g., ~/nccl-tests/build/).
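Copying the binary by hand to each VM gets tedious as the cluster grows. The copy step can be sketched as a small helper, assuming SSH access between the VMs is already available; the function name and the example addresses are illustrative, not part of the guide:

```shell
# Illustrative helper: copy all_reduce_perf into the same directory on
# each worker VM as on the main one.
distribute_binary() {
    local bin="$HOME/nccl-tests/build/all_reduce_perf"
    local host
    for host in "$@"; do
        ssh "$host" 'mkdir -p ~/nccl-tests/build'   # ensure the target directory exists
        scp "$bin" "${host}:~/nccl-tests/build/"    # copy the benchmark binary
    done
}
# Example (placeholder addresses): distribute_binary 10.0.0.2 10.0.0.3
```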

  4. On the main VM, create a passwordless SSH key pair:

    ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""
    
  5. Copy the id_ed25519 and id_ed25519.pub files to the ~/.ssh directory on all VMs.

  6. Add the public key to authorized_keys on each VM:

    cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
    
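Steps 5 and 6 can likewise be scripted from the main VM, assuming you can already reach the other VMs over SSH (for example, with a key added at VM creation). The function name and example addresses below are illustrative:

```shell
# Illustrative helper: push the key pair to each VM and authorize it.
propagate_key() {
    local host
    for host in "$@"; do
        scp ~/.ssh/id_ed25519 ~/.ssh/id_ed25519.pub "${host}:~/.ssh/"
        ssh "$host" 'cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys'
    done
}
# Example (placeholder addresses): propagate_key 10.0.0.2 10.0.0.3
```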
  7. Run this command on the main VM:

    mpirun --host <IP_1>:8,<IP_2>:8 \
        --allow-run-as-root -np 16 \
        -mca btl_tcp_if_include eth0 \
        ~/nccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1
    

    Where:

    • IP_1:8 and IP_2:8: IP addresses of the two VMs, each followed by the number of GPUs on that VM
    • np: Total number of processes, (number_of_VMs) × (GPUs_per_VM); here, 2 × 8 = 16

    Result:

    #                                                              out-of-place                       in-place
    #       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
    #        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
       536870912     134217728     float     sum      -1   8476.3   63.34  118.76      0   8573.0   62.62  117.42      0
      1073741824     268435456     float     sum      -1    16512   65.03  121.92      0    16493   65.10  122.07      0
      2147483648     536870912     float     sum      -1    32742   65.59  122.98      0    32757   65.56  122.92      0
      4294967296    1073741824     float     sum      -1    65409   65.66  123.12      0    65376   65.70  123.18      0
      8589934592    2147483648     float     sum      -1   132702   64.73  121.37      0   133186   64.50  120.93      0
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 121.467
    
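For larger clusters, listing every address in --host becomes unwieldy. Open MPI also accepts a hostfile; a sketch with placeholder IPs, mirroring the flags of the command above:

```shell
# Create a hostfile listing each VM and its GPU slot count
# (placeholder IPs; one line per VM).
cat > hostfile <<'EOF'
10.0.0.1 slots=8
10.0.0.2 slots=8
EOF
# Then the benchmark can be launched as:
#   mpirun --hostfile hostfile --allow-run-as-root -np 16 \
#       -mca btl_tcp_if_include eth0 \
#       ~/nccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1
```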
