Yandex project
© 2025 Yandex.Cloud LLC
Configuring a fault-tolerant architecture in Yandex Cloud

Written by
Yandex Cloud
Updated at May 7, 2025
  • Get your cloud ready
    • Required paid resources
  • Set up a test bench
    • Create TodoList application containers
    • Deploy the infrastructure
    • Configure and run Load Testing Tool
  • Running scenarios
    • VM failure
    • Application failure
    • Availability zone failure
    • Application update
    • Scaling your DB
  • Deleting applications and cleaning up your environment

In this tutorial, you will configure a fault-tolerant architecture in Yandex Cloud and test it in different scenarios.

By fault tolerance, we mean the ability of a system to operate despite failures in one or more of its components.

To configure and test the architecture:

  1. Get your cloud ready.
  2. Set up a test bench.
  3. Run test scenarios.

If you no longer need the resources you created, delete them.

Get your cloud ready

Sign up in Yandex Cloud and create a billing account:

  1. Navigate to the management console and log in to Yandex Cloud or register a new account.
  2. On the Yandex Cloud Billing page, make sure you have a linked billing account with the ACTIVE or TRIAL_ACTIVE status. If you do not have a billing account, create one and link a cloud to it.

If you have an active billing account, you can navigate to the cloud page to create or select a folder for your infrastructure to operate in.

Learn more about clouds and folders.

Required paid resources

The cost of supporting a fault-tolerant Yandex Cloud architecture includes:

  • Fee for disks and continuously running VMs (see Yandex Compute Cloud pricing).
  • Fee for a continuously running Yandex Managed Service for PostgreSQL cluster (see Managed Service for PostgreSQL pricing).
  • Fee for a dynamic or static public IP address (see Yandex Virtual Private Cloud pricing).

Set up a test bench

How a test bench works:

  • The application is packaged into a Docker image and pushed to Yandex Container Registry.

    Docker images are deployed on four Container Optimized Image-based VMs. The VMs form a group and reside in two availability zones.

  • A DB cluster managed by Managed Service for PostgreSQL consists of two hosts residing in different availability zones.

  • The Load Testing Tool (you can find it on Yandex Cloud Marketplace) generates the workload applied to Yandex Network Load Balancer that distributes traffic across VMs.

Create TodoList application containers

To get the application ready to run in Yandex Cloud:

  1. Download and unpack the repository containing the demo application source code, Terraform specifications, and a failure simulation script.

  2. Navigate to the application directory in the repository:

    cd yandex-cloud-fault-tolerance-demo-master/app
    
  3. Get authenticated in Container Registry:

    yc container registry configure-docker
    
  4. Create a registry:

    yc container registry create --name todo-registry
    
  5. Build a Docker image tagged as v1, replacing <registry_ID> with the ID of the registry you created in the previous step:

    docker build . --tag cr.yandex/<registry_ID>/todo-demo:v1 --platform linux/amd64
    
  6. Create a Docker image tagged as v2 to test the application update:

    docker build . --build-arg COLOR_SCHEME=dark --tag cr.yandex/<registry_ID>/todo-demo:v2 --platform linux/amd64
    
  7. Push the Docker images to Container Registry:

    docker push cr.yandex/<registry_ID>/todo-demo:v1
    docker push cr.yandex/<registry_ID>/todo-demo:v2
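
The build and push steps above repeat the same image path with different tags. As a minimal sketch, a small helper can compose that path so the registry ID is typed only once. The image_ref function and the REGISTRY_ID variable are illustrative names and are not part of the demo repository.

```shell
# Compose the full image reference used in the docker build and docker push steps.
# image_ref and REGISTRY_ID are illustrative names, not part of the demo repository.
image_ref() {
  # $1: registry ID, $2: image tag (v1 or v2 in this tutorial)
  printf 'cr.yandex/%s/todo-demo:%s\n' "$1" "$2"
}

# Example (commented out because it needs Docker and a real registry ID):
# REGISTRY_ID=<registry_ID>
# docker push "$(image_ref "$REGISTRY_ID" v1)"
# docker push "$(image_ref "$REGISTRY_ID" v2)"
```

The registry ID is printed by the yc container registry create command; if you no longer have that output, yc container registry list shows it as well.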
    

Deploy the infrastructure

To prepare your Yandex Cloud application environment:

  1. Install Terraform.

  2. Navigate to the environment specification directory:

    cd ../terraform/app
    
  3. Initialize Terraform:

    terraform init
    
  4. Save the folder ID to the YC_FOLDER variable and the IAM token to the YC_TOKEN variable:

    export YC_FOLDER=<folder_ID>
    export YC_TOKEN=$(yc iam create-token)
    
  5. Generate a key to connect to a VM over SSH:

    ssh-keygen -t ed25519
    
  6. In the app/todo-service.tf file, specify the path to the public SSH key; the default value is ~/.ssh/id_ed25519.pub.

  7. Check the cloud quotas before deploying the required resources.

    Information about the number of new resources

    You will create the following resources:

    • A Virtual Private Cloud network with three subnets, one in each availability zone.
    • Two service accounts:
      • One with the editor role for managing an instance group.
      • Another with the container-registry.images.puller role for downloading a Docker image to a VM instance.
    • An instance group of four Container Optimized Image VMs in the ru-central1-b and ru-central1-d availability zones.
    • A Managed Service for PostgreSQL cluster with two hosts in the ru-central1-b and ru-central1-d availability zones.
    • A network load balancer distributing traffic across VM instances in the group.
  8. Deploy and run the application:

    terraform apply -var yc_folder=$YC_FOLDER -var yc_token=$YC_TOKEN -var user=$USER
    

    Where:

    • yc_folder: Folder where you will deploy the application.
    • yc_token: IAM token you received previously.

To access the application, navigate to the lb_address you received after running the terraform apply command.
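
The terraform apply command in this section reads YC_FOLDER, YC_TOKEN, and USER from the shell environment, so an empty variable surfaces only as a Terraform error. The following sketch shows one possible preflight check; require_vars is a hypothetical helper, not something the demo repository provides.

```shell
# Return non-zero if any of the named variables is unset or empty.
# require_vars is an illustrative helper, not part of the demo repository.
require_vars() {
  for required_name in "$@"; do
    # Indirectly read the variable whose name is stored in required_name.
    eval "required_value=\${$required_name:-}"
    if [ -z "$required_value" ]; then
      echo "missing variable: $required_name" >&2
      return 1
    fi
  done
  return 0
}

# Example (commented out because it calls Terraform):
# require_vars YC_FOLDER YC_TOKEN USER &&
#   terraform apply -var yc_folder=$YC_FOLDER -var yc_token=$YC_TOKEN -var user=$USER
```

The same check can be reused before the terraform apply and terraform destroy commands later in this tutorial.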

Configure and run Load Testing Tool

Warning

Before creating your Load Testing Tool, create TodoList application containers and deploy the infrastructure.

  1. Navigate to the Load Testing Tool specification directory:

    cd ../tank
    
  2. Initialize Terraform:

    terraform init
    
  3. In the tank/main.tf file, specify paths to the public and private SSH keys; the default values are ~/.ssh/id_ed25519.pub and ~/.ssh/id_ed25519, respectively.

  4. Deploy and run the VM:

    terraform apply -var yc_folder=$YC_FOLDER -var yc_token=$YC_TOKEN -var user=$USER -var overload_token=<overload_token>
    

    Where:

    • yc_folder: Folder where you will deploy Load Testing Tool.
    • yc_token: IAM token of the Load Testing Tool user.
    • overload_token: Token to connect to <overload.yandex.net>. To get this token, log in, click your profile at the top right, and select My api token from the drop-down menu.
  5. Connect to the new VM over SSH. You can find the connection address in the terraform apply command output:

    ssh <username>@<VM_IP_address>
    
  6. Run Load Testing Tool:

    sudo yandex-tank -c load.yaml
    
  7. Navigate to <overload.yandex.net> and find your running task there: Public tests → show my tests only.

Running scenarios

VM failure

VM failure is a scenario when the VM with your application is unavailable.

Possible causes:

  • Failure of the VM physical host.
  • You deleted the VM by mistake.

To simulate this failure, delete one of the VM instances from the group:

Management console
  1. In the management console, select your instance group folder.
  2. In the list of services, select Compute Cloud.
  3. In the left-hand panel, select Instance groups.
  4. Select the todo-ig group.
  5. Navigate to the Virtual machines panel.
  6. To the right of the VM you want to delete, click the options menu and select Delete.
  7. In the window that opens, click Delete.

Test bench reaction:

  1. The network load balancer and Instance Groups get information about the VM failure and exclude it from load balancing, redirecting its traffic to the remaining instances in the group.
  2. The Instance Groups service starts automatic recovery:
    1. Deletes the failed VM instance; in our scenario, the system skips this step because the instance is already deleted.
    2. Creates a new VM.
    3. Waits for the application to start on the new VM.
    4. Adds the new VM to the load balancing group.

The load balancer and Instance Groups need some time to detect the problem and stop routing traffic to the failed VM. During this interval, you may see Connection Timeout errors, shown as HTTP code 0 in the Quantities and HTTP codes charts of the Load Testing Tool monitoring application.

After the system removes the failed VM from the load balancing group, the other VMs take over its load.
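
Besides the Load Testing Tool charts, you could watch the same effect from any client by polling the load balancer address. The sketch below assumes curl is installed and that LB_ADDRESS holds the lb_address value from the terraform apply output; describe_status and probe are illustrative names. When curl gets no HTTP response at all, it prints 000 as the status code, which plays the same role as HTTP code 0 in the charts.

```shell
# Map an HTTP status code to the outcome discussed in this tutorial.
# describe_status and probe are illustrative names, not part of the demo repository.
describe_status() {
  case "$1" in
    000|0) echo "no response (connection error or timeout)" ;;
    200)   echo "ok" ;;
    500)   echo "internal server error" ;;
    503)   echo "service unavailable" ;;
    *)     echo "other" ;;
  esac
}

# Request a URL once and print only the status code.
# curl prints 000 when it receives no HTTP response within --max-time.
probe() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$1"
}

# Example loop (commented out because it needs a deployed test bench):
# while true; do
#   code=$(probe "http://$LB_ADDRESS/")
#   echo "$(date +%T) $code $(describe_status "$code")"
#   sleep 1
# done
```

Running the loop while you delete a VM should show a short run of non-200 lines, followed by a return to 200 once the remaining instances take over.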

Application failure

Application failure is a scenario when your application does not respond in time or works incorrectly.

Possible causes:

  • Memory leak
  • DB connectivity loss
  • Too many requests

According to health check settings, the Instance Groups service polls the grouped VM instances over HTTP. A healthy VM returns a 200 status code in response to the /healthy request. Otherwise, the Instance Groups service starts the recovery process.

To simulate an application failure, run this script from the yandex-cloud-fault-tolerance-demo-master repository:

fail_random_host.sh <instance_group_ID>

A random VM instance in the group will start returning a 503 error.

Test bench reaction:

  1. The Instance Groups service identifies the application failure and excludes the relevant VM instance from load balancing, redirecting its traffic to the remaining instances.
  2. The Instance Groups service starts automatic recovery:
    1. Restarts the failed VM.
    2. Waits for the application to start on the restarted VM.
    3. Adds the VM back to the load balancing group.

The Instance Groups service polls the VM several times before disabling traffic and starting the recovery process. During this interval, you may see Service Unavailable errors, shown as HTTP code 503 in the Quantities and HTTP codes charts of the Load Testing Tool monitoring application.

After the system removes the failed VM from the load balancing group, the other VMs take over its load.
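
The reason 503 responses reach clients for a short time can be illustrated with a simplified model of consecutive failed health checks. This is only a sketch: first_unhealthy_poll is an illustrative name, and the threshold of 3 is an assumed value; the actual thresholds come from the health check settings of the instance group.

```shell
# Print the number of the poll at which a VM would be marked unhealthy,
# or "none" if the failure streak never reaches the threshold.
# $1: consecutive failed polls required (assumed value, not the demo's setting)
# $2...: status codes returned by successive /healthy polls
first_unhealthy_poll() {
  hc_threshold=$1
  shift
  hc_streak=0
  hc_poll=0
  for hc_code in "$@"; do
    hc_poll=$((hc_poll + 1))
    if [ "$hc_code" = "200" ]; then
      hc_streak=0                     # a healthy answer resets the streak
    else
      hc_streak=$((hc_streak + 1))    # any other answer extends it
    fi
    if [ "$hc_streak" -ge "$hc_threshold" ]; then
      echo "$hc_poll"
      return 0
    fi
  done
  echo "none"
}

first_unhealthy_poll 3 200 503 503 503 200   # prints 4
```

In this model the VM keeps receiving traffic during polls 2 to 4, which is the window where the Load Testing Tool charts would show 503 responses.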

Availability zone failure

Availability zone failure is a scenario when multiple VMs in the same zone become unavailable.

Possible causes:

  • Data center outage
  • Data center maintenance

To move your resources to another data center:

Management console
  1. In the management console, select your instance group folder.
  2. In the list of services, select Compute Cloud.
  3. In the left-hand panel, select Instance groups.
  4. Select the todo-ig group.
  5. In the top-right corner, click Edit.
  6. Under Allocation, uncheck the ru-central1-b availability zone.
  7. Click Save.

Test bench reaction:

  1. The Instance Groups service excludes the ru-central1-b availability zone VMs from load balancing.
  2. The system deletes these VMs, creating new VMs in their stead in the ru-central1-d zone.
  3. The Instance Groups service adds new VMs to the load balancing group.

The number of VMs that can be created and deleted at one time depends on the deployment policy.

Removing VMs from the load balancing group can cause Connection Timeout errors, shown as HTTP code 0 in the Quantities and HTTP codes charts of the Load Testing Tool monitoring application.

After the system removes the ru-central1-b VMs from the load balancing group, the remaining VMs take over their load.
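
To confirm that all four VMs end up in ru-central1-d, you could tally the zone of each instance before and after the change. The sketch below reads one zone ID per line from standard input; how you obtain that list (for example, from the Virtual machines panel or from yc CLI output) is left open, and count_by_zone is an illustrative name.

```shell
# Read one availability zone ID per line and print "zone=count" for each zone.
# count_by_zone is an illustrative helper, not part of the demo repository.
count_by_zone() {
  sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# Example with the initial layout of the test bench (two VMs per zone):
printf '%s\n' ru-central1-b ru-central1-d ru-central1-b ru-central1-d | count_by_zone
```

For the initial layout this prints one line per zone with a count of 2; after the scenario completes, the same tally should contain only ru-central1-d.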

Application update

To update your application:

Management console
  1. In the management console, select your instance group folder.
  2. In the list of services, select Compute Cloud.
  3. In the left-hand panel, select Instance groups.
  4. Select the todo-ig group.
  5. In the top-right corner, click Edit.
  6. Under Instance template, click the options menu and select Edit.
  7. Under Boot disk image, navigate to the Container Solution tab.
  8. Select the required Docker container, click its options menu, and select Edit.
  9. In the window that opens, specify the application image tagged as v2 in the Docker image field.
  10. Click Apply.
  11. Click Save.
  12. Click Save on the Changing an instance group page.

Test bench reaction:

  1. The Instance Groups service excludes two VMs with the outdated application from load balancing, assigning them the RUNNING_OUTDATED status.
  2. The system deletes these VMs, creating new VMs with the new application version in their stead.
  3. The Instance Groups service adds new VMs to the load balancing group.
  4. The system repeats the operations above for two remaining VMs with the outdated application version.

Refresh the application page. If the network load balancer sends your request to an updated VM, you will see the dark theme application version.

The number of VMs that can be created and deleted at one time depends on the deployment policy.

Removing VMs from the load balancing group can cause Connection Timeout errors, shown as HTTP code 0 in the Quantities and HTTP codes charts of the Load Testing Tool monitoring application.

After the system removes the outdated VMs from the load balancing group, the remaining VMs take over their load.
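
The two-at-a-time replacement described above can be expressed as simple arithmetic: the number of update rounds is the group size divided by the batch size, rounded up. The sketch below is a simplified model that ignores other deployment policy settings; update_rounds is an illustrative name.

```shell
# Number of rounds needed to replace all VMs when a fixed number is replaced per round.
# $1: total VMs in the group, $2: VMs replaced at the same time
# update_rounds is an illustrative helper, not part of the demo repository.
update_rounds() {
  rounds_total=$1
  rounds_batch=$2
  # Integer ceiling division.
  echo $(( (rounds_total + rounds_batch - 1) / rounds_batch ))
}

update_rounds 4 2   # prints 2, matching the two rounds in this scenario
```

A smaller batch keeps more VMs serving traffic during the update at the cost of more rounds; for example, replacing one VM at a time in a group of four takes four rounds.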

Scaling your DB

You may need to scale your DB in the following cases:

  • Cluster performance is insufficient to handle requests.
  • The data requires more storage capacity.

To scale your DB:

Management console
  1. In the management console, select your DB cluster folder.
  2. In the list of services, select Managed Service for PostgreSQL.
  3. Select the todo-postgresql cluster.
  4. Click Edit.
  5. Under Host class, select s2.medium.
  6. Click Save changes.

Managed Service for PostgreSQL will start updating the cluster.

Switching between master and replica servers at the beginning and end of the update process can cause Internal Server Error responses, shown as HTTP code 500 in the Quantities and HTTP codes charts of the Load Testing Tool monitoring application.

After switching is complete, the cluster will process the requests correctly.
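
Because the errors during a master switch are short-lived, a client that retries a failed request after a brief pause will usually get through. The following sketch shows one way to do that in a shell script; retry is an illustrative helper, and the attempt count and delay in the example are assumed values, not recommendations from this tutorial.

```shell
# retry ATTEMPTS DELAY COMMAND...: rerun COMMAND until it succeeds or attempts run out.
# retry is an illustrative helper, not part of the demo repository.
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_n=1
  while :; do
    if "$@"; then
      return 0                        # the command succeeded
    fi
    if [ "$retry_n" -ge "$retry_attempts" ]; then
      return 1                        # give up after the last attempt
    fi
    retry_n=$((retry_n + 1))
    sleep "$retry_delay"
  done
}

# Example (commented out because it needs a deployed test bench and curl):
# retry 5 2 curl -fsS -o /dev/null "http://$LB_ADDRESS/"
```

In the example, the -f option makes curl exit with an error on HTTP responses of 400 and above, so a 500 response during the switch triggers another attempt.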

Deleting applications and cleaning up your environment

Warning

If you created a VM with Load Testing Tool, make sure to delete it first; otherwise, deleting the Virtual Private Cloud network will fail.

To delete the Load Testing Tool app, navigate to the yandex-cloud-fault-tolerance-demo-master/terraform/tank folder and run this command:

terraform destroy -var yc_folder=$YC_FOLDER -var yc_token=$YC_TOKEN -var user=$USER -var overload_token=not-used

To delete the TodoList application, navigate to the yandex-cloud-fault-tolerance-demo-master/terraform/app folder and run this command:

terraform destroy -var yc_folder=$YC_FOLDER -var yc_token=$YC_TOKEN -var user=$USER
