Configuring a fault-tolerant architecture in Yandex Cloud
In this guide, you'll learn how to configure a fault-tolerant architecture in Yandex Cloud and check how it works using various test cases.
Fault tolerance is the property that enables a system to continue operating after a failure of one or more of its components.
To configure and test the architecture:
- Prepare your cloud.
- Set up a test bench.
- Run the scenarios.
If you no longer need these resources, delete them.
Prepare your cloud
Sign up for Yandex Cloud and create a billing account:
- Go to the management console and log in to Yandex Cloud or create an account if you do not have one yet.
- On the Yandex Cloud Billing page, make sure you have a billing account linked and that it has the ACTIVE or TRIAL_ACTIVE status. If you do not have a billing account, create one.
If you have an active billing account, you can go to the cloud page.
Learn more about clouds and folders.
Required paid resources
The cost for supporting a fault-tolerant Yandex Cloud architecture includes:
- A fee for the disks and the continuously running VMs (see Yandex Compute Cloud pricing).
- A fee for the continuously running Yandex Managed Service for PostgreSQL cluster (see Managed Service for PostgreSQL pricing).
- A fee for using a dynamic or static public IP address (see Yandex Virtual Private Cloud pricing).
Set up a test bench
Description of the test bench
- The application is packaged into a Docker image and pushed to Yandex Container Registry. The Docker images are deployed on four VMs based on a Container Optimized Image. The VMs form an instance group and are located in two different availability zones.
- The DB cluster is managed by Managed Service for PostgreSQL and consists of two hosts that reside in different availability zones.
- The Load Testing Tool application from Yandex Cloud Marketplace generates the workload that is applied to Yandex Network Load Balancer. The network load balancer distributes traffic across the VMs.
Create TodoList app containers
To prepare the application to run in Yandex Cloud:
- Authenticate in Container Registry:
  yc container registry configure-docker
- Create a registry:
  yc container registry create --name todo-registry
- Build a Docker image with the v1 tag:
  docker build . --tag cr.yandex/<registry_id>/todo-demo:v1
- Build a Docker image with the v2 tag (to test the application update scenario):
  docker build . --build-arg COLOR_SCHEME=dark --tag cr.yandex/<registry_id>/todo-demo:v2
- Push the Docker images to Container Registry:
  docker push cr.yandex/<registry_id>/todo-demo:v1
  docker push cr.yandex/<registry_id>/todo-demo:v2
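To confirm that both tags reached the registry, you can list its images (an optional check, assuming the yc CLI is installed and authorized):

```bash
# List the images stored in the registry; both the v1 and v2 tags should appear.
yc container image list --registry-name todo-registry
```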
Deploy the infrastructure
To prepare the environment for running the application in Yandex Cloud:
- Install Terraform.
- Download the repository with the demo application source code, the Terraform specs, and a script that simulates an application failure.
- Go to the directory with the environment specification:
  cd app
- Initialize Terraform in the spec directory:
  terraform init
- In the app/todo-service.tf file, specify the path to the public SSH key (the default value is ~/.ssh/id_ed25519.pub).
- Deploy and run the application:
  terraform apply -var yc_folder=<folder_id> -var yc_token=<yc_token> -var user=$USER
  Where:
  - folder_id: The ID of the folder to deploy the application in.
  - yc_token: The OAuth token of the user to deploy the application under.
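If you are not sure which values to pass, the yc CLI can show them (a quick check, assuming the CLI is installed and initialized with yc init):

```bash
# Show the active profile settings, including the default folder ID and token.
yc config list

# Or list all folders available to you in the current cloud.
yc resource-manager folder list
```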
The following resources are created:
- A Virtual Private Cloud network with three subnets, one in each availability zone.
- Two service accounts:
  - A service account with the editor role for managing the VM instance group.
  - A service account with the container-registry.images.puller role for pulling Docker images to the VMs.
- An instance group of four Container Optimized Image-based VM instances in the ru-central1-a and ru-central1-b availability zones.
- A Managed Service for PostgreSQL cluster with two hosts in the ru-central1-a and ru-central1-b availability zones.
- A network load balancer that distributes traffic across the VM instances in the group.
To access the application, open the lb_address value returned by the terraform apply command.
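A quick way to check the deployment from the command line is to probe the load balancer with curl. This is a sketch: it assumes the spec exposes the address as a Terraform output named lb_address and that the app answers plain HTTP on port 80:

```bash
# Read the load balancer address from the Terraform outputs.
LB_ADDRESS=$(terraform output -raw lb_address)

# Expect HTTP 200 once all instances pass their health checks.
curl --silent --output /dev/null --write-out '%{http_code}\n' "http://${LB_ADDRESS}/"
```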
Create and run the Load Testing Tool app
Warning
Before creating your Load Testing Tool app, create TodoList app containers and deploy the infrastructure.
- Go to the directory with the Load Testing Tool specification:
  cd load-testing-tool
- Initialize Terraform in the directory with the Load Testing Tool specification:
  terraform init
- In the load-testing-tool/main.tf file, specify the paths to the public and private SSH keys (the default values are ~/.ssh/id_ed25519.pub and ~/.ssh/id_ed25519).
- Deploy and run the VM:
  terraform apply -var yc_folder=<folder_id> -var yc_token=<yc_token> -var user=$USER -var overload_token=<overload_token>
  Where:
  - folder_id: The ID of the folder to deploy the Load Testing Tool in.
  - yc_token: The OAuth token of the user to deploy the Load Testing Tool under.
  - overload_token: The token used to connect to overload.yandex.net. To get the token, log in, click your profile at the top right, and select My api token from the drop-down menu.
- Connect to the created VM via SSH. The connection address is shown in the terraform apply command output.
- Run the Load Testing Tool:
  sudo yandex-tank -c load.yaml
- Go to overload.yandex.net and find the running test: Public tests -> show my tests only.
Running scenarios
VM failure
How the failure shows itself: The VM with the application is unavailable.
Possible cause:
- A failure of the physical host that the VM was running on.
- The VM with the application was deleted by mistake.
To simulate the failure, delete one of the instances in the group.
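For example, you can do this with the yc CLI (a sketch; todo-ig is the instance group name created by the Terraform spec):

```bash
# List the instances in the group and note the ID of one of them.
yc compute instance-group list-instances --name todo-ig

# Delete the chosen instance to simulate a hard failure.
yc compute instance delete <instance_id>
```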
Test bench reaction:
- The network load balancer and Instance Groups detect the VM failure and exclude the instance from load balancing: traffic stops coming to it and is distributed across the remaining instances in the group.
- Instance Groups automatically restores the group:
  - It deletes the unavailable VM instance (in this test case, the instance is already deleted, so this step is skipped).
  - It creates a new VM instance.
  - It waits for the application to start on the new instance.
  - It adds the instance back to load balancing.
The load balancer and Instance Groups require some time to detect the problem and disable traffic to the faulty VM instance. This may cause Connection Timeout errors (HTTP code 0
in the Quantities and HTTP codes charts in the Load Testing Tool monitoring app).
After disabling load balancing for the unavailable VM instance, the user load is handled correctly.
Application failure
How the failure shows itself: The application doesn't respond in time or doesn't work correctly from the user's point of view.
Possible cause:
- A memory leak caused the application to fail.
- The application is unable to continue due to a loss of DB connectivity.
- The application fails to handle requests due to heavy load.
According to the health check settings, Instance Groups polls the VM instances in the group over HTTP. During normal operation, a request to the /healthy endpoint returns HTTP code 200. Otherwise, Instance Groups starts the recovery procedure.
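You can run the same probe by hand against any instance (a sketch; it assumes the instance has a public IP address and the app listens on port 80):

```bash
# A healthy instance answers 200; a broken one answers 503.
curl --silent --output /dev/null --write-out '%{http_code}\n' "http://<instance_ip>/healthy"
```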
To simulate the failure, run the script:
fail_random_host.sh <group_id>
A random instance from the group will start returning HTTP code 503.
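If you do not have the group ID at hand, you can look it up first (a sketch; it assumes the yc CLI and jq are installed):

```bash
# Look up the instance group ID by its name.
GROUP_ID=$(yc compute instance-group get --name todo-ig --format json | jq -r '.id')

# Make a random instance in the group start answering 503.
./fail_random_host.sh "$GROUP_ID"
```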
Test bench reaction:
- Instance Groups detects the application failure and excludes the VM instance from load balancing: traffic stops coming to it and is distributed across the remaining instances in the group.
- Instance Groups automatically restores the group:
  - It restarts the faulty VM instance.
  - It waits for the application to start on the instance.
  - It adds the instance back to load balancing.
Instance Groups polls the instance several times before disabling traffic and starting the recovery procedure. This may cause Service Unavailable errors (HTTP code 503
in the Quantities and HTTP codes charts in the Load Testing Tool monitoring app).
After disabling load balancing for the faulty VM instance, the user load is handled correctly.
Availability zone failure
How the failure shows itself: Multiple VMs are unavailable in the same zone.
Possible cause:
- Data center downtime.
- Scheduled maintenance in the data center.
To move your resources to another data center:
- In the management console, select the folder with your instance group.
- From the list of services, select Compute Cloud.
- Click Instance groups.
- Select the todo-ig group.
- Click Edit.
- In the Distribution section, deselect the ru-central1-b availability zone.
- Click Save changes.
Test bench reaction:
- Instance Groups disables load balancing for the VMs in the ru-central1-b availability zone.
- These VMs are deleted and, at the same time, new VMs are created in the ru-central1-a zone.
- Instance Groups adds the created VMs to load balancing.
The number of VM instances that can be simultaneously created and deleted depends on the deployment policy.
While disabling load balancing for VM instances, Connection Timeout errors may occur (HTTP code 0
in the Quantities and HTTP codes charts in the Load Testing Tool monitoring app).
After disabling load balancing for the VM instances, the user load is handled correctly.
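You can watch the group rebalance from the command line (a sketch, assuming the yc CLI; the zone column shows instances leaving ru-central1-b):

```bash
# Re-list the group's instances every five seconds to watch the migration.
watch -n 5 'yc compute instance-group list-instances --name todo-ig'
```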
Updating an application
To update the application:
- In the management console, select the folder with your instance group.
- From the list of services, select Compute Cloud.
- Click Instance groups.
- Select the todo-ig group.
- Click Edit.
- Under Instance template, click the menu icon and select Edit.
- On the Container Solution tab, select the required Docker container.
- In the Docker image field of the window that opens, enter the name of the Docker image with the new app version.
- Click Apply.
- Click Save.
- Click Save changes.
Test bench reaction:
- Instance Groups disables load balancing for two VM instances with the outdated app version (these instances have the RUNNING_OUTDATED status).
- It deletes them while creating VM instances with the new app version.
- It adds the created VM instances to load balancing.
- The same actions are repeated for the remaining two instances with the outdated app version.
Refresh the app page. If the network load balancer sends your request to a VM instance that's already updated, you'll see the app version with a dark color scheme.
The number of VM instances that can be simultaneously created and deleted depends on the deployment policy.
While disabling load balancing for VM instances, Connection Timeout errors may occur (HTTP code 0
in the Quantities and HTTP codes charts in the Load Testing Tool monitoring app).
After disabling load balancing for the VM instances, the user load is handled correctly.
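The same rollout can be driven from Terraform instead of the console: change the image tag in the spec and re-apply. A sketch, assuming the spec in app/todo-service.tf references the image with its todo-demo:v1 tag (check the file first; the exact reference depends on the repository):

```bash
cd app

# Point the instance template at the v2 image (GNU sed syntax).
sed -i 's|todo-demo:v1|todo-demo:v2|' todo-service.tf

# Instance Groups rolls the change out gradually, as described above.
terraform apply -var yc_folder=<folder_id> -var yc_token=<yc_token> -var user=$USER
```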
Scaling a DB configuration
You may need to scale your DB if:
- Cluster host performance is insufficient to handle requests.
- The data requires more storage capacity.
To scale the DB configuration:
- In the [management console](https://console.yandex.cloud), select the folder with your DB cluster.
- From the list of services, select Managed Service for PostgreSQL.
- Select the todo-postgresql cluster.
- Click Edit cluster.
- Under Host class, select s2.medium.
- Click Save changes.
Managed Service for PostgreSQL will run the update command for the cluster.
When switching between the master and a replica (at the beginning and end of the update process), an Internal Server Error may occur (HTTP code 500
in the Quantities and HTTP codes charts in the Load Testing Tool monitoring app).
After switching, the user load is handled correctly.
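The same change can be made with the yc CLI instead of the console (a sketch; flag names can differ between CLI versions, so check yc managed-postgresql cluster update --help first):

```bash
# Change the host class of the cluster; this triggers the same
# master/replica switchover described above.
yc managed-postgresql cluster update todo-postgresql --resource-preset s2.medium
```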
Deleting applications and environments
Warning
If you created your VM with the Load Testing Tool, be sure to delete the VM first, otherwise deleting the Virtual Private Cloud will fail.
To delete the Load Testing Tool app, go to the load-testing-tool directory and run the command:
terraform destroy -var yc_folder=<folder_id> -var yc_token=<yc_token> -var user=$USER -var overload_token=not-used
To delete the TodoList app, go to the app
directory and run the following command:
terraform destroy -var yc_folder=<folder_id> -var yc_token=<yc_token> -var user=$USER
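To confirm that nothing is left running and incurring charges, you can list the key resources afterwards (assuming the yc CLI):

```bash
# All lists should be empty once both destroy commands complete.
yc compute instance-group list
yc managed-postgresql cluster list
yc vpc network list
```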