Configuring a fault-tolerant architecture in Yandex Cloud
In this tutorial, you will configure a fault-tolerant architecture in Yandex Cloud and test it with different scenarios.
By fault tolerance, we mean the ability of a system to operate despite failures in one or more of its components.
To configure and test the architecture:
- Get your cloud ready.
- Set up a test bench.
- Run the scenarios.

If you no longer need the resources you created, delete them.
Get your cloud ready
Sign up for Yandex Cloud and create a billing account:
- Go to the management console and log in to Yandex Cloud or create an account if you do not have one yet.
- On the Yandex Cloud Billing page, make sure you have a billing account linked and it has the ACTIVE or TRIAL_ACTIVE status. If you do not have a billing account, create one.

If you have an active billing account, you can go to the cloud page.
Learn more about clouds and folders.
Required paid resources
The cost of supporting a fault-tolerant Yandex Cloud architecture includes:
- Fee for disks and continuously running VMs (see Yandex Compute Cloud pricing).
- Fee for a continuously running Yandex Managed Service for PostgreSQL cluster (see Managed Service for PostgreSQL pricing).
- Fee for a dynamic or static public IP address (see Yandex Virtual Private Cloud pricing).
Set up a test bench
How a test bench works:
- The application is packaged into a Docker image and pushed to Yandex Container Registry. Docker images are deployed on four Container Optimized Image-based VMs. The VMs form a group and reside in two availability zones.
- A DB cluster managed by Managed Service for PostgreSQL consists of two hosts residing in different availability zones.
- The Load Testing Tool (you can find it on Yandex Cloud Marketplace) generates the workload applied to Yandex Network Load Balancer, which distributes traffic across the VMs.
Create TodoList application containers
To get the application ready to run in Yandex Cloud:
- Download and unpack the repository containing the demo application source code, Terraform specifications, and a failure simulation script.
- Navigate to the application directory:
  cd yandex-cloud-fault-tolerance-demo-master/app
- Authenticate in Container Registry:
  yc container registry configure-docker
- Create a registry:
  yc container registry create --name todo-registry
-
Create a Docker image tagged as
v1
:docker build . --tag cr.yandex/<registry_ID>/todo-demo:v1 --platform linux/amd64
-
Create a Docker image tagged as
v2
to test the application update:docker build . --build-arg COLOR_SCHEME=dark --tag cr.yandex/<registry_ID>/todo-demo:v2 --platform linux/amd64
-
Push the Docker images to Container Registry:
docker push cr.yandex/<registry_ID>/todo-demo:v1 docker push cr.yandex/<registry_ID>/todo-demo:v2
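The docker commands above include your registry ID in the image path. If you did not note it down when creating the registry, you can look it up with the CLI; a minimal sketch using the registry name from the earlier step:

  # List the registries in the current folder and copy the ID of todo-registry
  yc container registry list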
Deploy the infrastructure
To prepare your Yandex Cloud application environment:
- Navigate to the environment specification directory:
  cd ../terraform/app
- Initialize Terraform:
  terraform init
- Save the folder ID to the YC_FOLDER variable and the IAM token to the YC_TOKEN variable:
  export YC_FOLDER=<folder_ID>
  export YC_TOKEN=$(yc iam create-token)
- Generate a key to connect to a VM over SSH:
  ssh-keygen -t ed25519
- In the app/todo-service.tf file, specify the path to the public SSH key; the default value is ~/.ssh/id_ed25519.pub.
- Check the cloud quotas before deploying the required resources.
Information about the new resources
You will create the following resources:
- A Virtual Private Cloud network with three subnets, one in each availability zone.
- Two service accounts:
  - One with the editor role for managing an instance group.
  - Another with the container-registry.images.puller role for downloading a Docker image to a VM instance.
- An instance group of four Container Optimized Image VMs in the ru-central1-b and ru-central1-d availability zones.
- A Managed Service for PostgreSQL cluster with two hosts in the ru-central1-b and ru-central1-d availability zones.
- A network load balancer distributing traffic across the VM instances in the group.
- Deploy and run the application:
  terraform apply -var yc_folder=$YC_FOLDER -var yc_token=$YC_TOKEN -var user=$USER
  Where:
  - yc_folder: Folder where you will deploy the application.
  - yc_token: IAM token of the user deploying the application.
To access the application, navigate to the lb_address you received after running the terraform apply command.
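If you need these values again later, two helper commands may come in handy; a sketch, assuming the yc CLI has a configured profile and the Terraform output holding the balancer address is named lb_address:

  # Print the folder ID set in the current yc CLI profile
  yc config get folder-id
  # Re-print the load balancer address from the Terraform outputs
  terraform output lb_address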
Configure and run Load Testing Tool
Warning
Before creating your Load Testing Tool, create TodoList application containers and deploy the infrastructure.
- Navigate to the Load Testing Tool specification directory:
  cd ../tank
- Initialize Terraform:
  terraform init
- In the tank/main.tf file, specify paths to the public and private SSH keys; the default values are ~/.ssh/id_ed25519.pub and ~/.ssh/id_ed25519, respectively.
- Deploy and run the VM:
  terraform apply -var yc_folder=$YC_FOLDER -var yc_token=$YC_TOKEN -var user=$USER -var overload_token=<overload_token>
  Where:
  - yc_folder: Folder where you will deploy Load Testing Tool.
  - yc_token: IAM token of the Load Testing Tool user.
  - overload_token: Token to connect to overload.yandex.net. To get this token, log in, click your profile at the top right, and select My api token from the drop-down menu.
- Connect to the new VM over SSH. You can find the connection address in the terraform apply command output:
  ssh <username>@<VM_IP_address>
- Run Load Testing Tool:
  sudo yandex-tank -c load.yaml
- Navigate to overload.yandex.net and find your running task there: Public tests → show my tests only.
Running scenarios
VM failure
VM failure is a scenario in which a VM running your application becomes unavailable.
Possible causes:
- Failure of the VM physical host.
- You deleted the VM by mistake.
To simulate this failure, delete one of the VM instances from the group:
- In the management console, select your instance group folder.
- In the list of services, select Compute Cloud.
- In the left-hand panel, select Instance groups.
- Select the todo-ig group.
- Navigate to the Virtual machines panel.
- Click → Delete on the right of the VM you want to delete.
- In the window that opens, click Delete.
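Alternatively, you can simulate the same failure from the command line; a sketch using the yc CLI (substitute the ID or name of any instance from the todo-ig group):

  # List the compute instances in the folder and pick one belonging to todo-ig
  yc compute instance list
  # Delete the chosen instance to simulate the failure
  yc compute instance delete <instance_ID>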
Test bench reaction:
- The network load balancer and Instance Groups get information about the VM failure and exclude it from load balancing, redirecting its traffic to the remaining instances in the group.
- The Instance Groups service automatically starts the recovery process and:
  - Deletes the failed VM instance; in our scenario, the system skips this step because the instance is already deleted.
  - Creates a new VM.
  - Waits for the application to start on the new VM.
  - Adds the new VM to the load balancing group.
The load balancer and Instance Groups need some time to detect the problem and disable traffic to the failed VM. This may cause Connection Timeout errors (HTTP code 0) in the Quantities and HTTP codes charts of the Load Testing Tool monitoring application.
After the system removes the failed VM from the load balancing group, the other VMs take over its load.
Application failure
Application failure is a scenario in which your application does not respond in time or works incorrectly.
Possible causes:
- Memory leak
- DB connectivity loss
- Too many requests
According to the health check settings, the Instance Groups service polls the grouped VM instances over HTTP. A healthy VM returns a 200 status code in response to the /healthy request. Otherwise, the Instance Groups service starts the recovery process.
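You can also call this endpoint yourself through the load balancer; a quick check, assuming lb_address is the address returned by terraform apply and the listener forwards HTTP traffic on port 80:

  # A healthy instance behind the balancer should answer with HTTP 200
  curl -i http://<lb_address>/healthy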
To simulate this failure, run the following script from the yandex-cloud-fault-tolerance-demo-master repository:
fail_random_host.sh <instance_group_ID>
A random VM instance in the group will start returning a 503 error.
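If you do not know the instance group ID, you can look it up with the CLI; a quick sketch:

  # List the instance groups in the folder and copy the ID of todo-ig
  yc compute instance-group list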
Test bench reaction:
- The Instance Groups service identifies the application failure and excludes the relevant VM instance from load balancing, redirecting its traffic to the remaining instances.
- The Instance Groups service automatically starts the recovery process and:
  - Restarts the failed VM.
  - Waits for the application to start on the restarted VM.
  - Adds the VM back to the load balancing group.
The Instance Groups service polls the VM several times before disabling traffic and starting the recovery process. This may cause Service Unavailable errors (HTTP code 503) in the Quantities and HTTP codes charts of the Load Testing Tool monitoring application.
After the system removes the failed VM from the load balancing group, the other VMs take over its load.
Availability zone failure
Availability zone failure is a scenario in which multiple VMs in the same zone become unavailable.
Possible causes:
- Data center outage
- Data center maintenance
To move your resources to another data center:
- In the management console, select your instance group folder.
- In the list of services, select Compute Cloud.
- In the left-hand panel, select Instance groups.
- Select the todo-ig group.
- In the top-right corner, click Edit.
- Under Allocation, uncheck the ru-central1-b availability zone.
- Click Save.
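While the group is being rebuilt, you can watch the instances being recreated in the remaining zone; a sketch (the group name todo-ig comes from the steps above):

  # Show the current instances of the group, their zones, and statuses
  yc compute instance-group list-instances --name todo-ig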
Test bench reaction:
- The Instance Groups service excludes the ru-central1-b availability zone VMs from load balancing.
- The system deletes these VMs, creating new VMs in their stead in the ru-central1-d zone.
- The Instance Groups service adds the new VMs to the load balancing group.
The number of VMs that can be created and deleted at one time depends on the deployment policy.
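To see what your group's deployment policy currently allows, you can inspect the group specification; a sketch:

  # Print the group specification; the deploy_policy section shows how many VMs may be created and deleted at a time
  yc compute instance-group get --name todo-ig --format yaml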
Removing VMs from the load balancing group can cause Connection Timeout errors (HTTP code 0) in the Quantities and HTTP codes charts of the Load Testing Tool monitoring application.
After the system removes these VMs from the load balancing group, the other VMs take over their load.
Application update
To update your application:
- In the management console, select your instance group folder.
- In the list of services, select Compute Cloud.
- In the left-hand panel, select Instance groups.
- Select the todo-ig group.
- In the top-right corner, click Edit.
- Under Instance template, click and select Edit.
- Under Boot disk image, navigate to the Container Solution tab.
- Select the required Docker container and click → Edit.
- In the window that opens, specify the application image tagged as v2 in the Docker image field.
- Click Apply.
- Click Save.
- Click Save on the Changing an instance group page.
Test bench reaction:
- The Instance Groups service excludes two VMs with the outdated application from load balancing, assigning them the RUNNING_OUTDATED status.
- The system deletes these VMs, creating new VMs with the new application version in their stead.
- The Instance Groups service adds the new VMs to the load balancing group.
- The system repeats the operations above for the two remaining VMs with the outdated application version.
Refresh the application page. If the network load balancer sends your request to an updated VM, you will see the dark theme application version.
The number of VMs that can be created and deleted at one time depends on the deployment policy.
Removing VMs from the load balancing group can cause Connection Timeout errors (HTTP code 0) in the Quantities and HTTP codes charts of the Load Testing Tool monitoring application.
After the system removes the outdated VMs from the load balancing group, the other VMs take over their load.
Scaling your DB
You may need to scale your DB in the following cases:
- Cluster performance is insufficient to handle requests.
- The data requires more storage capacity.
To scale your DB:
- In the management console, select your DB cluster folder.
- In the list of services, select Managed Service for PostgreSQL.
- Select the todo-postgresql cluster.
- Click Edit.
- Under Host class, select s2.medium.
- Click Save changes.
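The same change can also be made with the CLI; a sketch, assuming the s2.medium host class is available in your folder (as in the console steps above):

  # Change the cluster host class to s2.medium
  yc managed-postgresql cluster update todo-postgresql --resource-preset s2.medium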
Managed Service for PostgreSQL will start updating the cluster.
Switching between the master and replica servers at the beginning and end of the update process can cause Internal Server Error responses (HTTP code 500) in the Quantities and HTTP codes charts of the Load Testing Tool monitoring application.
After switching is complete, the cluster will process the requests correctly.
Deleting applications and cleaning up your environment
Warning
If you created your VM with Load Testing Tool, make sure to delete it first; otherwise, deleting the Virtual Private Cloud network will fail.
To delete the Load Testing Tool app, navigate to the yandex-cloud-fault-tolerance-demo-master/terraform/tank folder and run this command:
terraform destroy -var yc_folder=$YC_FOLDER -var yc_token=$YC_TOKEN -var user=$USER -var overload_token=not-used
To delete the TodoList application, navigate to the yandex-cloud-fault-tolerance-demo-master/terraform/app folder and run this command:
terraform destroy -var yc_folder=$YC_FOLDER -var yc_token=$YC_TOKEN -var user=$USER
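The container registry was created with the yc CLI rather than Terraform, so terraform destroy does not remove it. A sketch for cleaning it up (delete the pushed images first, then the registry):

  # List and delete the images pushed in this tutorial
  yc container image list --registry-name todo-registry
  yc container image delete <image_ID>
  # Delete the registry itself
  yc container registry delete todo-registry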