Configuring a fault-tolerant architecture in Yandex Cloud
In this tutorial, you will learn how to configure a fault-tolerant architecture in Yandex Cloud and check how it works using various test cases.
Fault tolerance is the property that enables a system to continue operating after a failure of one or more of its components.
To configure and test the architecture:
- Prepare your cloud.
- Set up a test bench.
- Run the scenarios.
If you no longer need the resources you created, delete them.
Prepare your cloud
Sign up for Yandex Cloud and create a billing account:
- Go to the management console and log in to Yandex Cloud or create an account if you do not have one yet.
- On the Yandex Cloud Billing page, make sure you have a billing account linked and it has the ACTIVE or TRIAL_ACTIVE status. If you do not have a billing account, create one.
If you have an active billing account, you can go to the cloud page to create or select a folder for your infrastructure.
Learn more about clouds and folders.
Required paid resources
The cost for supporting a fault-tolerant Yandex Cloud architecture includes:
- Fee for disks and continuously running VMs (see Yandex Compute Cloud pricing).
- Fee for a continuously running Yandex Managed Service for PostgreSQL cluster (see Managed Service for PostgreSQL pricing).
- Fee for using a dynamic or static public IP address (see Yandex Virtual Private Cloud pricing).
Set up a test bench
Description of the test bench:
- The application is packaged into a Docker image and pushed to Yandex Container Registry.
  Docker images are deployed on four VMs based on a Container Optimized Image. The VMs are grouped and located in two different availability zones.
- A DB cluster is managed by Managed Service for PostgreSQL and consists of two hosts that reside in different availability zones.
- The Load Testing Tool application from Yandex Cloud Marketplace generates the workload that is applied to Yandex Network Load Balancer. The network load balancer distributes traffic across VMs.
Create TodoList app containers
To prepare the application to run in Yandex Cloud:
- Download and unpack the repository with the demo application source code, the Terraform specifications, and a script to simulate the application failure.
- Go to the repository:
  cd yandex-cloud-fault-tolerance-demo-master/app
- Get authenticated in Container Registry:
  yc container registry configure-docker
- Create a registry:
  yc container registry create --name todo-registry
- Create a Docker image with the v1 tag:
  docker build . --tag cr.yandex/<registry_ID>/todo-demo:v1 --platform linux/amd64
- Create a Docker image with the v2 tag (to test the application update case):
  docker build . --build-arg COLOR_SCHEME=dark --tag cr.yandex/<registry_ID>/todo-demo:v2 --platform linux/amd64
- Push the Docker images to Container Registry:
  docker push cr.yandex/<registry_ID>/todo-demo:v1
  docker push cr.yandex/<registry_ID>/todo-demo:v2
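The image references used in the steps above all follow the same pattern, so they are easy to mistype. The following is a minimal Python sketch, not part of the demo repository, that assembles the same docker build and docker push argument lists from a registry ID and a tag. The function names (image_ref, build_command, push_command) are hypothetical helpers introduced only for illustration.

```python
import shlex
from typing import Dict, List, Optional

REGISTRY_HOST = "cr.yandex"  # registry host used in the commands above


def image_ref(registry_id: str, tag: str, name: str = "todo-demo") -> str:
    """Return a full image reference, e.g. cr.yandex/<registry_ID>/todo-demo:v1."""
    if not registry_id or "/" in registry_id:
        raise ValueError("registry_id must be a non-empty ID without slashes")
    return f"{REGISTRY_HOST}/{registry_id}/{name}:{tag}"


def build_command(registry_id: str, tag: str,
                  build_args: Optional[Dict[str, str]] = None) -> List[str]:
    """Assemble the docker build argument list shown in the steps above."""
    cmd = ["docker", "build", "."]
    for key, value in (build_args or {}).items():
        cmd += ["--build-arg", f"{key}={value}"]
    cmd += ["--tag", image_ref(registry_id, tag), "--platform", "linux/amd64"]
    return cmd


def push_command(registry_id: str, tag: str) -> List[str]:
    """Assemble the docker push argument list for one tag."""
    return ["docker", "push", image_ref(registry_id, tag)]


if __name__ == "__main__":
    registry_id = "my-registry-id"  # hypothetical value; use your own registry ID
    for argv in (
        build_command(registry_id, "v1"),
        build_command(registry_id, "v2", {"COLOR_SCHEME": "dark"}),
        push_command(registry_id, "v1"),
        push_command(registry_id, "v2"),
    ):
        # Print only; pass argv to subprocess.run to actually execute it.
        print(shlex.join(argv))
```

The sketch only prints the commands. Running them still requires Docker and the Container Registry authentication described above.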
Deploy the infrastructure
To prepare the environment for running the application in Yandex Cloud:
- Go to the directory with the environment specification:
  cd ../terraform/app
- Initialize Terraform in the spec directory:
  terraform init
- Save the folder ID to the YC_FOLDER variable and the IAM token to the YC_TOKEN variable:
  export YC_FOLDER=<folder_ID>
  export YC_TOKEN=$(yc iam create-token)
- Generate a key to connect to a VM via SSH:
  ssh-keygen -t ed25519
- In the file named app/todo-service.tf, specify the path to the public SSH key (the default value is ~/.ssh/id_ed25519.pub).
- Check the cloud quotas to be able to deploy the resources you need.
Information about resources to be created
The following resources will be created:
- Virtual Private Cloud network with three subnets in all availability zones.
- Two service accounts:
  - Service account for managing a VM group with the editor role.
  - Service account for downloading a Docker image to a VM with the container-registry.images.puller role.
- VM group of four Container Optimized Image-based VMs in the ru-central1-b and ru-central1-d availability zones.
- Managed Service for PostgreSQL cluster with two hosts in the ru-central1-b and ru-central1-d availability zones.
- Network load balancer to distribute traffic between the group's VMs.
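- Deploy and run the application:
  terraform apply -var yc_folder=$YC_FOLDER -var yc_token=$YC_TOKEN -var user=$USER
  Where:
  - yc_folder: Folder where the application will be deployed.
  - yc_token: IAM token of the user to deploy the application.
To access the application, go to the lb_address received after running the terraform apply command.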
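The terraform apply call above depends on three environment variables being set. As a hedged illustration, the Python sketch below checks that YC_FOLDER, YC_TOKEN, and USER are present before assembling the same -var arguments. The function name terraform_argv and the sample values are assumptions made for this example and are not part of the demo repository.

```python
from typing import List, Mapping, Optional

REQUIRED_ENV = ("YC_FOLDER", "YC_TOKEN", "USER")


def terraform_argv(action: str, env: Mapping[str, str],
                   extra_vars: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build the argument list for `terraform apply` or `terraform destroy`."""
    if action not in ("apply", "destroy"):
        raise ValueError(f"unsupported action: {action}")
    missing = [name for name in REQUIRED_ENV if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    variables = {
        "yc_folder": env["YC_FOLDER"],
        "yc_token": env["YC_TOKEN"],
        "user": env["USER"],
    }
    variables.update(extra_vars or {})
    argv = ["terraform", action]
    for name, value in variables.items():
        argv += ["-var", f"{name}={value}"]
    return argv


if __name__ == "__main__":
    # Sample values for illustration only; in practice pass os.environ.
    sample_env = {"YC_FOLDER": "folder-id", "YC_TOKEN": "iam-token", "USER": "alice"}
    print(terraform_argv("apply", sample_env))
```

Failing early on a missing variable avoids starting a deployment with an empty folder ID or token.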
Create and run the Load Testing Tool app
Warning
Before creating your Load Testing Tool app, create TodoList app containers and deploy the infrastructure.
- Go to the directory with the Load Testing Tool specification:
  cd ../tank
- Initialize Terraform in the directory with the Load Testing Tool specification:
  terraform init
- In the file named tank/main.tf, specify the path to the public and private SSH keys (the default values are ~/.ssh/id_ed25519.pub and ~/.ssh/id_ed25519).
- Deploy and run the VM:
  terraform apply -var yc_folder=$YC_FOLDER -var yc_token=$YC_TOKEN -var user=$USER -var overload_token=<overload_token>
  Where:
  - yc_folder: Folder where Load Testing Tool will be deployed.
  - yc_token: IAM token of the user to deploy Load Testing Tool.
  - overload_token: Token to connect to overload.yandex.net. To get the token, log in, click your profile at the top right, and select My api token from the drop-down menu.
- Connect to the created VM via SSH. The connection address is specified in the terraform apply command output:
  ssh <username>@<VM_IP_address>
- Run Load Testing Tool:
  sudo yandex-tank -c load.yaml
- Go to overload.yandex.net and find the running load there: Public tests → show my tests only.
Running scenarios
VM failure
How the failure shows itself: The VM with the application is unavailable.
Possible causes:
- A failure of the physical host that the VM was running on.
- The VM with the application was deleted by mistake.
To simulate the failure, delete one of the instances in the group:
- In the management console, select the folder with your VM group.
- In the list of services, select Compute Cloud.
- In the left-hand panel, select Instance groups.
- Select the todo-ig group.
- Go to the Virtual machines panel.
- In the row with the VM you need, click the menu icon → Delete.
- In the window that opens, click Delete.
Test bench reaction:
- The network load balancer and Instance Groups get information about the VM failure and exclude it from load balancing: traffic stops coming to this VM instance and is distributed across the remaining instances in the group.
- Instance Groups starts automatic recovery:
  - Deletes the unavailable VM instance (in this test case, it is already deleted and this step is skipped).
  - Creates a new VM instance.
  - Waits for the application to start on the VM instance.
  - Adds the VM instance to load balancing.
The load balancer and Instance Groups require some time to detect the problem and disable traffic to the faulty VM instance. This may cause Connection Timeout errors (HTTP code 0 on the Quantities and HTTP codes charts in the Load Testing Tool monitoring app).
After disabling load balancing for the unavailable VM instance, the user load is handled correctly.
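To make the redistribution step concrete, here is a simplified Python sketch of traffic being spread over the instances that remain in load balancing. It uses plain round-robin, which is an assumption chosen for clarity and is not a description of how Yandex Network Load Balancer actually selects targets. The VM names are hypothetical.

```python
from collections import Counter
from itertools import cycle
from typing import List, Set


def distribute(requests: int, targets: List[str], healthy: Set[str]) -> Counter:
    """Spread requests round-robin over the targets that are still healthy."""
    live = [t for t in targets if t in healthy]
    if not live:
        raise RuntimeError("no healthy targets left in the group")
    # Start every target at zero so excluded targets show up with a count of 0.
    counts = Counter({t: 0 for t in targets})
    for _, target in zip(range(requests), cycle(live)):
        counts[target] += 1
    return counts


if __name__ == "__main__":
    vms = ["vm-1", "vm-2", "vm-3", "vm-4"]  # hypothetical instance names
    print("all healthy:  ", dict(distribute(12, vms, set(vms))))
    print("vm-2 excluded:", dict(distribute(12, vms, {"vm-1", "vm-3", "vm-4"})))
```

With one of four instances excluded, each remaining instance takes a third of the traffic instead of a quarter, so the group needs enough spare capacity to absorb that increase.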
Application failure
How the failure shows itself: The application doesn't respond in time or doesn't work correctly from the user's point of view.
Possible causes:
- A memory leak caused the application to fail.
- The application is unable to continue due to DB connectivity loss.
- The application fails to handle requests due to heavy load.
According to health check settings, Instance Groups polls VM instances in the group over HTTP. When operating normally, accessing the /healthy endpoint returns the HTTP code 200. Otherwise, Instance Groups starts the recovery procedure.
To simulate the failure, run this script from the yandex-cloud-fault-tolerance-demo-master repository:
  fail_random_host.sh <VM_group_ID>
A random VM from the group will start returning the HTTP code 503.
Test bench reaction:
- Instance Groups receives information about the application failure and excludes the VM instance from load balancing: traffic stops coming to this instance and is distributed across the remaining instances in the group.
- Instance Groups starts automatic recovery:
  - Restarts the faulty VM instance.
  - Waits for the application to start on the VM instance.
  - Adds the VM instance to load balancing.
Instance Groups polls the instance several times before disabling traffic and starting the recovery procedure. This may cause Service Unavailable errors (HTTP code 503 on the Quantities and HTTP codes charts in the Load Testing Tool monitoring app).
After disabling load balancing for the faulty VM instance, the user load is handled correctly.
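The reason some 503 responses reach users can be sketched with a consecutive-failure counter: an instance is declared unhealthy only after several failed probes in a row. The Python example below is a minimal model of that idea. The threshold value of 3 is an assumption for illustration; the actual value comes from the health check settings of the instance group.

```python
from typing import Iterable, Optional


def first_unhealthy_probe(status_codes: Iterable[int],
                          unhealthy_threshold: int = 3) -> Optional[int]:
    """Return the 0-based index of the probe at which the instance is declared
    unhealthy, or None if the threshold is never reached.

    A probe passes only when /healthy answers with HTTP 200; any successful
    probe resets the failure counter.
    """
    if unhealthy_threshold < 1:
        raise ValueError("unhealthy_threshold must be at least 1")
    failures = 0
    for index, code in enumerate(status_codes):
        if code == 200:
            failures = 0
            continue
        failures += 1
        if failures >= unhealthy_threshold:
            return index
    return None


if __name__ == "__main__":
    probes = [200, 200, 503, 503, 503, 503]  # hypothetical probe results
    print("declared unhealthy at probe index:", first_unhealthy_probe(probes))
```

Every user request routed to the instance between the first failed probe and the probe that crosses the threshold can still receive a 503 response.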
Availability zone failure
How the failure shows itself: Multiple VMs are unavailable in the same zone.
Possible causes:
- Data center downtime.
- Scheduled maintenance in the data center.
To move your resources to another data center:
- In the management console, select the folder with your VM group.
- In the list of services, select Compute Cloud.
- In the left-hand panel, select Instance groups.
- Select the todo-ig group.
- In the top-right corner, click Edit.
- Under Allocation, uncheck the ru-central1-b availability zone.
- Click Save.
Test bench reaction:
- Instance Groups disables load balancing for the VMs in the ru-central1-b availability zone.
- These VMs will be deleted while at the same time new VMs will be created in the ru-central1-d zone.
- Instance Groups adds the created VMs to load balancing.
The number of VM instances that can be simultaneously created and deleted depends on the deployment policy.
When disabling load balancing for VMs, Connection Timeout errors may occur (HTTP code 0 on the Quantities and HTTP codes charts in the Load Testing Tool monitoring app).
After disabling load balancing for the VM instances, the user load is handled correctly.
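The effect of removing a zone from the allocation can be expressed as a difference between two placements. The Python sketch below assumes the group spreads its instances as evenly as possible across the selected zones. That even spread is an assumption of this example, not a documented guarantee, and the function names are hypothetical.

```python
from typing import Dict, List


def allocate(size: int, zones: List[str]) -> Dict[str, int]:
    """Spread `size` instances across `zones` as evenly as possible."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    base, extra = divmod(size, len(zones))
    # The first `extra` zones receive one additional instance.
    return {zone: base + (1 if i < extra else 0) for i, zone in enumerate(zones)}


def plan_zone_change(size: int, old_zones: List[str], new_zones: List[str]) -> Dict[str, int]:
    """Return per-zone deltas: negative means delete VMs, positive means create VMs."""
    before = allocate(size, old_zones)
    after = allocate(size, new_zones)
    return {
        zone: after.get(zone, 0) - before.get(zone, 0)
        for zone in sorted(set(before) | set(after))
    }


if __name__ == "__main__":
    plan = plan_zone_change(4, ["ru-central1-b", "ru-central1-d"], ["ru-central1-d"])
    print(plan)
```

For the four-VM group in this tutorial, the plan amounts to deleting the VMs in ru-central1-b and creating the same number in ru-central1-d.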
Updating an application
To update the application:
- In the management console, select the folder with your VM group.
- In the list of services, select Compute Cloud.
- In the left-hand panel, select Instance groups.
- Select the todo-ig group.
- In the top-right corner, click Edit.
- Under Instance template, click the menu icon and select Edit.
- Under Boot disk image, go to the Container Solution tab.
- Select the Docker container you need and click the menu icon → Edit.
- In the window that opens, specify the image with the new application version (the v2 tag) in the Docker image field.
- Click Apply.
- Click Save.
- Click Save on the Changing an instance group page.
Test bench reaction:
- Instance Groups disables load balancing for two VMs with the outdated application version (the status of these VMs is RUNNING_OUTDATED).
- Instance Groups deletes these VMs while at the same time creating new VMs with the new app version.
- Instance Groups enables load balancing for the new VMs.
- The actions are repeated for the remaining two VMs with the outdated app version.
Refresh the app page. If the network load balancer sends your request to a VM instance that's already updated, you'll see the app version with a dark color scheme.
The number of VM instances that can be simultaneously created and deleted depends on the deployment policy.
When disabling load balancing for VMs, Connection Timeout errors may occur (HTTP code 0 on the Quantities and HTTP codes charts in the Load Testing Tool monitoring app).
After disabling load balancing for the VM instances, the user load is handled correctly.
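The update described above proceeds in batches of two VMs. As a rough model, the Python sketch below splits a group into update batches and computes how many instances keep serving traffic while one batch is being replaced. The batch size of 2 mirrors the behaviour described in this section; in a real group it is determined by the deployment policy, and the function names here are hypothetical.

```python
from typing import List


def rolling_batches(instances: List[str], batch_size: int) -> List[List[str]]:
    """Split instances into consecutive batches that are updated one after another."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [instances[i:i + batch_size] for i in range(0, len(instances), batch_size)]


def in_service_during_batch(total: int, batch_size: int) -> int:
    """Instances still receiving traffic while one batch is out of load balancing."""
    return max(total - batch_size, 0)


if __name__ == "__main__":
    vms = ["vm-1", "vm-2", "vm-3", "vm-4"]  # hypothetical instance names
    for number, batch in enumerate(rolling_batches(vms, 2), start=1):
        print(f"batch {number}: replace {batch}, "
              f"{in_service_during_batch(len(vms), len(batch))} instances keep serving")
```

A smaller batch size keeps more capacity online during the update at the cost of a longer rollout.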
Scaling a DB configuration
You may need to scale your DB if:
- Cluster host performance is insufficient to handle requests.
- The data requires more storage capacity.
To scale the DB configuration:
- In the management console, select the folder with your DB cluster.
- In the list of services, select Managed Service for PostgreSQL.
- Select the todo-postgresql cluster.
- Click Edit.
- Under Host class, select s2.medium.
- Click Save changes.
Managed Service for PostgreSQL will run the update command for the cluster.
When switching between the master and replica (at the beginning and end of the update process), an Internal Server Error may occur (HTTP code 500 on the Quantities and HTTP codes charts in the Load Testing Tool monitoring app).
After switching, the user load is handled correctly.
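Because the errors during a master switch are short-lived, a client can often ride them out by retrying. The Python sketch below shows one possible client-side retry loop with exponential backoff. It is an illustration under assumptions, not something the demo application or the Load Testing Tool is known to do. The request callable, the attempt count, and the delays are all hypothetical.

```python
import time
from typing import Callable


def call_with_retry(request: Callable[[], int], attempts: int = 5,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Call request() until it returns a non-5xx status or attempts run out.

    `request` is any callable that performs one HTTP call and returns its
    status code. The delay doubles after every failed attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    status = request()
    for attempt in range(1, attempts):
        if status < 500:
            break
        sleep(base_delay * 2 ** (attempt - 1))
        status = request()
    return status


if __name__ == "__main__":
    # Simulated responses: two errors during the switch, then success.
    responses = iter([500, 500, 200])
    final = call_with_retry(lambda: next(responses), sleep=lambda seconds: None)
    print("final status:", final)
```

Retries are only safe for idempotent requests; repeating a non-idempotent write after a 500 response may apply it twice.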
Deleting applications and environments
Warning
If you created a VM with the Load Testing Tool, be sure to delete the VM first, otherwise deleting the Virtual Private Cloud network will fail.
To delete the Load Testing Tool app, go to the yandex-cloud-fault-tolerance-demo-master/terraform/tank folder and run the following command:
  terraform destroy -var yc_folder=$YC_FOLDER -var yc_token=$YC_TOKEN -var user=$USER -var overload_token=not-used
To delete the TodoList app, go to the yandex-cloud-fault-tolerance-demo-master/terraform/app folder and run the following command:
  terraform destroy -var yc_folder=$YC_FOLDER -var yc_token=$YC_TOKEN -var user=$USER