Configuring a fault-tolerant architecture in Yandex Cloud
In this guide, you'll learn how to configure a fault-tolerant architecture in Yandex Cloud and check how it works using various test cases.
Fault tolerance is the property that enables a system to continue operating after a failure of one or more of its components.
To configure and test the architecture:
- Prepare your cloud.
- Set up a test bench.
- Run the scenarios.
If you no longer need these resources, delete them.
Prepare your cloud
Sign up for Yandex Cloud and create a billing account:
- Go to the management console and log in to Yandex Cloud or create an account if you do not have one yet.
- On the Yandex Cloud Billing page, make sure you have a billing account linked and that it has the ACTIVE or TRIAL_ACTIVE status. If you do not have a billing account, create one.
If you have an active billing account, you can go to the cloud page.
Learn more about clouds and folders.
Required paid resources
The cost for supporting a fault-tolerant Yandex Cloud architecture includes:
- A fee for the disks and the continuously running VMs (see Yandex Compute Cloud pricing).
- A fee for the continuously running Yandex Managed Service for PostgreSQL cluster (see Managed Service for PostgreSQL pricing).
- A fee for using a dynamic or static public IP address (see Yandex Virtual Private Cloud pricing).
Set up a test bench
Description of the test bench
- The application is packaged into a Docker image and pushed to Yandex Container Registry. The Docker images are deployed on four VMs based on a Container Optimized Image. The VMs form an instance group and are located in two different availability zones.
- The DB cluster is managed by Managed Service for PostgreSQL and consists of two hosts that reside in different availability zones.
- The Load Testing Tool application from Yandex Cloud Marketplace generates the workload that is applied to Yandex Network Load Balancer. The network load balancer distributes traffic across the VMs.
Create TodoList app containers
To prepare the application to run in Yandex Cloud:
- Authenticate in Container Registry:
  yc container registry configure-docker
- Create a registry:
  yc container registry create --name todo-registry
- Build a Docker image with the v1 tag:
  docker build . --tag cr.yandex/<registry_id>/todo-demo:v1
- Build a Docker image with the v2 tag (to test the application update scenario):
  docker build . --build-arg COLOR_SCHEME=dark --tag cr.yandex/<registry_id>/todo-demo:v2
- Push the Docker images to Container Registry:
  docker push cr.yandex/<registry_id>/todo-demo:v1
  docker push cr.yandex/<registry_id>/todo-demo:v2
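To confirm that both tags reached the registry, you can list its images (an optional check, assuming the yc CLI is installed and authorized):

```bash
# List the images stored in the registry; both the v1 and v2 tags should appear.
yc container image list --registry-name todo-registry
```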
Deploy the infrastructure
To prepare the environment for running the application in Yandex Cloud:
- Install Terraform.
- Download the repository with the demo application source code, the Terraform specs, and a script that simulates an application failure.
- Go to the directory with the environment specification:
  cd app
- Initialize Terraform in the spec directory:
  terraform init
- In the app/todo-service.tf file, specify the path to the public SSH key (the default value is ~/.ssh/id_ed25519.pub).
- Deploy and run the application:
  terraform apply -var yc_folder=<folder_id> -var yc_token=<yc_token> -var user=$USER
  Where:
  - folder_id: The ID of the folder to deploy the application in.
  - yc_token: The OAuth token of the user to deploy the application under.
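If you are not sure which values to pass, the yc CLI can show them (a quick check, assuming the CLI is installed and initialized with yc init):

```bash
# Show the active profile settings, including the default folder ID and token.
yc config list

# Or list all folders available to you in the current cloud.
yc resource-manager folder list
```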
The following resources are created:
- A Virtual Private Cloud network with three subnets, one in each availability zone.
- Two service accounts:
  - A service account with the editor role for managing the VM instance group.
  - A service account with the container-registry.images.puller role for pulling Docker images to the VMs.
- An instance group of four Container Optimized Image-based VM instances in the ru-central1-a and ru-central1-b availability zones.
- A Managed Service for PostgreSQL cluster with two hosts in the ru-central1-a and ru-central1-b availability zones.
- A network load balancer that distributes traffic across the VM instances in the group.
To access the application, open the lb_address value returned by the terraform apply command.
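A quick way to check the deployment from the command line is to probe the load balancer with curl. This is a sketch: it assumes the spec exposes the address as a Terraform output named lb_address and that the app answers plain HTTP on port 80:

```bash
# Read the load balancer address from the Terraform outputs.
LB_ADDRESS=$(terraform output -raw lb_address)

# Expect HTTP 200 once all instances pass their health checks.
curl --silent --output /dev/null --write-out '%{http_code}\n' "http://${LB_ADDRESS}/"
```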
Create and run the Load Testing Tool app
Warning
Before creating your Load Testing Tool app, create TodoList app containers and deploy the infrastructure.
- Go to the directory with the Load Testing Tool specification:
  cd load-testing-tool
- Initialize Terraform in the directory with the Load Testing Tool specification:
  terraform init
- In the load-testing-tool/main.tf file, specify the paths to the public and private SSH keys (the default values are ~/.ssh/id_ed25519.pub and ~/.ssh/id_ed25519).
- Deploy and run the VM:
  terraform apply -var yc_folder=<folder_id> -var yc_token=<yc_token> -var user=$USER -var overload_token=<overload_token>
  Where:
  - folder_id: The ID of the folder to deploy the Load Testing Tool in.
  - yc_token: The OAuth token of the user to deploy the Load Testing Tool under.
  - overload_token: The token used to connect to overload.yandex.net. To get the token, log in, click your profile at the top right, and select My api token from the drop-down menu.
- Connect to the created VM via SSH. The connection address is shown in the terraform apply command output.
- Run the Load Testing Tool:
  sudo yandex-tank -c load.yaml
- Go to overload.yandex.net and find the running test: Public tests -> show my tests only.
Running scenarios
VM failure
How the failure shows itself: The VM with the application is unavailable.
Possible cause:
- A failure of the physical host that the VM was running on.
- The VM with the application was deleted by mistake.
To simulate the failure, delete one of the instances in the group.
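For example, you can do this with the yc CLI (a sketch; todo-ig is the instance group name created by the Terraform spec):

```bash
# List the instances in the group and note the ID of one of them.
yc compute instance-group list-instances --name todo-ig

# Delete the chosen instance to simulate a hard failure.
yc compute instance delete <instance_id>
```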
Test bench reaction:
- The network load balancer and Instance Groups detect the VM failure and exclude the instance from load balancing: traffic stops coming to it and is distributed across the remaining instances in the group.
- Instance Groups automatically restores the group:
  - It deletes the unavailable VM instance (in this test case, the instance is already deleted, so this step is skipped).
  - It creates a new VM instance.
  - It waits for the application to start on the new instance.
  - It adds the instance back to load balancing.
The load balancer and Instance Groups require some time to detect the problem and disable traffic to the faulty VM instance. This may cause Connection Timeout errors (HTTP code 0
in the Quantities and HTTP codes charts in the Load Testing Tool monitoring app).
After disabling load balancing for the unavailable VM instance, the user load is handled correctly.
Application failure
How the failure shows itself: The application doesn't respond in time or doesn't work correctly from the user's point of view.
Possible cause:
- A memory leak caused the application to fail.
- The application is unable to continue due to a loss of DB connectivity.
- The application fails to handle requests due to heavy load.
According to the health check settings, Instance Groups polls the VM instances in the group over HTTP. During normal operation, a request to the /healthy endpoint returns HTTP code 200. Otherwise, Instance Groups starts the recovery procedure.
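You can run the same probe by hand against any instance (a sketch; it assumes the instance has a public IP address and the app listens on port 80):

```bash
# A healthy instance answers 200; a broken one answers 503.
curl --silent --output /dev/null --write-out '%{http_code}\n' "http://<instance_ip>/healthy"
```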
To simulate the failure, run the script:
fail_random_host.sh <group_id>
A random instance from the group will start returning HTTP code 503.
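If you do not have the group ID at hand, you can look it up first (a sketch; it assumes the yc CLI and jq are installed):

```bash
# Look up the instance group ID by its name.
GROUP_ID=$(yc compute instance-group get --name todo-ig --format json | jq -r '.id')

# Make a random instance in the group start answering 503.
./fail_random_host.sh "$GROUP_ID"
```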
Test bench reaction:
- Instance Groups detects the application failure and excludes the VM instance from load balancing: traffic stops coming to it and is distributed across the remaining instances in the group.
- Instance Groups automatically restores the group:
  - It restarts the faulty VM instance.
  - It waits for the application to start on the instance.
  - It adds the instance back to load balancing.
Instance Groups polls the instance several times before disabling traffic and starting the recovery procedure. This may cause Service Unavailable errors (HTTP code 503
in the Quantities and HTTP codes charts in the Load Testing Tool monitoring app).
After disabling load balancing for the faulty VM instance, the user load is handled correctly.
Availability zone failure
How the failure shows itself: Multiple VMs are unavailable in the same zone.
Possible cause:
- Data center downtime.
- Scheduled maintenance in the data center.
To move your resources to another data center:
- In the management console, select the folder with your instance group.
- From the list of services, select Compute Cloud.
- Click Instance groups.
- Select the todo-ig group.
- Click Edit.
- In the Distribution section, deselect the ru-central1-b availability zone.
- Click Save changes.
Test bench reaction:
- Instance Groups disables load balancing for the VMs in the ru-central1-b availability zone.
- These VMs are deleted and, at the same time, new VMs are created in the ru-central1-a zone.
- Instance Groups adds the created VMs to load balancing.
The number of VM instances that can be simultaneously created and deleted depends on the deployment policy.
While disabling load balancing for VM instances, Connection Timeout errors may occur (HTTP code 0
in the Quantities and HTTP codes charts in the Load Testing Tool monitoring app).
After disabling load balancing for the VM instances, the user load is handled correctly.
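You can watch the group rebalance from the command line (a sketch, assuming the yc CLI; the zone column shows instances leaving ru-central1-b):

```bash
# Re-list the group's instances every five seconds to watch the migration.
watch -n 5 'yc compute instance-group list-instances --name todo-ig'
```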
Updating an application
To update the application:
- In the management console, select the folder with your instance group.
- From the list of services, select Compute Cloud.
- Click Instance groups.
- Select the todo-ig group.
- Click Edit.
- Under Instance template, click the menu icon and select Edit.
- On the Container Solution tab, select the required Docker container.
- In the Docker image field of the window that opens, enter the name of the Docker image with the new app version.
- Click Apply.
- Click Save.
- Click Save changes.
Test bench reaction:
- Instance Groups disables load balancing for two VM instances with the outdated app version (these instances have the RUNNING_OUTDATED status).
- It deletes them while creating VM instances with the new app version.
- It adds the created VM instances to load balancing.
- The same actions are repeated for the remaining two instances with the outdated app version.
Refresh the app page. If the network load balancer sends your request to a VM instance that's already updated, you'll see the app version with a dark color scheme.
The number of VM instances that can be simultaneously created and deleted depends on the deployment policy.
While disabling load balancing for VM instances, Connection Timeout errors may occur (HTTP code 0
in the Quantities and HTTP codes charts in the Load Testing Tool monitoring app).
After disabling load balancing for the VM instances, the user load is handled correctly.
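The same rollout can be driven from Terraform instead of the console: change the image tag in the spec and re-apply. A sketch, assuming the spec in app/todo-service.tf references the image with its todo-demo:v1 tag (check the file first; the exact reference depends on the repository):

```bash
cd app

# Point the instance template at the v2 image (GNU sed syntax).
sed -i 's|todo-demo:v1|todo-demo:v2|' todo-service.tf

# Instance Groups rolls the change out gradually, as described above.
terraform apply -var yc_folder=<folder_id> -var yc_token=<yc_token> -var user=$USER
```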
Scaling a DB configuration
You may need to scale your DB if:
- Cluster host performance is insufficient to handle requests.
- The data requires more storage capacity.
To scale the DB configuration:
- In the [management console](https://console.yandex.cloud), select the folder with your DB cluster.
- From the list of services, select Managed Service for PostgreSQL.
- Select the todo-postgresql cluster.
- Click Edit cluster.
- Under Host class, select s2.medium.
- Click Save changes.
Managed Service for PostgreSQL will run the update command for the cluster.
When switching between the master and a replica (at the beginning and end of the update process), an Internal Server Error may occur (HTTP code 500
in the Quantities and HTTP codes charts in the Load Testing Tool monitoring app).
After switching, the user load is handled correctly.
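The same change can be made with the yc CLI instead of the console (a sketch; flag names can differ between CLI versions, so check yc managed-postgresql cluster update --help first):

```bash
# Change the host class of the cluster; this triggers the same
# master/replica switchover described above.
yc managed-postgresql cluster update todo-postgresql --resource-preset s2.medium
```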
Deleting applications and environments
Warning
If you created your VM with the Load Testing Tool, be sure to delete the VM first, otherwise deleting the Virtual Private Cloud will fail.
To delete the Load Testing Tool app, go to the load-testing-tool directory and run the command:
terraform destroy -var yc_folder=<folder_id> -var yc_token=<yc_token> -var user=$USER -var overload_token=not-used
To delete the TodoList app, go to the app
directory and run the following command:
terraform destroy -var yc_folder=<folder_id> -var yc_token=<yc_token> -var user=$USER
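To confirm that nothing is left running and incurring charges, you can list the key resources afterwards (assuming the yc CLI):

```bash
# All lists should be empty once both destroy commands complete.
yc compute instance-group list
yc managed-postgresql cluster list
yc vpc network list
```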