Recommendations on fault tolerance in Yandex Cloud
Date: 2025-03-28
- Introduction
- Resource placement
- Tools for ensuring fault tolerance
- Load balancers (Network Load Balancer, Application Load Balancer)
- Fault tolerance of platform services
- Autoscaling tools
- Fault tolerance of client services
- How to shift load from an availability zone
- Monitoring and escalation
- Action plan
- Testing fault tolerance
- See also
Fault tolerance is the capability of a system to continue its operation in case of any fault in one or multiple components.
Faults can be either total or partial. A partial fault is an intermediate state between full operability and a total fault: the system loses part, but not all, of its capability to perform its functions. Example: a 50% loss of network packets in transit over communication channels is a partial fault.
Below are recommendations on designing a fault-tolerant infrastructure in Yandex Cloud.
Introduction
Yandex Cloud infrastructure is divided into regions and availability zones. An availability zone is an isolated part of the infrastructure that is protected from faults in other zones. Availability zones are geographically separated and located around 300 km away from each other.
Currently, the following regions are available in Yandex Cloud:
- Russia (ru-central1): the ru-central1-a, ru-central1-b, and ru-central1-d availability zones.
- Kazakhstan (kz1): the kz1-a availability zone.
A region provides direct network (IP) connectivity across availability zones, common APIs, SLAs, and unified pricing for cloud services.
This document provides recovery recommendations for the following fault types:
- Short (hours-long) total or partial outage of one availability zone.
- Short partial fault of API services.
To build fault-tolerant services in Yandex Cloud, factor in the platform’s architectural specifications: availability zones and nuances of cloud tools for building fault-tolerant systems.
Resource placement
To ensure fault tolerance, deploy your services in multiple availability zones. The best option is three-zone placement, since it allows quorum algorithms to operate; these are used, for example, in managed databases (MDBs) and Managed Service for Kubernetes (mK8s).
Warning
High-performance environments with demanding requirements for fault recovery speed must be placed in exactly three availability zones.
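The reason quorum algorithms favor three zones can be checked with a little arithmetic. The sketch below is only an illustration and assumes a simple majority quorum with one vote per node; the function names are hypothetical and not part of any Yandex Cloud API.

```python
def quorum_size(nodes: int) -> int:
    """Smallest majority among `nodes` voting members."""
    return nodes // 2 + 1


def survives_zone_loss(nodes_per_zone: list[int]) -> bool:
    """True if a majority still remains after the largest zone is lost."""
    total = sum(nodes_per_zone)
    remaining = total - max(nodes_per_zone)
    return remaining >= quorum_size(total)


if __name__ == "__main__":
    # Two-zone layouts lose the majority in the worst case; three zones keep it.
    for layout in ([1, 1], [2, 1], [1, 1, 1]):
        print(layout, survives_zone_loss(layout))
```

Under this assumption, no two-zone layout keeps a majority when its larger zone fails, while one node in each of three zones does.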
Reservation schemes
Depending on requirements for fault recovery time, two basic schemes are available:
- Cold reserve (active-passive):
  - The bulk of the load is processed in one availability zone.
  - A minimal set of resources, e.g., DB replicas, is placed in another zone for a quick start in case of a main zone fault.
  - This scheme is simple and less costly, since it does not require continuous duplication of all resources.
  - It does not guarantee continuous availability of the service, because switching to the standby infrastructure takes a long time.
- Load balancing (active-active):
  - The load is distributed among multiple zones (see the scheme below).
  - This scheme requires adapting the software and minimizing network delays across the zones (e.g., by localizing traffic).
  - The scheme provides high fault tolerance and minimum recovery time.
Delay minimization is one of the subjects in the Quest for microseconds: Optimizing cloud service performance webinar.
Fault handling
When deploying fault-tolerant services, we recommend using load balancers. In Yandex Cloud, load balancers are crucial to reducing or eliminating fault impact on service performance. Load balancers consist of two central parts:
- Listener: Element that receives traffic and distributes it among the target resources.
- Target resources: Groups of resources receiving traffic from the listener.
For both reservation schemes described above, the only way to minimize fault handling time is an automatic mechanism that tracks the state of target resources and routes user requests from the listener only to the targets that are ready to handle them. Readiness is determined by a health check. The hardest part of configuring it is choosing appropriate check parameters and implementing the readiness check correctly on the target resource side.
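To illustrate why the check parameters matter, here is a minimal sketch of how consecutive probe results could be turned into a target status. It is a simplified model, not the balancer's actual implementation: the class, the field names, and the default thresholds are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TargetHealth:
    """Tracks one target's status from consecutive probe results."""
    healthy_threshold: int = 2    # successes in a row needed to mark the target healthy
    unhealthy_threshold: int = 2  # failures in a row needed to mark it unhealthy
    healthy: bool = False         # targets start out of rotation
    _streak: int = 0              # consecutive results that contradict `healthy`

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the resulting status."""
        if probe_ok == self.healthy:
            self._streak = 0      # result agrees with the current status
            return self.healthy
        self._streak += 1
        limit = self.healthy_threshold if probe_ok else self.unhealthy_threshold
        if self._streak >= limit:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy
```

In this model, low thresholds react to a fault quickly but flap on a single lost probe, while high thresholds are stable but keep sending traffic to a failed target for longer.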
Also, note that the automatic mechanism of availability checks may not be triggered in case of partial faults in the availability zone. To recover a service in case of such faults, you must provide a mechanism for manual redistribution of load from the failed zone to healthy ones.
To minimize fault handling time, especially in case of API faults, it is essential to make sure each zone has enough computing resources. This will allow you to use the capacity of the remaining healthy zones for handling the estimated load if an availability zone fails. We recommend keeping at least a 50% reserve above the estimated load in each zone (see the diagram below).
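The 50% figure can be reproduced with simple arithmetic. The sketch below assumes that the load is spread evenly across zones and that exactly one zone fails; the function names and the sample load of 900 units are illustrative only.

```python
def required_reserve(zones: int) -> float:
    """Extra capacity per zone, as a fraction of its normal load,
    needed for the remaining zones to absorb the loss of one zone."""
    if zones < 2:
        raise ValueError("at least two zones are needed to survive a zone loss")
    return 1 / (zones - 1)


def per_zone_capacity(total_load: float, zones: int) -> float:
    """Capacity each zone needs so that the surviving zones carry `total_load`."""
    return total_load / zones * (1 + required_reserve(zones))


if __name__ == "__main__":
    print(required_reserve(3))        # reserve for three zones
    print(per_zone_capacity(900, 3))  # capacity per zone for a load of 900 units
```

Under these assumptions, three zones need a 50% reserve each, while a two-zone deployment needs 100%, since the single surviving zone has to carry the entire load.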
Tools for ensuring fault tolerance
- Load balancers: Network Load Balancer and Application Load Balancer
- Platform services:
  - Managed databases (MDBs)
  - Managed Service for Kubernetes in a fault-tolerant configuration
- Natively fault-tolerant cloud services
- Autoscaling tools
Load balancers (Network Load Balancer, Application Load Balancer)
Network Load Balancer
The central tool for building fault-tolerant solutions in Yandex Cloud is a network load balancer (Network Load Balancer) that distributes TCP connections among target resources. It can be external, for processing traffic from the internet (listener with a public IP address), or internal, for processing internal network traffic (listener with a private IP address). Health checks are used to check the readiness of target resources. Currently, Network Load Balancer does not support disabling traffic in a specific zone.
Here is an example of creating a fault-tolerant website with load balancing using Network Load Balancer between two availability zones with fault protection in one zone.
Application Load Balancer
Application Load Balancer is a smarter yet more costly balancing tool. It supports integration with protection services, such as Smart Web Security, ARL, WAF, and SmartCaptcha. Application Load Balancer allows disabling resources in a specific zone and localizing traffic inside an availability zone.
Here is an example of creating a fault-tolerant website with load balancing using Application Load Balancer between three availability zones with fault protection in one zone.
Fault tolerance of platform services
Placement of platform service hosts in different availability zones is the key method of achieving fault tolerance.
Fault tolerance of managed databases (MDBs)
Under the SLA, a fault-tolerant configuration is a DB cluster consisting of two or more DB hosts located in different availability zones. The best practice is to place DB cluster nodes in three availability zones, since fault tolerance relies on systems based on quorum algorithms.
Warning
High-performance environments with demanding requirements for fault recovery speed must be placed in exactly three availability zones.
If the DB master fails, the service automatically initiates switching to another host. In some cases, however, the automatic mechanism cannot initiate master switching during a fault, and you must switch manually, e.g., using the yc command. Here is an example for a PostgreSQL cluster:
yc managed-postgresql cluster start-failover <cluster_name> --host <host_name>
To allow a client to connect to the current DB master anytime without requesting the cluster state from the API, Yandex Cloud provides special FQDNs. Connecting over a special FQDN simplifies application code but does not guarantee quick switching to a new master when the master is replaced. To switch to a new master quickly, the application itself must track master replacement and reconnect.
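One way to handle this on the application side is to verify the connection's role before using it and reconnect when it no longer points to the master. The sketch below is driver-agnostic and makes assumptions: `connect` and `is_master` are hypothetical callables supplied by the application (for example, `connect` opens a connection via the special FQDN and `is_master` runs a role query on it).

```python
from typing import Callable, Optional


class MasterSession:
    """Keeps a connection that is re-opened when it no longer points to the master."""

    def __init__(self, connect: Callable[[], object],
                 is_master: Callable[[object], bool]):
        self._connect = connect        # hypothetical: opens a new connection
        self._is_master = is_master    # hypothetical: checks the connection's role
        self._conn: Optional[object] = None

    def connection(self) -> object:
        """Return a connection to the current master, reconnecting if needed."""
        if self._conn is None or not self._probe(self._conn):
            self._conn = self._connect()
        return self._conn

    def _probe(self, conn: object) -> bool:
        try:
            return self._is_master(conn)
        except ConnectionError:
            return False               # a broken connection is treated as stale
```

The role check adds a round trip per use, so in practice it might run periodically or only after an error; the trade-off depends on how quickly the application must notice a switch.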
Currently, Yandex Cloud does not have a service automatically balancing reading load between DB cluster nodes. Methods of such balancing are discussed in the Quest for microseconds: Optimizing cloud service performance webinar.
Managed Service for Kubernetes fault tolerance
Under the SLA, Kubernetes clusters are only deemed fault-tolerant if they use a master with fault tolerance settings in three availability zones (one host per zone).
To build a fault-tolerant infrastructure, beyond the cluster itself, you need to:
- Place cluster worker nodes in multiple availability zones.
- Distribute maintenance services, such as CoreDNS, and application services among multiple availability zones.
To minimize the impact of cluster node faults, ensure even load distribution. We recommend the following tools:
- topologySpreadConstraints: To distribute pods among availability zones.
- podAntiAffinity: To prevent placement of pods on a single node.
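As an illustration, the fragment below builds the corresponding part of a pod spec as a Python dictionary and prints it as JSON, which Kubernetes tooling accepts alongside YAML. The field names are standard Kubernetes pod spec fields; the `app: my-app` label and the choice of `maxSkew: 1` with `DoNotSchedule` are assumptions made for the example.

```python
import json

APP_LABELS = {"app": "my-app"}  # hypothetical label that selects the service's pods

pod_spec_fragment = {
    # Spread pods across zones so that zone pod counts differ by at most one.
    "topologySpreadConstraints": [
        {
            "maxSkew": 1,
            "topologyKey": "topology.kubernetes.io/zone",
            "whenUnsatisfiable": "DoNotSchedule",
            "labelSelector": {"matchLabels": APP_LABELS},
        }
    ],
    # Never schedule two pods of the same service on one node.
    "affinity": {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": APP_LABELS},
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    },
}

if __name__ == "__main__":
    print(json.dumps(pod_spec_fragment, indent=2))
```

A strict setting such as `DoNotSchedule` can leave pods pending when a zone is down; whether that or the softer `ScheduleAnyway` is appropriate depends on the workload.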
To reduce service downtime during cluster updates, you need to set podDisruptionBudget policies.
Autoscaling tools
If one of the availability zones fails, you need to redistribute the load among the other zones. If you are using cold reserve (active-passive) fault tolerance, you can reduce recovery time by ensuring resource autoscaling.
The central scaling tool in Yandex Cloud is an instance group. An instance group includes:
- Instance template
- Scaling policy (manual or automatic)
- Scaling mechanism
Here is an example of deploying an instance group with an autoscaling policy for managing extra load.
For autoscaling, you can use any parameter from Yandex Cloud Monitoring in addition to the basic parameter (CPU load).
Recommendations on ensuring tolerance against zone faults:
- Use a separate instance group for each availability zone. Avoid using the same instance group for creating instances in various availability zones, since it can complicate managing them in case one of the zones fails.
- Autoscaling of Kubernetes cluster node groups is also based on instance groups, so apply the same approach to them.
Warning
Kubernetes cluster node groups cannot use the entire functionality of instance groups.
Note
When designing a fault-tolerant cloud infrastructure, note that in case one of the availability zones fails, the available resources in other zones will be depleted much faster.
Fault tolerance of client services
To ensure fault tolerance and quick fault handling in Managed Service for Kubernetes applications:
- Allocate enough resources (CPU, RAM) for the service.
- Minimize or eliminate oversubscription of resources, especially RAM, on the worker nodes of a Managed Service for Kubernetes cluster.
- Set up correct health checks.
- Apply a retry policy to the provider’s services.
- Set up autoscaling of cluster worker nodes for automatic resource redistribution in case of an unexpected load increase or fault in one of the availability zones.
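A retry policy from the list above might look like the following sketch, which uses exponential backoff with random jitter so that many clients do not retry at the same moment. The function names, default values, and the choice to retry on any exception are assumptions for the example; a real policy would normally retry only on errors known to be transient.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, base_s: float = 0.5, cap_s: float = 30.0) -> list[float]:
    """Upper bounds for the wait before each retry: base * 2**n, capped at cap_s."""
    return [min(cap_s, base_s * 2 ** n) for n in range(retries)]


def call_with_retries(call: Callable[[], T], attempts: int = 5,
                      base_s: float = 0.5, cap_s: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `call`, retrying failed attempts after a randomized, growing delay."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts - 1, base_s, cap_s)
    for n in range(attempts):
        try:
            return call()
        except Exception:
            if n == attempts - 1:
                raise                  # out of attempts: surface the last error
            sleep(random.uniform(0, delays[n]))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so that the waiting can be replaced in tests.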
How to shift load from an availability zone
Application Load Balancer supports manual disabling of traffic in a specific zone.
For Network Load Balancer, the only way to remove traffic from an availability zone is to prevent health checks from succeeding for the target resources in the faulty zone. There are several ways to do this:
- At the infrastructure level, block checks at the network security group level.
- Disable instances that handle requests in the faulty zone.
- At the operating system level, restrict access to checks using a firewall.
- At the application level, configure the application in such a way that it would not respond to health checks.
We recommend using network security groups. To do this, set up a separate rule for each availability zone that allows health checks to reach the target resources in that zone. Deleting a rule disables traffic in the corresponding zone. This type of configuration also lets you use network security groups for testing fault tolerance.
You should consider the other methods in case the Yandex Cloud API is unavailable.
Monitoring and escalation
Monitoring and alerts are key tools for ensuring fault tolerance. Beyond the basic monitoring tools provided with the cloud services, it is crucial to configure monitoring of business metrics. For example, tracking the number of service users over the last few minutes lets you detect problems at a high level, even when their source in the infrastructure is not monitored.
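A business-metric check of this kind can be very simple. The sketch below flags a sharp drop in users per minute relative to the preceding minutes; the function name, the five-minute window, and the 50% threshold are arbitrary assumptions that would need tuning for a real service.

```python
from statistics import mean


def users_dropped(per_minute: list[int], window: int = 5, threshold: float = 0.5) -> bool:
    """True if the latest minute is below `threshold` times the mean
    of the `window` minutes that precede it."""
    if len(per_minute) < window + 1:
        return False                  # not enough history to judge
    baseline = mean(per_minute[-window - 1:-1])
    return per_minute[-1] < threshold * baseline


if __name__ == "__main__":
    print(users_dropped([100, 98, 103, 101, 99, 40]))  # sharp drop in the last minute
    print(users_dropped([100, 98, 103, 101, 99, 95]))  # normal fluctuation
```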
For quick notification about problems, you need to configure an escalation policy, which is currently at the preview stage, in addition to monitoring.
Action plan
To quickly restore a service and address faults, you need to have action plans ready in advance, such as switching the DB master manually or disabling an availability zone.
Testing fault tolerance
Any fault tolerance solution requires regular testing in various fault scenarios. Learn more about testing fault tolerance in a cloud from this webinar: Disabling a data center, or How to test fault tolerance in a cloud.