Recommendations on fault tolerance in Yandex Cloud
- Introduction
- Resource placement
- Tools for ensuring fault tolerance
- Load balancers (Network Load Balancer, Application Load Balancer)
- Fault tolerance of platform services
- Autoscaling tools
- Fault tolerance of client services
- How to shift load from an availability zone
- Monitoring and escalation
- Action plan
- Testing fault tolerance
- See also
Fault tolerance is the ability of a system to continue operating when one or more of its components fail.
Faults can be either total or partial. A partial fault is an intermediate state between full operability and a total fault: the system loses part, but not all, of its capacity to perform its functions. For example, losing 50% of network packets in transit over a communication link is a partial fault.
Below are recommendations on designing a fault-tolerant infrastructure in Yandex Cloud.
Introduction
The Yandex Cloud infrastructure is divided into regions and availability zones. An availability zone is an isolated part of the infrastructure that is protected from faults in other zones. Availability zones are geographically separated and located about 300 km apart.
Currently, the following regions are available in Yandex Cloud:
- Russia (ru-central1): the ru-central1-a, ru-central1-b, and ru-central1-d availability zones.
- Kazakhstan (kz1): the kz1-a availability zone.
A region provides direct network (IP) connectivity across availability zones, common APIs, SLAs, and unified pricing for cloud services.
This document provides recovery recommendations for the following fault types:
- Short (hours-long) total or partial outage of one availability zone.
- Short partial fault of API services.
To build fault-tolerant services in Yandex Cloud, factor in the platform's architecture: the layout of availability zones and the specifics of the cloud tools for building fault-tolerant systems.
Resource placement
To ensure fault tolerance, you must deploy your services in multiple availability zones. The best option is three-zone placement, as it allows the quorum-based algorithms used, for example, in managed databases (MDBs) and Managed Service for Kubernetes (mK8s) to operate correctly.
Warning
High-performance environments with demanding requirements for fault recovery speed must be placed in exactly three availability zones.
Reservation schemes
Depending on fault recovery time requirements, two basic schemes are available:
- Cold reserve (active-passive):
  - The bulk of the load is processed in one availability zone.
  - A minimal set of resources, e.g., DB replicas, is kept in another zone for a quick start if the main zone fails.
  - This scheme is simple and less costly, since it does not require continuous duplication of all resources.
  - It does not guarantee continuous service availability, because switching to the standby infrastructure takes a long time.
- Load balancing (active-active):
  - The load is distributed among multiple zones (see the diagram below).
  - This scheme requires adapting the software and minimizing network latency between the zones (e.g., locality aware routing).
  - It provides high fault tolerance and minimal recovery time.
Latency minimization is one of the topics covered in the Quest for microseconds: Optimizing cloud service performance webinar.
Fault handling
When deploying fault-tolerant services, we recommend using load balancers. In Yandex Cloud, load balancers are crucial to reducing or eliminating the impact of faults on service performance. A load balancer consists of two main parts:
- Listener: Element that receives traffic and distributes it among the targets.
- Targets: Groups of resources receiving traffic from the listener.
For both reservation schemes described above, the only way to minimize failover time automatically is to track the state of the targets and have the listener redirect user requests only to those targets that are ready to handle them. Whether a target is ready to accept requests is determined by a health check. The most difficult part of configuring health checks is selecting appropriate check parameters and correctly implementing the readiness checks on the target side.
Also, note that the automatic mechanism of availability checks may not be triggered in case of partial faults in the availability zone. To recover a service in case of such faults, you must provide a mechanism for manual redistribution of load from the failed zone to healthy ones.
To minimize fault handling time, especially in case of API faults, make sure each zone has enough computing resources. If one availability zone fails, this allows the remaining operational zones to support the estimated load. We recommend keeping at least a 50% reserve above the estimated load in each zone (see the diagram below).
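As a rough illustration of the 50% reserve rule, here is a sketch of the capacity arithmetic for a hypothetical three-zone deployment (all numbers are made up for illustration):
TOTAL_VCPUS=90                                          # estimated peak load for the whole service
ZONES=3
PER_ZONE=$(( TOTAL_VCPUS / ZONES ))                     # 30 vCPUs per zone under normal operation
PER_ZONE_RESERVED=$(( PER_ZONE * 3 / 2 ))               # 45 vCPUs per zone with a 50% reserve
AFTER_ZONE_LOSS=$(( PER_ZONE_RESERVED * (ZONES - 1) ))  # 90 vCPUs left if one zone fails, enough for the full load
echo "Provision ${PER_ZONE_RESERVED} vCPUs per zone; ${AFTER_ZONE_LOSS} vCPUs remain after losing a zone"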
Tools for ensuring fault tolerance
- Load balancers (Network Load Balancer, Application Load Balancer)
- Platform services:
  - Managed databases (MDBs)
  - Managed Service for Kubernetes in a fault-tolerant configuration
- Natively fault-tolerant cloud services
- Autoscaling tools
Load balancers (Network Load Balancer, Application Load Balancer)
Network Load Balancer
The central tool for building fault-tolerant solutions in Yandex Cloud is a network load balancer (Network Load Balancer), which distributes TCP connections among targets. It can be either external, for processing traffic from the internet (listener with a public IP address), or internal, for processing internal network traffic (listener with a private IP address). Targets are checked for readiness using health checks. Currently, Network Load Balancer does not support disabling traffic in a specific zone.
We recommend checking the targets for readiness frequently enough, with an interval of under three seconds. The health check trigger thresholds must be strictly greater than 1. To avoid extra load on the targets, generating a health check response should not be resource-intensive. An example of poor practice is requesting the website's root page as a health check; a good practice is to use a dedicated URI that verifies connectivity to the dependencies of interest (e.g., databases) and overall operability.
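As an illustration, here is a sketch of attaching a target group to a Network Load Balancer with an HTTP health check that follows these recommendations (a dedicated /healthz URI, a short interval, thresholds above 1). The healthcheck-interval and threshold property names are assumptions; verify them with yc load-balancer network-load-balancer attach-target-group --help before use:
# Attach a target group with a lightweight HTTP health check on a dedicated URI.
# The interval and threshold property names are assumptions; check the CLI help.
yc load-balancer network-load-balancer attach-target-group <balancer_name> \
  --target-group target-group-id=<target_group_id>,\
healthcheck-name=app-health,\
healthcheck-http-port=80,\
healthcheck-http-path=/healthz,\
healthcheck-interval=2s,\
healthcheck-healthy-threshold=2,\
healthcheck-unhealthy-threshold=2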
Here is an example of creating a fault-tolerant website with load balancing by Network Load Balancer between two availability zones, with protection against a fault in one zone.
Application Load Balancer
Application Load Balancer is a smarter yet more costly balancing tool. At the architecture level, the service is a Network Load Balancer that distributes network traffic among resource units, i.e., internal VMs acting as reverse proxies.
To handle situations with partial availability zone faults, with Application Load Balancer, you can manually stop delivering customer traffic to the compromised zone.
To reduce system recovery time, we recommend keeping resource units in all your availability zones and reserving enough resource units in each one. This gives you spare capacity if one of the zones goes down. You can use the --min-zone-size parameter to specify the minimum number of resource units per availability zone.
For reliable operation of the fault tolerance mechanisms, Application Load Balancer has to be supplied with information about the availability zone the targets are in. If you use integration with instance groups, this is done automatically; otherwise, you must specify the ID of the subnet the target resides in. For more information, see this guide.
To minimize delays during request processing, use locality aware routing: a request arriving at a resource unit in an availability zone should be processed in that same zone. To do this, set the locality_aware_routing_percent parameter of the backend group to 100. This prioritizes traffic delivery within the current availability zone while still allowing requests to be sent to other availability zones if no healthy targets are available locally. We do not recommend enabling Strict localization, as it stops processing requests that arrive in an availability zone with no available targets.
The recommendations for Application Load Balancer target availability checks are the same as for Network Load Balancer.
You can make Application Load Balancer more resilient to faults caused by malicious activity by connecting web application protection services to it, such as Smart Web Security, ARL, WAF, and SmartCaptcha.
Here is an example of creating a fault-tolerant website with load balancing by Application Load Balancer across three availability zones, with protection against a fault in one zone.
Fault tolerance of platform services
Placement of platform service hosts in different availability zones is the key method of achieving fault tolerance.
Fault tolerance of managed databases (MDBs)
Under the SLA, a fault-tolerant configuration is a DB cluster consisting of two or more DB hosts located in different availability zones. The best practice is to place DB cluster nodes in three availability zones, since fault tolerance is ensured by quorum-based algorithms.
Warning
High-performance environments with demanding requirements for fault recovery speed must be placed in exactly three availability zones.
If the DB master fails, the service automatically initiates a switchover to another host. In certain cases, the automatic mechanism of the DB service cannot initiate the master switchover during a fault. In such cases, you must switch over manually, e.g., using the yc CLI. Here is an example for a PostgreSQL cluster:
yc managed-postgresql cluster start-failover <cluster_name> --host <host_name>
To let a client connect to the current DB master at any time without requesting the cluster state from the API, Yandex Cloud provides special FQDNs. Connecting over a special FQDN simplifies application code but does not guarantee a quick switch to a new master when it changes. To switch quickly, the application must detect the master change and reconnect.
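For illustration, here is how a client might connect to the current master of a Managed Service for PostgreSQL cluster via a special FQDN. The c-<cluster_id>.rw.mdb.yandexcloud.net name format, port 6432, and the presence of the Yandex Cloud CA certificate on the client are assumptions to verify against the service documentation:
# Connect to whichever host currently acts as the master via the special FQDN.
# Replace <cluster_id>, <db_name>, and <user> with your values; assumes the CA certificate is installed.
psql "host=c-<cluster_id>.rw.mdb.yandexcloud.net \
      port=6432 \
      sslmode=verify-full \
      dbname=<db_name> \
      user=<user> \
      target_session_attrs=read-write"   # libpq refuses to use a read-only host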
Currently, Yandex Cloud does not have a service that automatically balances read load between DB cluster nodes. Approaches to such balancing are covered in the Quest for microseconds: Optimizing cloud service performance webinar.
Managed Service for Kubernetes fault tolerance
Under the SLA, a fault-tolerant configuration is a cluster with a master deployed with fault tolerance settings across three availability zones (one host per zone).
To build a fault-tolerant infrastructure, beyond the cluster itself, you need to:
- Place cluster worker nodes in multiple availability zones.
- Distribute system services, such as CoreDNS, and application services across multiple availability zones.
To minimize the impact of cluster node faults, you need to ensure even load distribution. To do this, we recommend using tools such as:
- topologySpreadConstraints: To distribute pods across availability zones.
- podAntiAffinity: To prevent placing multiple pods of the same application on a single node.
To reduce service downtime during cluster updates, set up podDisruptionBudget policies (see the combined sketch below).
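Here is a minimal sketch combining all three mechanisms for a hypothetical Deployment (the application name, image, and replica counts are illustrative):
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      # Spread pods evenly across availability zones.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: demo-app
      # Avoid placing two pods of the application on the same node.
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: demo-app
      containers:
        - name: app
          image: registry.example.com/demo-app:latest
---
# Keep at least 4 of the 6 replicas available during voluntary disruptions,
# such as node drains triggered by cluster updates.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: demo-app-pdb
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: demo-app
EOF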
Autoscaling tools
If one of the availability zones fails, you need to redistribute the load among the remaining zones. If you use the cold reserve (active-passive) scheme, you can reduce recovery time by setting up resource autoscaling.
The central scaling tool in Yandex Cloud is an instance group. An instance group includes:
- Instance template
- Scaling policy (manual or automatic)
- Scaling mechanism
Here is an example of deploying an instance group with an autoscaling policy for managing extra load.
For autoscaling, you can use any Yandex Cloud Monitoring metric in addition to the basic one (CPU utilization).
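Below is a sketch of a single-zone instance group with CPU-based autoscaling, created from a YAML specification file. The field names follow the instance group specification format as commonly documented, but treat them as assumptions and check the current specification reference before use:
# Create a per-zone instance group with autoscaling from a YAML specification (field names are assumptions).
cat > ig-zone-a.yaml <<'EOF'
name: app-ig-ru-central1-a
service_account_id: <service_account_id>
instance_template:
  platform_id: standard-v3
  resources_spec:
    cores: 2
    memory: 4g
  boot_disk_spec:
    mode: READ_WRITE
    disk_spec:
      image_id: <image_id>
      type_id: network-ssd
      size: 32g
  network_interface_specs:
    - network_id: <network_id>
      subnet_ids:
        - <subnet_id_in_ru_central1_a>
      primary_v4_address_spec: {}
scale_policy:
  auto_scale:
    initial_size: 2
    min_zone_size: 2          # keep a baseline number of instances even at low load
    max_size: 6
    measurement_duration: 60s
    cpu_utilization_rule:
      utilization_target: 60  # scale out when average CPU utilization exceeds 60%
deploy_policy:
  max_unavailable: 1
allocation_policy:
  zones:
    - zone_id: ru-central1-a  # one instance group per availability zone
EOF
yc compute instance-group create --file ig-zone-a.yaml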
Recommendations on ensuring tolerance against zone faults:
- Use a separate instance group for each availability zone. Avoid using the same instance group to create instances in different availability zones: it can complicate their management should one of the zones fail.
- Kubernetes cluster node group autoscaling is also based on the instance group mechanics.
Warning
Kubernetes cluster node groups do not support the full functionality of instance groups.
Note
When designing a fault-tolerant cloud infrastructure, keep in mind that if one of the availability zones fails, the free resources in the remaining zones will be consumed much faster.
Fault tolerance of client services
To ensure fault tolerance and quick fault handling in Managed Service for Kubernetes applications:
- Allocate enough resources (CPU, RAM) for the service.
- Minimize or eliminate overcommitment of resources, especially RAM, on the worker nodes of a Managed Service for Kubernetes cluster.
- Set up correct health checks (see the sketch after this list).
- Apply a retry policy when calling the provider's services.
- Set up autoscaling of cluster worker nodes for automatic resource redistribution in case of an unexpected load increase or fault in one of the availability zones.
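Here is a minimal sketch of what this looks like for a hypothetical Deployment: explicit resource requests equal to limits (no overcommitment) and a dedicated, lightweight readiness endpoint (the /healthz path, port, and sizes are illustrative):
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-api
  template:
    metadata:
      labels:
        app: demo-api
    spec:
      containers:
        - name: api
          image: registry.example.com/demo-api:latest
          # Requests equal to limits: no resource overcommitment on the node.
          resources:
            requests:
              cpu: "500m"
              memory: 512Mi
            limits:
              cpu: "500m"
              memory: 512Mi
          # Readiness check on a dedicated, lightweight endpoint.
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 3
            failureThreshold: 2
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 10
            failureThreshold: 3
EOF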
How to shift load from an availability zone
Application Load Balancer supports manual disabling of traffic in a specific zone.
For Network Load Balancer, you can only remove traffic from an availability zone by disabling health checks for targets in the faulty zone. There are several ways to do this:
- At the infrastructure level, block health checks using network security group rules.
- Disable the instances that handle requests in the faulty zone.
- At the operating system level, restrict access to health checks using a firewall.
- At the application level, configure the application so that it stops responding to health checks.
We recommend using network security groups. To do this, configure a separate rule allowing health checks to reach the targets in each availability zone. Deleting such a rule disables traffic in the corresponding zone. This configuration also lets you use network security groups for fault tolerance testing.
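As a sketch, taking a zone out of rotation via security groups could look like this. The health check source ranges and the rule/flag syntax below are assumptions to verify against the current Network Load Balancer and VPC documentation:
# Per-zone rule that lets load balancer health checks reach the targets in zone A
# (source ranges and flag syntax are assumptions; check the documentation).
yc vpc security-group update-rules <security_group_for_zone_a> \
  --add-rule "direction=ingress,port=80,protocol=tcp,v4-cidrs=[198.18.235.0/24,198.18.248.0/24]"

# To shift load away from zone A, look up the rule ID and delete it:
yc vpc security-group get <security_group_for_zone_a>
yc vpc security-group update-rules <security_group_for_zone_a> \
  --delete-rule-id <health_check_rule_id>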
You should consider the other methods in case the Yandex Cloud API is unavailable.
Application high availability testing
To analyze an application for fault tolerance, i.e., its ability to keep processing traffic if one of its availability zones fails, you can use this ready-made scenario.
You can test your web applications using this technique, if required.
NLB tagging to disconnect an availability zone
We are introducing the NLB Zone Shift mechanism to better respond to partial failure incidents.
After an application is successfully tested for fault tolerance, you can tag the relevant NLB with a special flag. This flag allows Yandex Cloud support to divert traffic from the load balancer away from an availability zone experiencing a partial failure that is not captured by regular target health checks, e.g., issues with external communication links.
To tag an NLB with a zone shift flag, run this YC CLI command:
yc load-balancer network-load-balancer update <nlb-id> --allow-zonal-shift
Monitoring and escalation
Monitoring and alerts are key tools for ensuring fault tolerance. Beyond the basic monitoring tools provided with the cloud services, it is crucial to configure monitoring of business metrics. For example, tracking the number of service users over the last few minutes lets you detect problems at a high level, even if their source is in a part of the infrastructure that is not monitored.
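For example, a business metric such as the number of active users can be pushed to Monitoring as a custom metric and then used in an alert. The endpoint path and payload shape below are assumptions to verify against the Monitoring API reference:
# Push a custom business metric (active users over the last minute) to Monitoring.
# The endpoint and payload shape are assumptions; check the Monitoring API reference.
curl -s -X POST \
  -H "Authorization: Bearer $(yc iam create-token)" \
  -H "Content-Type: application/json" \
  "https://monitoring.api.cloud.yandex.net/monitoring/v2/data/write?folderId=<folder_id>&service=custom" \
  -d '{
        "metrics": [
          {
            "name": "active_users_last_minute",
            "labels": {"service": "my-app"},
            "type": "DGAUGE",
            "value": 1250
          }
        ]
      }'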
For quick issue reporting, in addition to monitoring, you should configure an escalation policy (currently at the Preview stage).
Action plan
To quickly restore a service and address faults, you need to have action plans ready in advance, such as switching the DB master manually or disabling an availability zone.
Testing fault tolerance
Any fault tolerance solutions require regular testing in various fault scenarios. Learn more about testing fault tolerance in a cloud from this webinar: Disabling a data center, or How to test fault tolerance in a cloud.