Fault tolerance testing in the Yandex Cloud infrastructure based on Yandex Network Load Balancer
This guide covers the practical side of the fault tolerance testing routine outlined in Recommendations on fault tolerance in Yandex Cloud, as applied to a Yandex Cloud infrastructure based on Yandex Network Load Balancer. It assumes that your infrastructure follows the principles discussed in that article.
Goals of testing
This guide describes a cloud availability zone failure exercise methodology allowing you to:
- Study the system's behavior during failure.
- Evaluate the system’s fault tolerance when one of the availability zones fails.
- Identify hidden dependencies and vulnerabilities.
- Collect information on the symptoms of an outage.
- Check the system's ability to recover quickly.
This guide covers only the complete failure of an availability zone. Partial failures fall outside its scope due to their diversity.
Pre-test preparation
Test environment
- Alignment with production environment:
  Warning
  We do not recommend using your production environment for testing; run the exercise in a test environment first.
  - We recommend making your test environment as similar as possible to the production environment in terms of configuration.
  - The test load should resemble the production workload. You can use load testing tools to simulate the production load, e.g., Yandex Load Testing.
  - We recommend using Infrastructure as Code to automate the setup of test environments.
- Follow these best practices to optimize costs when deploying resources in the test environment:
  - Use NRD disks instead of SSD-IO.
  - Use preemptible VMs.
  - Create your resources dynamically only for the duration of the test.
  - Free up resources automatically after the tests are over.
  - Use components without SLA to reduce costs.
Testing recommendations
- Use a monitoring system to assess test results.
- Save your test results for retrospective analysis.
- Perform testing on a regular basis.
- Use Yandex Cloud CLI 0.154.0 or higher for testing.
Testing tools
This guide describes fault tolerance tests implemented using tools that disable load balancing in a particular availability zone for Network Load Balancer.
We recommend using VPC security groups as an additional isolation tool for the disabled zone.
Important note: When using VPC security groups, consider the following specifics:
- Security groups support allow rules only, so you cannot add an explicit deny rule. To be able to block traffic, permit traffic between zones with a separate set of allow rules, then delete those rules when you need to block it.
- Deleting allow rules from a security group blocks new network connections but does not terminate existing ones.
Testing methodology
Preparation steps
- If required, prepare the environment for testing.
- Select the availability zone to disable, i.e., the zone to shift traffic away from, e.g., ru-central1-b.
- Determine the test duration. You can disable a load balancer zone either indefinitely or for a specified period from 1 minute to 72 hours, e.g., 30 minutes.
- Get the list of load balancers that will participate in the testing:
  yc load-balancer nlb list
Initiating the test
For each load balancer in the list, use the disable-zones command to stop traffic balancing to the selected availability zone.
To disable traffic balancing in the ru-central1-b availability zone for a specific load balancer for 30 minutes, run this command:
yc load-balancer nlb disable-zones <load_balancer_name_or_ID> \
--zones=ru-central1-b \
--duration 30m
Sample command output (note the disable_zone_statuses section):
id: enpmq********
...
disable_zone_statuses:
  - zone_id: ru-central1-b
    disabled_until: "2025-07-17T15:30:18.176751322Z"
You can use this command to disable several availability zones at once if you list them separated by commas.
If you run the command again, the blocking period will be reset to 30 minutes from the current time.
If you do not specify the --duration parameter in the command, traffic balancing to the selected zones will be blocked indefinitely.
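To check the disable_zone_statuses section from a script, you can extract the disabled_until value for a given zone from the command output. The sketch below is only an illustration: the function name zone_disabled_until is hypothetical, and it assumes the YAML layout shown in the sample output above, with zone_id as the first key of each list item. It prints nothing when the zone is not listed, and it does not cover how an indefinitely disabled zone is reported, since this guide does not show that output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read load balancer YAML (as printed by the CLI) on stdin
# and print the disabled_until timestamp for the zone passed as the first argument.
# Assumes zone_id is the first key of each item under disable_zone_statuses.
zone_disabled_until() {
  awk -v zone="$1" '
    /^disable_zone_statuses:/ { in_block = 1; next }
    # Any other top-level key ends the disable_zone_statuses block.
    in_block && /^[^ -]/ { in_block = 0 }
    # Remember which zone the current list item describes.
    in_block && $1 == "-" && $2 == "zone_id:" { current = $3; next }
    in_block && $1 == "disabled_until:" && current == zone {
      gsub(/"/, "", $2)   # strip the surrounding quotes
      print $2
      exit
    }
  '
}

# Example (requires a configured CLI):
#   yc load-balancer nlb get <load_balancer_name_or_ID> | zone_disabled_until ru-central1-b
```

An empty result for a zone you have just disabled is a signal to inspect the full output by hand rather than a definitive answer.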
Warning
The disable-zones command only disables traffic balancing to the selected availability zone and only for the specified load balancer. This command does not impact network traffic within the zone or between the availability zones in any other cloud services. If you need to block traffic on such a broad scale, you can use VPC security groups on the corresponding cloud resource network interfaces.
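Because the command affects one load balancer at a time, a test involving several load balancers needs a loop. The following is a minimal sketch, not a definitive implementation: the function name disable_zone_everywhere, the DRY_RUN variable, and the lb_ids.txt file are assumptions made for illustration. It reads load balancer IDs from standard input, one per line, and by default only prints the commands so you can review them before running anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: disable one zone on every load balancer ID read from stdin.
# Usage: disable_zone_everywhere <zone> <duration> < lb_ids.txt
# DRY_RUN defaults to 1, which prints the commands instead of executing them.
disable_zone_everywhere() {
  local zone="$1" duration="$2" lb_id
  while IFS= read -r lb_id; do
    [ -n "$lb_id" ] || continue   # skip blank lines
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "yc load-balancer nlb disable-zones $lb_id --zones=$zone --duration $duration"
    else
      # </dev/null keeps the CLI from consuming the ID list on stdin.
      yc load-balancer nlb disable-zones "$lb_id" \
        --zones="$zone" \
        --duration "$duration" </dev/null
    fi
  done
}

# Example: review the commands first, then run them for real.
#   disable_zone_everywhere ru-central1-b 30m < lb_ids.txt
#   DRY_RUN=0 disable_zone_everywhere ru-central1-b 30m < lb_ids.txt
```

Defaulting to a dry run is a deliberate choice for a fault injection script: an accidental invocation prints commands rather than shifting traffic.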
State assessment
- To get the zone blocking state for an individual load balancer, run the following command:
  yc load-balancer nlb get <load_balancer_name_or_ID>
  Sample command output:
  id: enpmq********
  folder_id: b1g********
  created_at: "2025-07-17T13:42:24Z"
  name: ...
  region_id: ru-central1
  status: ACTIVE
  type: EXTERNAL
  listeners:
    - name: ...
      ...
  attached_target_groups:
    - target_group_id: enpn********
      health_checks:
        - name: ...
          ...
  disable_zone_statuses:
    - zone_id: ru-central1-b
      disabled_until: "2025-07-17T15:30:18.176751322Z"
- Make sure traffic has stopped entering the selected zone. You can do this in the monitoring service by plotting total traffic on your virtual machines' interfaces grouped by availability zone.
  Currently, you cannot plot the zone-by-zone traffic distribution with a single query to the monitoring service. To get this done:
  1. Create a chart in the monitoring service.
  2. Create a list of VM IDs for the ru-central1-a zone, e.g., using this command:
     yc compute instance list --jq '[.[] | select(.zone_id=="ru-central1-a") | .id ] | join("|")'
     The command output will be a single-line list of VM IDs separated by |. For example:
     fhm**********uv5|fhm**********aab|fhm**********ui1|...
  3. Add a query to the monitoring chart:
     alias(series_sum("network_received_packets"{folderId = "b1g**********", service = "compute", resource_type = "vm", resource_id = "<delimiter-separated_list_of_VM_IDs_from_previous_step_|>"}), "ru-central1-a")
  4. Repeat steps 2 and 3 for zones ru-central1-b and ru-central1-d.
  5. Run the queries.
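Repeating the query step for every zone by hand is error-prone, so the query text can be generated instead. The sketch below assumes the query template shown in this section; the function name build_zone_query and the FOLDER_ID variable are hypothetical. It only formats the query string and does not talk to the monitoring service.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the monitoring query for one availability zone.
# Arguments: <zone> <folder_id> <VM IDs joined with "|">
build_zone_query() {
  local zone="$1" folder_id="$2" vm_ids="$3"
  printf 'alias(series_sum("network_received_packets"{folderId = "%s", service = "compute", resource_type = "vm", resource_id = "%s"}), "%s")\n' \
    "$folder_id" "$vm_ids" "$zone"
}

# Example (requires a configured CLI; FOLDER_ID is a placeholder you set yourself):
#   for zone in ru-central1-a ru-central1-b ru-central1-d; do
#     ids="$(yc compute instance list --jq "[.[] | select(.zone_id==\"$zone\") | .id ] | join(\"|\")")"
#     build_zone_query "$zone" "$FOLDER_ID" "$ids"
#   done
```

Each printed line can then be pasted into the chart as a separate query.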
Completing the test
- To resume traffic balancing in a previously disabled availability zone, run the enable-zones command:
  yc load-balancer nlb enable-zones <load_balancer_name_or_ID> \
    --zones=ru-central1-b
- Make sure that traffic has started flowing to the selected availability zone.
Note that after you re-enable balancing, you must wait two minutes before you can disable it again.
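If you run several test rounds back to back, a script can account for this two-minute wait. The sketch below is an illustration under stated assumptions: the function name seconds_until_redisable and the REDISABLE_COOLDOWN_SECONDS variable are hypothetical, and the script is expected to record the time of re-enabling itself as a Unix timestamp.

```shell
#!/usr/bin/env bash
# Two-minute wait after re-enabling, as described in this guide.
REDISABLE_COOLDOWN_SECONDS=120

# Hypothetical helper: print how many seconds remain before balancing can be
# disabled again. Arguments: <re-enable time as epoch seconds> [current epoch seconds]
seconds_until_redisable() {
  local enabled_at="$1" now="${2:-$(date +%s)}"
  local remaining=$(( enabled_at + REDISABLE_COOLDOWN_SECONDS - now ))
  if [ "$remaining" -lt 0 ]; then
    remaining=0
  fi
  echo "$remaining"
}

# Example:
#   enabled_at="$(date +%s)"   # record right after running enable-zones
#   sleep "$(seconds_until_redisable "$enabled_at")"
```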
Conclusion
We recommend performing fault tolerance testing on a regular basis, documenting the results, and continuously improving your processes based on the experience you gain.