© 2025 Direct Cursus Technology L.L.C.

In this article:

  • Goals of testing
  • Pre-test preparation
  • Test environment
  • Requirements for tests
  • Testing tools
  • Testing methodology
  • Preparation steps
  • Initiating the test
  • State assessment
  • Completing the test
  • Application Load Balancer
  • Conclusion

Guide for fault tolerance testing in Yandex Cloud

Written by
Yandex Cloud
Updated at July 28, 2025

This guide covers the practical side of the fault tolerance testing recommendations presented in Recommendations on fault tolerance in Yandex Cloud. It assumes that the infrastructure under test follows the principles discussed in that article.

Goals of testing

This guide describes a methodology for exercising the failure of a cloud availability zone. It allows you to:

  • Study the system's behavior during failure.
  • Evaluate the system’s fault tolerance when one of the availability zones fails.
  • Identify hidden dependencies and vulnerabilities.
  • Collect information on the symptoms of an outage.
  • Check the system's ability to recover quickly.

The study is limited to the complete failure of a single availability zone. Partial failures are too diverse to cover here and fall outside the scope of this guide.

Pre-test preparation

Test environment

  1. Alignment with production environment:

    Warning

    We do not recommend using your production environment for testing; run the exercise in a test environment first.

    • Your test environment should closely match the production environment in terms of configuration.
    • The test load should simulate the production load. Use a load testing tool for this, e.g., Yandex Load Testing.
    • We recommend using Infrastructure as Code to automate the setup of test environments.
  2. Follow these best practices to optimize costs when deploying resources in the test environment:

    • Use NRD disks instead of SSD-IO.
    • Use preemptible VMs.
    • Create your resources dynamically only for the duration of the test.
    • Free up resources automatically after the tests are over.
    • Use components without SLA to reduce costs.
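The "create dynamically, free up automatically" practice above can be sketched as a small wrapper. This is only an illustration under an assumption the guide does not make: that the test environment is described by a Terraform configuration. The function name run_with_ephemeral_env and the directory argument are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: create the test environment only for the duration of a test and
# always destroy it afterwards.
# Assumption (not from the guide): the environment is defined by a Terraform
# configuration located in the directory passed as the first argument.

run_with_ephemeral_env() {
  local tf_dir="$1"; shift
  local rc=0

  # Create the resources; skip the test if creation fails.
  terraform -chdir="$tf_dir" apply -auto-approve || rc=$?
  if [ "$rc" -eq 0 ]; then
    # Run the test command given as the remaining arguments.
    "$@" || rc=$?
  fi

  # Free up the resources whether or not the steps above succeeded.
  if ! terraform -chdir="$tf_dir" destroy -auto-approve; then
    [ "$rc" -ne 0 ] || rc=1
  fi
  return "$rc"
}

# Usage (hypothetical paths):
#   run_with_ephemeral_env ./test-env ./run-zone-failure-test.sh
```

The wrapper returns the status of the first step that failed, so a CI job can use it to mark the test run as failed while still cleaning up.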

Requirements for tests

  1. Use a monitoring system for assessment of test results.
  2. Save your test results for retrospective analysis.
  3. Perform testing on a regular basis.
  4. Use Yandex Cloud CLI 0.154.0 or higher for testing.
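The CLI version requirement can be checked before a test run. The sketch below assumes that `yc version` prints a line containing the version number in x.y.z form and that a `sort` supporting the -V option (GNU coreutils) is available; the helper names are hypothetical.

```shell
#!/usr/bin/env bash
# version_at_least CURRENT REQUIRED: succeeds if CURRENT >= REQUIRED.
version_at_least() {
  local current="$1" required="$2"
  # sort -V orders version strings; REQUIRED must be the smaller (or equal) one.
  [ "$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n1)" = "$required" ]
}

# extract_version: prints the first x.y.z token found on stdin.
extract_version() {
  grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1
}

# Usage (assumes `yc version` output contains the version number):
#   current="$(yc version | extract_version)"
#   version_at_least "$current" 0.154.0 || echo "Update Yandex Cloud CLI" >&2
```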

Testing tools

This document covers fault tolerance testing with the tools that disable load balancing for an individual availability zone in Network Load Balancer and Application Load Balancer.

We recommend using VPC security groups as an additional isolation tool for the disabled zone.
Important note: When using VPC security groups, consider the following specifics:

  • Security groups support only allow rules, so to block traffic you need a separate set of rules that allow traffic between zones, which you delete when you initiate blocking.
  • Deleting allow rules from a security group blocks new network connections but does not terminate connections that are already established.
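Deleting the dedicated inter-zone allow rules might look like the sketch below. It assumes the `yc vpc security-group update-rules` command with the `--delete-rule-id` flag; check `yc vpc security-group update-rules --help` in your CLI version before relying on it. The function name and the rule IDs passed to it are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: block new inter-zone connections by deleting the allow rules that
# were created specifically for traffic between zones.
# Assumption: `update-rules --delete-rule-id` is available in your yc version.

block_zone_traffic() {
  local security_group="$1"; shift
  local rule_id
  for rule_id in "$@"; do
    # Each call removes one allow rule; established connections stay open.
    yc vpc security-group update-rules "$security_group" --delete-rule-id "$rule_id"
  done
}

# Usage (hypothetical group name and rule IDs):
#   block_zone_traffic <security group name or ID> <rule ID> <rule ID>
```

To restore traffic after the test, the same rules have to be re-created, so it helps to keep their definitions in your Infrastructure as Code configuration.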

Testing methodology

Preparation steps

  1. Prepare the environment for testing if needed.

  2. Select the availability zone to disable, i.e., the zone to shift traffic away from, for example, ru-central1-b.

  3. Determine the test duration. You can disable a load balancer zone either permanently or for a specified period, from 1 minute to 72 hours, e.g., 30 minutes.

  4. Get the list of load balancers that will participate in the testing:

    yc load-balancer nlb list
    

Initiating the test

  1. Disable traffic delivery to the selected availability zone for each load balancer from the list using the disable-zones command. For example, to stop a single NLB from balancing traffic to the ru-central1-b availability zone for 30 minutes, run the following command:

    yc load-balancer nlb disable-zones enpmq******** --zones=ru-central1-b --duration 30m
    

    The command result should be as follows (note the disable_zone_statuses section):

     id: enpmq********
     ...
     disable_zone_statuses:
       - zone_id: ru-central1-b
         disabled_until: "2025-07-17T15:30:18.176751322Z"
    

    You can use this command to disable several availability zones at once if you list them separated by commas.

    If you run the command again, the blocking period will be reset to 30 minutes from the current time.

    If you do not specify the --duration parameter in the command, traffic balancing to the selected zones will be blocked indefinitely.

    Warning

    The disable-zones command only disables traffic balancing to the selected availability zone and only for the specified load balancer. This command does not impact network traffic within the zone or between the availability zones in any other cloud services. If you need to block traffic on such a broad scale, you can use VPC security groups on the corresponding cloud resource network interfaces.
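Since the command has to be repeated for each load balancer from the list, a loop can help. The sketch below only wraps the disable-zones command shown above; the function name is hypothetical, and the load balancer IDs are expected as arguments (for example, copied from the output of yc load-balancer nlb list).

```shell
#!/usr/bin/env bash
# Sketch: disable one availability zone on several NLBs for a given duration.
# Arguments: ZONE DURATION NLB_ID...

disable_zone_on_nlbs() {
  local zone="$1" duration="$2"; shift 2
  local nlb_id failed=0
  for nlb_id in "$@"; do
    if ! yc load-balancer nlb disable-zones "$nlb_id" --zones="$zone" --duration "$duration"; then
      # Keep going so the remaining load balancers are still processed.
      echo "Failed to disable $zone on $nlb_id" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage:
#   disable_zone_on_nlbs ru-central1-b 30m <NLB ID> <NLB ID>
```

The function returns a non-zero status if any load balancer could not be updated, so a partially applied test setup does not go unnoticed.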

State assessment

  1. To get information about the blocking state of resources behind a specific NLB, run this command:

    yc load-balancer nlb get enpmq********
    

    The command result should be as follows:

     id: enpmq********
     folder_id: b1g********
     created_at: "2025-07-17T13:42:24Z"
     name: ...
     region_id: ru-central1
     status: ACTIVE
     type: EXTERNAL
     listeners:
       - name: ...
         ...
     attached_target_groups:
       - target_group_id: enpn********
         health_checks:
           - name: ...
             ...
     disable_zone_statuses:
       - zone_id: ru-central1-b
         disabled_until: "2025-07-17T15:30:18.176751322Z"
    
  2. Check that traffic has stopped entering the selected zone. You can do this in the monitoring service by plotting total traffic on your virtual machines' interfaces grouped by availability zone.

    Currently, you cannot plot the zone-by-zone traffic distribution with a single query to the monitoring service. Instead:

    1. Create a chart in the monitoring service.
    2. Create a list of VM IDs for zone ru-central1-a, e.g., using this command:
      yc compute instance list --jq '[.[] | select(.zone_id=="ru-central1-a") | .id ] | join("|")'
      
      The command output will be a single-line list of VM IDs separated by |. For example: fhm**********uv5|fhm**********aab|fhm**********ui1|....
    3. Add a query to the monitoring chart:
      alias(series_sum("network_received_packets"{folderId = "b1g**********", service = "compute", resource_type = "vm", resource_id = "<List of VM IDs you got in the previous step separated by |>"}), "ru-central1-a")
      
    4. Repeat steps 2 and 3 for zones ru-central1-b and ru-central1-d.
    5. Run the queries.
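Repeating steps 2 and 3 for every zone can be scripted. The sketch below reuses the yc compute instance list command and the query template from the steps above and prints one query per zone to paste into the monitoring chart. The function names are hypothetical, and the folder ID is a value you supply.

```shell
#!/usr/bin/env bash
# Sketch: print one monitoring query per availability zone.

# Prints the VM IDs of one zone as a single |-separated line.
list_zone_vm_ids() {
  local zone="$1"
  yc compute instance list --jq "[.[] | select(.zone_id==\"$zone\") | .id ] | join(\"|\")"
}

# Fills the query template from step 3 with a folder ID, zone, and VM ID list.
build_zone_query() {
  local folder_id="$1" zone="$2" vm_ids="$3"
  printf 'alias(series_sum("network_received_packets"{folderId = "%s", service = "compute", resource_type = "vm", resource_id = "%s"}), "%s")\n' \
    "$folder_id" "$vm_ids" "$zone"
}

# Arguments: FOLDER_ID ZONE...
print_zone_queries() {
  local folder_id="$1"; shift
  local zone
  for zone in "$@"; do
    build_zone_query "$folder_id" "$zone" "$(list_zone_vm_ids "$zone")"
  done
}

# Usage:
#   print_zone_queries <folder ID> ru-central1-a ru-central1-b ru-central1-d
```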

Completing the test

  1. Run the enable-zones command to resume traffic balancing in the previously disabled availability zone:

    yc load-balancer nlb enable-zones enpmq******** --zones=ru-central1-b
    
  2. Check that traffic has started flowing to the selected zone.

    Remember that there is a time limit for re-disabling balancing: after re-enabling it, you have to wait 2 minutes before you can disable it again.
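In addition to checking traffic, you can confirm that the zone is no longer listed in disable_zone_statuses. The sketch below parses the default text output of yc load-balancer nlb get as shown earlier in this guide. It assumes that output format stays as in the example; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# zone_is_disabled ZONE: reads `yc load-balancer nlb get <ID>` output on stdin
# and succeeds if ZONE is listed under disable_zone_statuses.
zone_is_disabled() {
  local zone="$1"
  awk -v zone="$zone" '
    { match($0, /^ */); indent = RLENGTH }           # indentation of this line
    /^ *disable_zone_statuses:/ { section_indent = indent; in_section = 1; next }
    in_section && NF && indent <= section_indent { in_section = 0 }  # section ended
    in_section && $0 ~ ("zone_id: *" zone "$") { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Re-enables balancing to ZONE and checks that the zone is no longer disabled.
enable_zone_and_verify() {
  local nlb_id="$1" zone="$2"
  yc load-balancer nlb enable-zones "$nlb_id" --zones="$zone" || return 1
  if yc load-balancer nlb get "$nlb_id" | zone_is_disabled "$zone"; then
    echo "$zone is still disabled on $nlb_id" >&2
    return 1
  fi
}

# Usage:
#   enable_zone_and_verify <NLB ID> ru-central1-b
```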

Application Load Balancer

The syntax and semantics of balancing disable/enable commands for the selected availability zone in Application Load Balancer are the same as in Network Load Balancer.

Conclusion

We recommend performing fault tolerance testing on a regular basis, documenting the results, and continuously improving your processes based on the experience you gain.
