Implementing fault-tolerant scenarios for network VMs
In Yandex Cloud, you can deploy a cloud infrastructure using network VMs that provide firewall protection, network security, and traffic routing. With static routing, traffic is routed from subnets to network VMs.
To ensure high availability, you can deploy multiple network VMs in different availability zones and set up automatic switching of outgoing subnet traffic from one network VM to another using the route-switcher module.
This tutorial describes a use case where the route-switcher module provides fault tolerance for a NAT instance: a network VM with preset routing and IP address translation rules. NAT instances are used to provide internet access for VMs and other cloud resources hosted in Yandex Cloud.
In the diagram used in this example, the NAT instance named NAT-A is the main VM for outgoing traffic to the internet, while NAT-B is the standby one.
Description of scheme elements
| Element name | Description |
|---|---|
| NAT-A, NAT-B | NAT instances that provide cloud resources with access to the internet by translating the resources' internal IP addresses to the NAT instances' public IP addresses. |
| VPC: demo | Virtual Private Cloud network. |
| private-a | Subnet in the ru-central1-a availability zone for hosting resources that require access to the internet. |
| public-a, public-b | Subnets in the ru-central1-a and ru-central1-b availability zones hosting the NAT instances. |
| public ip a, public ip b | NAT instance public IP addresses. |
| NLB | Internal network load balancer required for the route-switcher module to run; it checks whether the NAT instances are available by performing health checks on TCP port 22. |
If NAT-A fails, the route-switcher will switch outgoing traffic over to NAT-B by changing the Next hop value in the subnet route table to the NAT-B internal IP address. After that, internet access will be provided through NAT-B.
As soon as NAT-A recovers, the route-switcher will reroute outgoing traffic back through NAT-A by changing the Next hop value in the route table to the NAT-A instance internal IP address.
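For reference, you can see the value the module manages by inspecting the static routes of the subnet route table with the yc CLI. This is only an illustrative sketch: the route table name below is a placeholder, and the field names follow the VPC API.

```bash
# Show the static routes of the route table used by the private subnet.
# <route_table_name_or_ID> is a placeholder for the table created in this tutorial.
yc vpc route-table get <route_table_name_or_ID> --format yaml

# In the output, the static_routes section should contain a route like:
#   destination_prefix: 0.0.0.0/0
#   next_hop_address: <internal_IP_of_the_active_NAT_instance>
# The route-switcher module rewrites next_hop_address during a failover.
```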
This tutorial will help you create a test infrastructure that shows how the route-switcher module works. The solution has the following basic elements:
- nat-a: Main NAT instance.
- nat-b: Standby NAT instance.
- test-vm: VM in the infrastructure's internal perimeter that gets internet access through the currently active NAT instance.
- route-switcher-lb-...: Network load balancer required for the route-switcher module to run and used to check if the NAT instances are available.
- route-switcher-...: Cloud function that switches outgoing traffic over to the standby NAT instance if the primary one is down.
To deploy the test infrastructure and test the route-switcher:
- Prepare your cloud.
- Prepare the environment.
- Deploy your resources.
- Enable the route-switcher module.
- Test the solution for performance and fault tolerance.
If you no longer need the resources you created, delete them.
Prepare your cloud
Sign up for Yandex Cloud and create a billing account:
- Go to the management console and log in to Yandex Cloud or create an account if you do not have one yet.
- On the Yandex Cloud Billing page, make sure you have a billing account linked and it has the ACTIVE or TRIAL_ACTIVE status. If you do not have a billing account, create one.
If you have an active billing account, you can go to the cloud page.
Learn more about clouds and folders.
Required paid resources
The infrastructure support cost includes:
- Fee for continuously running VMs (see Yandex Compute Cloud pricing).
- Fee for using Network Load Balancer (see Yandex Network Load Balancer pricing).
- Fee for using public IP addresses and outgoing traffic (see Yandex Virtual Private Cloud pricing).
- Fee for using the function (see Yandex Cloud Functions pricing).
Configure the CLI profile
- If you do not have the Yandex Cloud command line interface yet, install it and sign in as a user.
- Create a service account:
Management console
- In the management console, select the folder where you want to create a service account.
- In the list of services, select Identity and Access Management.
- Click Create service account.
- Enter a name for the service account, e.g., sa-terraform.
- Click Create.

CLI
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
Run the command below to create a service account, specifying the name sa-terraform:
yc iam service-account create --name sa-terraform
Where name is the service account name.
Result:
id: ajehr0to1g8b********
folder_id: b1gv87ssvu49********
created_at: "2023-06-20T09:03:11.665153755Z"
name: sa-terraform

API
To create a service account, use the ServiceAccountService/Create gRPC API call or the create REST API method for the ServiceAccount resource.
- Assign the service account the administrator role for the folder:
Management console
- On the management console home page, select a folder.
- Go to the Access bindings tab.
- Find the sa-terraform account in the list and click the icon in its row.
- Click Edit roles.
- Click Add role in the dialog box that opens and select the admin role.

CLI
Run this command:
yc resource-manager folder add-access-binding <folder_ID> \
  --role admin \
  --subject serviceAccount:<service_account_ID>

API
To assign the service account a role for the folder, use the setAccessBindings REST API method for the ServiceAccount resource or the ServiceAccountService/SetAccessBindings gRPC API call.
- Set up the CLI profile to run operations on behalf of the service account:

CLI
- Create an authorized key for the service account and save it to a file:
  yc iam key create \
    --service-account-id <service_account_ID> \
    --folder-id <ID_of_folder_with_service_account> \
    --output key.json
  Where:
  - service-account-id: Service account ID.
  - folder-id: ID of the folder in which the service account was created.
  - output: Name of the file with the authorized key.
Result:
id: aje8nn871qo4********
service_account_id: ajehr0to1g8b********
created_at: "2023-06-20T09:16:43.479156798Z"
key_algorithm: RSA_2048
- Create a CLI profile to run operations on behalf of the service account:
yc config profile create sa-terraform
Result:
Profile 'sa-terraform' created and activated
- Set the profile configuration:
yc config set service-account-key key.json
  Where:
  - service-account-key: File with the service account authorized key.
- Add the credentials to the environment variables:
export YC_TOKEN=$(yc iam create-token)
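Optionally, you can verify that the profile is configured correctly by printing the CLI configuration; this check is not part of the main procedure:

```bash
# The active profile should be sa-terraform and should reference key.json.
yc config profile list
yc config list
```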
Prepare an environment for deploying the resources
- Install Git using the following command:
  sudo apt install git
- Clone the yandex-cloud-examples/yc-route-switcher GitHub repository and go to the script folder:
  git clone https://github.com/yandex-cloud-examples/yc-route-switcher.git
  cd yc-route-switcher/examples
- Open the terraform.tfvars file, for example, using the nano editor:
  nano terraform.tfvars
- Edit the following:
  - The line with the folder ID:
    folder_id = "<folder_ID>"
  - The line with the list of public IP addresses allowed to access test-vm:
    trusted_ip_for_mgmt = ["<workstation_external_IP_address>/32"]
    Where <workstation_external_IP_address> is your workstation's public IP address.
    To find out the external IP address of your workstation, run:
    curl 2ip.ru
Result:
192.240.24.87
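If you want to avoid copying the address by hand, a one-liner like the following prints a ready-to-paste line for terraform.tfvars; it reuses the same 2ip.ru service shown above:

```bash
# Print a ready-to-paste trusted_ip_for_mgmt line using your current public IP.
echo "trusted_ip_for_mgmt = [\"$(curl -s 2ip.ru)/32\"]"
```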
Deploy your resources
- Initialize Terraform:
terraform init
- Check the Terraform file configuration:
terraform validate
- Check the list of cloud resources you are about to create:
terraform plan
- Create resources:
terraform apply
- Wait until the resources are deployed and save the resulting command output:
Outputs:
nat-a_public_ip_address = "***.***.129.139"
nat-b_public_ip_address = "***.***.105.234"
path_for_private_ssh_key = "./pt_key.pem"
test_vm_password = <sensitive>
vm_username = "admin"
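For the checks further below, it may be convenient to read individual values from the Terraform output instead of copying them manually. A small sketch using the standard terraform output command (the shell variable names are arbitrary):

```bash
# Read individual output values from the Terraform state (run in yc-route-switcher/examples).
NAT_A_IP=$(terraform output -raw nat-a_public_ip_address)
NAT_B_IP=$(terraform output -raw nat-b_public_ip_address)
SSH_KEY=$(terraform output -raw path_for_private_ssh_key)
echo "NAT-A: ${NAT_A_IP}, NAT-B: ${NAT_B_IP}, key: ${SSH_KEY}"
```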
Enable the route-switcher module
- Make sure the NAT instances are running and available within the network:

Management console
- In the management console, select the appropriate folder.
- Select Network Load Balancer and go to the route-switcher-lb-... network load balancer page.
- Open the target group and make sure the target resources are Healthy.
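If you prefer the CLI, a rough equivalent of this check is to view the target states of the balancer. The commands below are a sketch: substitute the actual balancer name and target group ID from your folder, and check yc load-balancer network-load-balancer --help if the subcommands differ in your CLI version.

```bash
# Find the route-switcher load balancer and its attached target group.
yc load-balancer network-load-balancer list
# Show target health; both NAT instances should be reported as healthy.
yc load-balancer network-load-balancer target-states <balancer_name_or_ID> \
  --target-group-id <target_group_ID>
```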
- Open the route-switcher.tf file, for example, using the nano editor:
  nano route-switcher.tf
- Change the value of the start_module parameter for the route-switcher module to true.
- Run the module with the following command:
terraform apply
Within 5 minutes of the resources being deployed, the route-switcher module starts ensuring fault tolerance of outgoing internet traffic via the NAT instances.
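As an optional sanity check, you can confirm that the module components listed earlier (the route-switcher-... function and the route-switcher-lb-... load balancer) are present in the folder:

```bash
# Both lists should contain the route-switcher resources described above.
yc serverless function list
yc load-balancer network-load-balancer list
```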
Test the solution for performance and fault tolerance
Testing the system performance
- Connect to the test-vm serial console:

Management console
- In the management console, select the appropriate folder.
- Select Compute Cloud.
- In the VM list, select test-vm.
- Go to the Serial console tab.
- Wait for the operating system to start up completely.
- Enter the admin username and password.
  To find out the password, run the following command in your workstation's terraform scenario folder:
  terraform output test_vm_password
- Make sure test-vm is connected to the internet via the public IP address of the nat-a instance. Run the following command in the serial console:
  curl ifconfig.co
  Compare the IP address with the nat-a_public_ip_address value from the Terraform output obtained earlier.
- Enable outgoing traffic from the test VM to a resource on the internet using the ping command:
  ping ya.ru
Make sure that packets are returned:
PING ya.ru (77.88.55.242) 56(84) bytes of data.
64 bytes from ya.ru (77.88.55.242): icmp_seq=1 ttl=56 time=4.67 ms
64 bytes from ya.ru (77.88.55.242): icmp_seq=2 ttl=56 time=3.83 ms
64 bytes from ya.ru (77.88.55.242): icmp_seq=3 ttl=56 time=3.80 ms
64 bytes from ya.ru (77.88.55.242): icmp_seq=4 ttl=56 time=3.78 ms
- Make sure the Next hop value in the route table for the demo network matches the internal IP address of the nat-a instance.
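Optionally, before moving on to the fault tolerance test, you can leave a simple monitoring loop running in the test-vm serial console. It repeats the same ifconfig.co check every few seconds (the interval is arbitrary) and makes the moment of switchover easy to spot:

```bash
# Run on test-vm: print a timestamp and the current exit IP every 5 seconds.
# Stop it with Ctrl+C before running the other checks.
while true; do
  echo "$(date +%T) $(curl -s --max-time 5 ifconfig.co)"
  sleep 5
done
```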
Testing the system fault tolerance
- Disable the main NAT instance by emulating a system failure:

Management console
- In the management console, select the appropriate folder.
- Select Compute Cloud.
- Select the nat-a VM from the list, open its actions menu, and select Stop.
- In the window that opens, click Stop.

CLI
- See the description of the CLI command for stopping a VM:
  yc compute instance stop --help
- Stop the VM:
  yc compute instance stop nat-a

API
Use the stop REST API method for the Instance resource or the InstanceService/Stop gRPC API call.
- Monitor the loss of packets sent by ping. After the main NAT instance is disabled, you may see traffic loss for about 1 minute on average, after which traffic recovers.
- Make sure internet access is now provided via the public IP address of the nat-b instance. To do this, stop the ping command in the serial console and run the following one:
  curl ifconfig.co
  Compare the IP address with the nat-b_public_ip_address value from the Terraform output.
- Check that the route-switcher has changed the Next hop value in the route table for the demo network and that it now matches the internal IP address of the nat-b instance.
- Enable outgoing traffic from the test VM again using the ping command.
- Run the main NAT instance by emulating system recovery:

Management console
- In the management console, select the appropriate folder.
- Select Compute Cloud.
- Select the nat-a VM from the list, open its actions menu, and select Start.
- In the window that opens, click Start.
CLI
- See the description of the CLI command for starting a VM:
  yc compute instance start --help
- Start the VM:
  yc compute instance start nat-a

API
Use the start REST API method for the Instance resource or the InstanceService/Start gRPC API call.
- Monitor the ping utility output. There should be no loss of sent packets while the NAT-A instance is recovering.
- Make sure internet access is provided via the public IP address of the nat-a instance again. To do this, stop the ping command in the serial console and run the following one:
  curl ifconfig.co
  Compare the IP address with the nat-a_public_ip_address value from the Terraform output.
- Check that the route-switcher has changed the Next hop value in the route table for the demo network and that it matches the internal IP address of the nat-a instance again.
How to delete the resources you created
To stop paying for the resources you created, run the command:
terraform destroy
Warning
Terraform will permanently delete all the resources: networks, subnets, VMs, load balancer, etc.