Implementing fault-tolerant use cases for network VMs
In Yandex Cloud, you can deploy a cloud infrastructure using network VMs that provide firewall protection, network security, and traffic routing. With static routing, traffic is routed from subnets to network VMs.
To ensure high availability, we will deploy two NAT VMs in different availability zones using the route-switcher module.
This tutorial describes a use case when the route-switcher module provides fault tolerance of a NAT instance, a network VM with preset routing and IP address translation rules. NAT instances help provide internet access for VMs and other cloud resources hosted in Yandex Cloud.
In the flow chart below, NAT-A is the main internet gateway, while NAT-B is a standby one.
Description of the scheme elements
Element name | Description |
---|---|
NAT-A, NAT-B | NAT instances that provide internet access to cloud resources by translating the resources' internal IP addresses to the NAT instances' public IPs. |
VPC: demo | Virtual Private Cloud network |
private-a | Subnet in the ru-central1-a availability zone for hosting resources that require internet access. |
public-a, public-b | Subnets in the ru-central1-a and ru-central1-b availability zones hosting the NAT instances. |
public ip a, public ip b | NAT instances’ public IP addresses. |
NLB | Internal network load balancer required for the route-switcher module to run; it checks whether the NAT instances are available by running health checks on TCP port 22. |
If NAT-A fails, route-switcher will switch outbound traffic to NAT-B by changing the Next hop value in the route table to the NAT-B internal IP address. After that, NAT-B will provide internet access.
Once NAT-A recovers, route-switcher will change the Next hop value back to the NAT-A internal IP address, thus rerouting outbound traffic to NAT-A.
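The switching rule described above can be sketched as a small shell function. This is only an illustration of the behavior, not the route-switcher module's actual code: the function name, the `HEALTHY`/`UNHEALTHY` status strings, and the IP addresses are assumptions made for the example.

```shell
# Illustrative sketch only: mimics the failover rule described in the text.
# Arguments: <nat_a_status> <nat_b_status> <nat_a_internal_ip> <nat_b_internal_ip>
# The status strings "HEALTHY"/"UNHEALTHY" are assumed names for this example.
choose_next_hop() {
  local a_status="$1" b_status="$2" a_ip="$3" b_ip="$4"
  if [ "$a_status" = "HEALTHY" ]; then
    echo "$a_ip"    # main gateway is up: route through NAT-A
  elif [ "$b_status" = "HEALTHY" ]; then
    echo "$b_ip"    # main gateway is down: fail over to NAT-B
  else
    return 1        # no gateway is available: nothing to switch to
  fi
}

# Hypothetical internal addresses, used only for the demonstration.
choose_next_hop UNHEALTHY HEALTHY 192.168.1.10 192.168.2.10
```

With NAT-A reported as unavailable, the call prints the NAT-B address; once NAT-A is healthy again, the same function returns the NAT-A address, which matches the switch-back behavior described above.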
In this tutorial, we will create a test infrastructure showing how route-switcher works. Our example will include the following components:
- nat-a: Main NAT instance.
- nat-b: Standby NAT instance.
- test-vm: VM within the infrastructure's internal perimeter that is going to have internet access through the respective NAT instance.
- route-switcher-lb-...: Network load balancer required for the route-switcher module to run and used to check if the NAT instances are available.
- route-switcher-...: Cloud function that switches outgoing traffic over to the standby NAT instance if the main one is down.
To deploy the infrastructure and test route-switcher:
- Get your cloud ready.
- Prepare the environment.
- Deploy your resources.
- Enable the route-switcher module.
- Test the solution for performance and fault tolerance.
If you no longer need the resources you created, delete them.
Get your cloud ready
Sign up for Yandex Cloud and create a billing account:
- Go to the management console and log in to Yandex Cloud or create an account if you do not have one yet.
- On the Yandex Cloud Billing page, make sure you have a billing account linked and it has the ACTIVE or TRIAL_ACTIVE status. If you do not have a billing account, create one.
If you have an active billing account, you can go to the cloud page.
Learn more about clouds and folders.
Required paid resources
The infrastructure support cost includes:
- Fee for continuously running VMs (see Yandex Compute Cloud pricing).
- Fee for using Network Load Balancer (see Yandex Network Load Balancer pricing).
- Fee for IP addresses and outbound traffic (see Yandex Virtual Private Cloud pricing).
- Fee for using the function (see Yandex Cloud Functions pricing).
Configure the CLI profile
- If you do not have the Yandex Cloud command line interface yet, install it and sign in as a user.
- Create a service account:
  Management console
  - In the management console, select the folder where you want to create a service account.
  - From the list of services, select Identity and Access Management.
  - Click Create service account.
  - Specify the service account name, e.g., sa-terraform.
  - Click Create.
  CLI
  The folder specified in the CLI profile is used by default. You can specify a different folder through the --folder-name or --folder-id parameter.
  To create a service account, run the command below and specify the sa-terraform name:
      yc iam service-account create --name sa-terraform
  Where name is the service account name.
  Result:
      id: ajehr0to1g8b********
      folder_id: b1gv87ssvu49********
      created_at: "2023-06-20T09:03:11.665153755Z"
      name: sa-terraform
  API
  To create a service account, use the ServiceAccountService/Create gRPC API call or the create REST API method for the ServiceAccount resource.
- Assign the service account the administrator role for the folder:
  Management console
  - On the management console home page, select a folder.
  - Navigate to the Access bindings tab.
  - Find the sa-terraform account in the list and click the options icon.
  - Click Edit roles.
  - In the dialog that opens, click Add role and select the admin role.
  CLI
  Run this command:
      yc resource-manager folder add-access-binding <folder_ID> \
        --role admin \
        --subject serviceAccount:<service_account_ID>
  API
  To assign a service account a role for a folder, use the setAccessBindings REST API method for the ServiceAccount resource or the ServiceAccountService/SetAccessBindings gRPC API call.
- Set up a CLI profile to run operations on behalf of the service account:
  - Create a service account authorized key and save it to the file:
        yc iam key create \
          --service-account-id <service_account_ID> \
          --folder-id <service_account_folder_ID> \
          --output key.json
    Where:
    - service-account-id: Service account ID.
    - folder-id: ID of the service account folder.
    - output: Name of the authorized key file.
    Result:
        id: aje8nn871qo4********
        service_account_id: ajehr0to1g8b********
        created_at: "2023-06-20T09:16:43.479156798Z"
        key_algorithm: RSA_2048
  - Create a CLI profile to perform operations under the service account:
        yc config profile create sa-terraform
    Result:
        Profile 'sa-terraform' created and activated
  - Set the profile configuration:
        yc config set service-account-key key.json
    Where service-account-key is the authorized key file name.
  - Add your credentials to the environment variables:
        export YC_TOKEN=$(yc iam create-token)
Set up the environment for deploying the resources
- Install Git with this command:
      sudo apt install git
- Clone the yandex-cloud-examples/yc-route-switcher GitHub repository and navigate to the directory containing resources for our example:
      git clone https://github.com/yandex-cloud-examples/yc-route-switcher.git
      cd yc-route-switcher/examples
- Open the terraform.tfvars file, e.g., using the nano editor:
      nano terraform.tfvars
- Edit the following:
  - Line with the folder ID:
        folder_id = "<folder_ID>"
  - Line with a list of allowed public IP addresses for test-vm access:
        trusted_ip_for_mgmt = ["<workstation_external_IP_address>/32"]
    Where <workstation_external_IP_address> is your computer's public IP address.
    To find out the external IP address of your workstation, run:
        curl 2ip.ru
    Result:
        192.240.24.87
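As a convenience, the trusted_ip_for_mgmt line can be generated from the detected address instead of being typed by hand. The helper below is a minimal sketch and not part of the repository: the function name `to_trusted_cidr` is hypothetical, and the check only verifies that the value looks like a dotted IPv4 address (it does not validate octet ranges).

```shell
# Hypothetical helper: turns a single IPv4 address into the tfvars line
# expected for trusted_ip_for_mgmt. Shape check only, no octet range check.
to_trusted_cidr() {
  local ip="$1"
  if printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    printf 'trusted_ip_for_mgmt = ["%s/32"]\n' "$ip"
  else
    echo "not an IPv4 address: $ip" >&2
    return 1
  fi
}

# Example with the sample address from the text.
to_trusted_cidr 192.240.24.87
```

On a workstation, you could pass the live value instead, e.g., `to_trusted_cidr "$(curl -s 2ip.ru)"`, and paste the resulting line into terraform.tfvars.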
Deploy your resources
- Initialize Terraform:
      terraform init
- Check whether the Terraform configuration files are correct:
      terraform validate
- Preview your new cloud resources:
      terraform plan
- Create the resources:
      terraform apply
- Wait until the command completes and save its output:
      Outputs:
      nat-a_public_ip_address = "***.***.129.139"
      nat-b_public_ip_address = "***.***.105.234"
      path_for_private_ssh_key = "./pt_key.pem"
      test_vm_password = <sensitive>
      vm_username = "admin"
Enable the route-switcher module
- Make sure the NAT gateways are running and available from the internal network:
  Management console
  - In the management console, select your infrastructure folder.
  - Select Network Load Balancer and navigate to the route-switcher-lb-... page.
  - Open the target group and make sure its resources are Healthy.
- Open the route-switcher.tf file, e.g., using the nano editor:
      nano route-switcher.tf
- Change the start_module value in the route-switcher module to true.
- Start the module with this command:
      terraform apply
  Within five minutes, route-switcher will start its work, providing fault tolerance for outbound NAT traffic.
Test the solution for performance and fault tolerance
Testing the system performance
- Connect to the test-vm serial console:
  Management console
  - In the management console, select your infrastructure folder.
  - Select Compute Cloud.
  - In the VM list, select test-vm.
  - Navigate to the Serial console tab.
  - Wait for the operating system to start up completely.
- Enter the admin username and password.
  To get the password, run this command in the Terraform scenario directory on your computer:
      terraform output test_vm_password
- Make sure test-vm is connected to the internet via the public IP address of nat-a. Run the following command in the serial console:
      curl ifconfig.co
  Compare the IP address you get with the nat-a_public_ip_address value you saved earlier.
- Generate outbound traffic from the test VM by running ping:
      ping ya.ru
  Make sure you get an ICMP response:
      PING ya.ru (77.88.55.242) 56(84) bytes of data.
      64 bytes from ya.ru (77.88.55.242): icmp_seq=1 ttl=56 time=4.67 ms
      64 bytes from ya.ru (77.88.55.242): icmp_seq=2 ttl=56 time=3.83 ms
      64 bytes from ya.ru (77.88.55.242): icmp_seq=3 ttl=56 time=3.80 ms
      64 bytes from ya.ru (77.88.55.242): icmp_seq=4 ttl=56 time=3.78 ms
- Make sure the Next hop value in the route table for the demo network matches the internal IP address of nat-a.
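The Next hop check in the last step can also be done from the command line. The sketch below assumes that the output of `yc vpc route-table get` lists each static route with a `next_hop_address:` field; the helper name `extract_next_hops` and the sample address are hypothetical, so check the field name against the output of your CLI version.

```shell
# Hypothetical helper: prints every next hop address found in the text
# form of a route table description read from standard input.
# Assumes lines of the form "next_hop_address: <IP>".
extract_next_hops() {
  sed -n 's/^[[:space:]-]*next_hop_address:[[:space:]]*//p'
}

# Demonstration on a hand-written sample with a made-up address.
printf 'static_routes:\n  - destination_prefix: 0.0.0.0/0\n    next_hop_address: 192.168.1.10\n' \
  | extract_next_hops
```

Against a real cloud, you would first find the table name with `yc vpc route-table list`, then run `yc vpc route-table get <route_table_name> | extract_next_hops` and compare the result with the internal IP address of nat-a.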
Testing the system fault tolerance
- Emulate a system failure by stopping the main NAT gateway:
  Management console
  - In the management console, select your infrastructure folder.
  - Select Compute Cloud.
  - Select the nat-a VM from the list, click the options icon, and select Stop.
  - In the window that opens, click Stop.
  CLI
  - See the description of the CLI command for stopping a VM:
        yc compute instance stop --help
  - Stop the VM:
        yc compute instance stop nat-a
  API
  Use the stop REST API method for the Instance resource or the InstanceService/Stop gRPC API call.
- Monitor the loss of packets sent by ping.
  After the main NAT instance is disabled, there may be a traffic loss for around one minute, and then the traffic should recover.
- Make sure internet access is now provided via the public IP address of nat-b. To do this, in the serial console, stop the ping command and run the following one:
      curl ifconfig.co
  Compare the IP address with the nat-b_public_ip_address value from the Terraform output you saved earlier.
- Check that route-switcher has changed the Next hop value in the route table for the demo network and that it now matches the internal IP address of nat-b.
- Resume outgoing traffic from the test VM using the ping command.
- Emulate system recovery by starting the main NAT instance:
  Management console
  - In the management console, select your infrastructure folder.
  - Select Compute Cloud.
  - Select the nat-a VM from the list, click the options icon, and select Start.
  - In the window that opens, click Start.
  CLI
  - See the description of the CLI command for starting a VM:
        yc compute instance start --help
  - Start the VM:
        yc compute instance start nat-a
  API
  Use the start REST API method for the Instance resource or the InstanceService/Start gRPC API call.
- Monitor the ping utility output. While the NAT-A instance is being recovered, there may be no loss of sent packets.
- Make sure internet access is provided via the public IP address of nat-a again. To do this, in the serial console, stop the ping command and run the following one:
      curl ifconfig.co
  Compare the IP address with the nat-a_public_ip_address value from the Terraform output you saved earlier.
- Check that route-switcher has changed the Next hop value in the route table for the demo network and that it matches the internal IP address of nat-a again.
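The packet loss that the steps above ask you to monitor can be quantified from saved ping output. The function below is a minimal sketch: the name `count_lost_pings` and the ping.log file mentioned afterwards are hypothetical. It counts gaps in the icmp_seq numbers of received replies, so with the default one-second ping interval the result roughly approximates the outage in seconds. Replies missing after the last received line are not counted.

```shell
# Hypothetical helper: reads ping output from standard input and prints
# how many sequence numbers are missing between received replies.
count_lost_pings() {
  awk '
    match($0, /icmp_seq=[0-9]+/) {
      # "icmp_seq=" is 9 characters long; the digits follow it.
      seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
      if (seen && seq > last + 1) lost += seq - last - 1
      last = seq
      seen = 1
    }
    END { print lost + 0 }
  '
}

# Demonstration on sample lines where replies 3 and 4 are missing.
printf '64 bytes from ya.ru: icmp_seq=1 ttl=56\n64 bytes from ya.ru: icmp_seq=2 ttl=56\n64 bytes from ya.ru: icmp_seq=5 ttl=56\n' \
  | count_lost_pings
```

On the test VM, you could save the output with `ping ya.ru | tee ping.log` during the failover test and then run `count_lost_pings < ping.log`.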
How to delete the resources you created
To stop paying for the resources you created, run this command:
terraform destroy
Warning
Terraform will permanently delete all the resources you created, e.g., networks, subnets, VMs, load balancer, etc.