Implementing fault-tolerant scenarios for NAT VMs
In Yandex Cloud, you can deploy a cloud infrastructure using network VMs that provide firewall protection, network security, and traffic routing. With static routing, you can route traffic from subnets to network VMs.
To ensure high availability, you can deploy multiple network VMs in different availability zones and use a route switcher.
In our scenario, the route switcher ensures fault tolerance of a NAT instance, a network VM with preset routing and IP address translation rules providing internet access for Yandex Cloud resources.
In the flow chart below, NAT-A is the main egress internet gateway, while NAT-B is a standby one.
Chart description
Element name | Description |
---|---|
NAT-A, NAT-B | NAT instances that enable internet access for cloud resources by translating the resources' private IP addresses to the NAT instances' public IP addresses. |
VPC: demo | Virtual Private Cloud network |
private-a | Subnet in the ru-central1-a availability zone, hosting resources that require internet access. |
public-a, public-b | Subnets in the ru-central1-a and ru-central1-b availability zones, hosting NAT instances. |
public ip a, public ip b | NAT instance public IP addresses. |
NLB | Internal network load balancer for the route switcher, performing NAT instance health checks by probing TCP port 22. |
If NAT-A fails, the route switcher will switch outbound traffic to NAT-B by changing the Next hop value in the route table to the NAT-B internal IP address. After that, NAT-B will provide internet access.
As soon as NAT-A recovers, the route switcher will change the Next hop value back to the NAT-A internal IP address, thus rerouting outbound traffic through NAT-A.
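The switching rule described above can be sketched as a small shell function. This is only an illustration of the decision logic, not the code of the actual route switcher function; the function name `choose_next_hop`, the `healthy`/`unhealthy` strings, and the IP addresses in the usage example are assumptions made for this sketch.

```shell
# Hypothetical decision rule, not the real route switcher implementation.
# Arguments: main NAT IP, standby NAT IP, main health, standby health, current next hop.
# Health values are assumed to be the strings "healthy" or "unhealthy".
choose_next_hop() {
  local primary_ip="$1" standby_ip="$2"
  local primary_state="$3" standby_state="$4" current="$5"

  if [ "$primary_state" = "healthy" ]; then
    # The main NAT instance is preferred whenever it is up (this also covers failback).
    echo "$primary_ip"
  elif [ "$standby_state" = "healthy" ]; then
    # The main instance is down: fail over to the standby one.
    echo "$standby_ip"
  else
    # Nothing healthy to switch to: leave the route unchanged.
    echo "$current"
  fi
}
```

For example, `choose_next_hop 10.0.1.10 10.0.2.10 unhealthy healthy 10.0.1.10` prints `10.0.2.10`, which corresponds to the NAT-A failure case.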
In this tutorial, we will create a test infrastructure showing how a route switcher works. Our example will include the following basic components:
- nat-a: Main NAT instance.
- nat-b: Standby NAT instance.
- test-vm: Internal VM accessing the internet through a NAT instance.
- route-switcher-lb-...: Network load balancer for the route switcher, running health checks on the NAT instances.
- route-switcher-...: Cloud function switching outbound traffic to the standby NAT instance if the main one is down.
To deploy the infrastructure and test your route switcher:
- Get your cloud ready.
- Set up your environment.
- Deploy your resources.
- Enable the route switcher.
- Test the solution for performance and fault tolerance.
If you no longer need the resources you created, delete them.
Get your cloud ready
Sign up in Yandex Cloud and create a billing account:
- Navigate to the management console and log in to Yandex Cloud or register a new account.
- On the Yandex Cloud Billing page, make sure you have a billing account linked and it has the ACTIVE or TRIAL_ACTIVE status. If you do not have a billing account, create one and link a cloud to it.
If you have an active billing account, you can navigate to the cloud page.
Learn more about clouds and folders.
Required paid resources
The infrastructure support cost includes:
- Fee for continuously running VMs (see Yandex Compute Cloud pricing).
- Fee for using Network Load Balancer (see Yandex Network Load Balancer pricing).
- Fee for public IP addresses and outbound traffic (see Yandex Virtual Private Cloud pricing).
- Fee for using the function (see Yandex Cloud Functions pricing).
Configure your CLI profile
- If you do not have the Yandex Cloud CLI yet, install it and sign in as a user.
- Create a service account:
  Management console
  - In the management console, select the folder where you want to create a service account.
  - In the list of services, select Identity and Access Management.
  - Click Create service account.
  - Enter a name for the service account, e.g., sa-terraform.
  - Click Create.
  CLI
  By default, the CLI uses the folder specified when creating the profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also set a different folder for any specific command using the --folder-name or --folder-id parameter.
  To create a service account, run the command below, specifying sa-terraform as the service account name:
      yc iam service-account create --name sa-terraform
  Where --name is the service account name.
  Result:
      id: ajehr0to1g8b********
      folder_id: b1gv87ssvu49********
      created_at: "2023-06-20T09:03:11.665153755Z"
      name: sa-terraform
  API
  To create a service account, use the ServiceAccountService/Create gRPC API call or the create REST API method for the ServiceAccount resource.
- Assign the admin role for the folder to the service account:
  Management console
  - On the management console home page, select a folder.
  - Navigate to the Access bindings tab.
  - Find the sa-terraform account in the list and click the options icon next to it.
  - Click Edit roles.
  - In the dialog that opens, click Add role and select admin.
  CLI
  Run this command:
      yc resource-manager folder add-access-binding <folder_ID> \
        --role admin \
        --subject serviceAccount:<service_account_ID>
  API
  To assign a role for a folder to a service account, use the setAccessBindings REST API method for the ServiceAccount resource or the ServiceAccountService/SetAccessBindings gRPC API call.
- Set up the CLI profile to run operations under the service account:
  CLI
  - Create an authorized key for the service account and save it to the file:
      yc iam key create \
        --service-account-id <service_account_ID> \
        --folder-id <ID_of_folder_with_service_account> \
        --output key.json
    Where:
    - service-account-id: Service account ID.
    - folder-id: Service account folder ID.
    - output: Authorized key file name.
    Result:
      id: aje8nn871qo4********
      service_account_id: ajehr0to1g8b********
      created_at: "2023-06-20T09:16:43.479156798Z"
      key_algorithm: RSA_2048
  - Create a CLI profile to run operations under the service account:
      yc config profile create sa-terraform
    Result:
      Profile 'sa-terraform' created and activated
  - Configure the profile:
      yc config set service-account-key key.json
    Where service-account-key is the service account authorized key file.
  - Add your credentials to the environment variables:
      export YC_TOKEN=$(yc iam create-token)
Set up an environment for deploying the resources
- Install Git using the following command:
    sudo apt install git
- Clone the yandex-cloud-examples/yc-route-switcher GitHub repository and navigate to the directory containing resources for our example:
    git clone https://github.com/yandex-cloud-examples/yc-route-switcher.git
    cd yc-route-switcher/examples
- Open the terraform.tfvars file in a text editor, such as nano:
    nano terraform.tfvars
- Edit the following:
  - Folder ID line:
      folder_id = "<folder_ID>"
  - Line with a list of public IP addresses allowed to access test-vm:
      trusted_ip_for_mgmt = ["<workstation_external_IP_address>/32"]
    Where <workstation_external_IP_address> is your computer's public IP address. To get it, run this command:
      curl 2ip.ru
    Result:
      192.240.24.87
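A typo in the trusted address is an easy way to lock yourself out of test-vm, so it may help to generate the `trusted_ip_for_mgmt` line rather than type it. The sketch below validates a dotted-quad IPv4 address and prints the line in the format shown above. The function name `trusted_ip_line` is hypothetical and not part of the repository.

```shell
# Print the terraform.tfvars line for a single workstation IPv4 address.
# Returns non-zero if the argument is not a valid dotted-quad IPv4 address.
trusted_ip_line() {
  local ip="$1" old_ifs octet

  # Four dot-separated groups of 1 to 3 digits.
  if ! printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "not an IPv4 address: $ip" >&2
    return 1
  fi

  # Split on dots and check that each octet is within 0-255.
  old_ifs="$IFS"; IFS=.
  set -- $ip
  IFS="$old_ifs"
  for octet in "$@"; do
    if [ "$octet" -gt 255 ]; then
      echo "octet out of range: $octet" >&2
      return 1
    fi
  done

  printf 'trusted_ip_for_mgmt = ["%s/32"]\n' "$ip"
}
```

Assuming the lookup service returns a bare address, as in the result above, you could run `trusted_ip_line "$(curl -s 2ip.ru)"` and paste the printed line into terraform.tfvars.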
Deploy your resources
- Initialize Terraform:
    terraform init
- Check whether the Terraform configuration files are correct:
    terraform validate
- Check the list of new cloud resources:
    terraform plan
- Create the resources:
    terraform apply
- Wait until the command completes and save its output:
    Outputs:
    nat-a_public_ip_address = "***.***.129.139"
    nat-b_public_ip_address = "***.***.105.234"
    path_for_private_ssh_key = "./pt_key.pem"
    test_vm_password = <sensitive>
    vm_username = "admin"
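If you saved the output to a text file, a small helper can pull a single value back out when you need it for the checks later in this tutorial. The sketch below assumes the file keeps the `name = "value"` layout shown above; the function name `saved_output` and the file name `outputs.txt` are hypothetical.

```shell
# Read one quoted value from saved Terraform output text.
# $1: output name, $2: path to the file with the saved output.
# Prints nothing if the name is missing or the value is not quoted (e.g. <sensitive>).
saved_output() {
  sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*\"\(.*\)\"[[:space:]]*\$/\1/p" "$2"
}
```

For example, `saved_output nat-a_public_ip_address outputs.txt` prints the NAT-A public IP address. Alternatively, running `terraform output -raw nat-a_public_ip_address` in the Terraform directory reads the same value directly from the state.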
Enable the route switcher
- Make sure the NAT instances are running and accessible from the internal network:
  Management console
  - In the management console, select your infrastructure folder.
  - Select Network Load Balancer and navigate to the route-switcher-lb-... page.
  - Expand the target group and check whether its resources are Healthy.
- Open the route-switcher.tf file in a text editor, such as nano:
    nano route-switcher.tf
- Change the start_module value in the route-switcher module to true.
- Start the module with this command:
    terraform apply
  Within five minutes, the route switcher will start, providing fault tolerance for outbound NAT traffic.
Test the solution for performance and fault tolerance
Testing the system performance
- Connect to the test-vm serial console:
  Management console
  - In the management console, select your infrastructure folder.
  - Select Compute Cloud.
  - In the VM list, select test-vm.
  - Navigate to the Serial console tab.
  - Wait for the operating system to boot.
- Enter the admin username and password. To get the password, run this command from the Terraform directory on your computer:
    terraform output test_vm_password
- Make sure test-vm uses the nat-a public IP address to access the internet by running this command in the serial console:
    curl ifconfig.co
  Compare the IP address you get with nat-a_public_ip_address you saved earlier.
- Run a ping to a public host to trigger test VM outbound traffic:
    ping ya.ru
  Make sure you get an ICMP response:
    PING ya.ru (77.88.55.242) 56(84) bytes of data.
    64 bytes from ya.ru (77.88.55.242): icmp_seq=1 ttl=56 time=4.67 ms
    64 bytes from ya.ru (77.88.55.242): icmp_seq=2 ttl=56 time=3.83 ms
    64 bytes from ya.ru (77.88.55.242): icmp_seq=3 ttl=56 time=3.80 ms
    64 bytes from ya.ru (77.88.55.242): icmp_seq=4 ttl=56 time=3.78 ms
- Check the route table to make sure the Next hop value for the demo network matches the nat-a internal IP address.
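The IP comparison in the steps above can be done by eye, but a tiny helper makes the result unambiguous when you repeat the check during the failover test. This is a minimal sketch; the function name `check_egress` is hypothetical, and it assumes the lookup service returns only the address.

```shell
# Compare the expected NAT public IP with the address seen from the outside.
# $1: expected public IP, $2: observed public IP.
check_egress() {
  if [ "$1" = "$2" ]; then
    echo "OK: egress via $1"
  else
    echo "MISMATCH: expected $1, got $2" >&2
    return 1
  fi
}
```

On test-vm, you could run `check_egress "<nat-a_public_ip_address>" "$(curl -s ifconfig.co)"`, replacing the placeholder with the value you saved after `terraform apply`.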
Testing the system fault tolerance
- Emulate a system failure by stopping the main NAT instance:
  Management console
  - In the management console, select your infrastructure folder.
  - Select Compute Cloud.
  - Select the nat-a VM from the list, click the options icon next to it, and select Stop.
  - In the window that opens, click Stop.
  CLI
  - See the description of the CLI command for stopping a VM:
      yc compute instance stop --help
  - Stop the VM:
      yc compute instance stop nat-a
  API
  Use the stop REST API method for the Instance resource or the InstanceService/Stop gRPC API call.
- Monitor the loss of ping packets. After the main NAT instance is stopped, you may see traffic loss for about a minute before traffic recovers.
- Make sure test-vm now uses the nat-b public IP address to access the internet by stopping ping and running this command in the serial console:
    curl ifconfig.co
  Compare the IP address you get with nat-b_public_ip_address you saved earlier.
- Check the route table to make sure the route switcher changed the Next hop value for the demo network to the nat-b internal IP address.
- Run a ping to trigger test VM outbound traffic:
    ping ya.ru
- Emulate the system recovery by starting the main NAT instance:
  Management console
  - In the management console, select your infrastructure folder.
  - Select Compute Cloud.
  - Select the nat-a VM from the list, click the options icon next to it, and select Start.
  - In the window that opens, click Start.
  CLI
  - See the description of the CLI command for starting a VM:
      yc compute instance start --help
  - Start the VM:
      yc compute instance start nat-a
  API
  Use the start REST API method for the Instance resource or the InstanceService/Start gRPC API call.
- Monitor the ping output. As NAT-A recovers, you may not see any packet loss.
- Make sure test-vm now uses the nat-a public IP address to access the internet by stopping ping and running this command in the serial console:
    curl ifconfig.co
  Compare the IP address you get with nat-a_public_ip_address you saved earlier.
- Check the route table to make sure the route switcher changed the Next hop value for the demo network back to the nat-a internal IP address.
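To put a number on the packet loss you observed while monitoring ping during failover, you can count the gaps in the icmp_seq values. The sketch below reads ping output from standard input and assumes Linux-style reply lines containing `icmp_seq=N`, as in the sample output earlier; the function name `ping_gaps` is hypothetical. Replies lost after the last one received are not counted.

```shell
# Report gaps in icmp_seq numbers found in ping output on stdin.
ping_gaps() {
  awk '
    match($0, /icmp_seq=[0-9]+/) {
      # "icmp_seq=" is 9 characters long; the rest of the match is the number.
      seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
      if (seen && seq > last + 1) {
        lost += seq - last - 1
        printf "gap: %d..%d\n", last + 1, seq - 1
      }
      last = seq
      seen = 1
    }
    END { printf "lost replies: %d\n", lost + 0 }
  '
}
```

If you capture the session to a file, e.g., with `ping ya.ru | tee ping.log`, then `ping_gaps < ping.log` lists the missing ranges. With the default one-second ping interval, the number of lost replies roughly matches the outage duration in seconds.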
How to delete the resources you created
To stop paying for the resources you created, run this command:
terraform destroy
Warning
Terraform will permanently delete all resources it created, including networks, subnets, VMs, and the load balancer.