Troubleshooting in Managed Service for Kubernetes
This section describes typical issues you may encounter while using Managed Service for Kubernetes and gives troubleshooting recommendations.
- Error creating a cluster in a different folder's cloud network
- Why does the newly created PersistentVolumeClaim remain Pending?
- Creating a node group with the CLI results in a parameter conflict. How do I fix it?
- How do I provide internet access to my Managed Service for Kubernetes cluster nodes?
- Traffic loss when deploying app updates in a cluster with Yandex Application Load Balancer
Error creating a cluster in a different folder's cloud network
Error message:
Permission denied
This error occurs when the resource service account lacks the required roles in the folder that contains the cloud network selected when creating the cluster.
To create a Managed Service for Kubernetes cluster in a cloud network of another folder, assign the resource service account the following roles in that folder:
- vpc.privateAdmin
- vpc.user
To use a public IP address, also assign the vpc.publicAdmin role.
Namespace fails to delete and remains Terminating
This issue occurs when your namespace contains stuck resources that the namespace controller cannot delete.
To fix it, delete the stuck resources manually.
If you do not have the Yandex Cloud CLI installed yet, install and initialize it.
By default, the CLI uses the folder specified when creating the profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also set a different folder for any specific command using the --folder-name or --folder-id parameter.
- Get the list of resources remaining in the namespace:
  kubectl api-resources --verbs=list --namespaced --output=name \
    | xargs --max-args=1 kubectl get --show-kind \
    --ignore-not-found --namespace=<namespace>
- Delete the listed resources:
  kubectl delete <resource_type> <resource_name> --namespace=<namespace>
If the namespace remains in the Terminating status and still cannot be deleted, delete it forcibly by clearing its finalizers:
- Run a local proxy to the Kubernetes API:
  kubectl proxy
- Delete the namespace:
  kubectl get namespace <namespace> --output=json \
    | jq '.spec = {"finalizers":[]}' > temp.json && \
    curl --insecure --header "Content-Type: application/json" \
    --request PUT --data-binary @temp.json \
    127.0.0.1:8001/api/v1/namespaces/<namespace>/finalize
We do not recommend clearing the finalizers of a namespace in the Terminating status right away, as this may leave the stuck resources behind in your Managed Service for Kubernetes cluster.
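Before resorting to that, you can inspect the namespace status conditions to see what is blocking deletion. A quick check using standard kubectl only:
# list the namespace conditions, e.g. NamespaceContentRemaining, together with their messages
kubectl get namespace <namespace> \
  --output=jsonpath='{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'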
I am using Yandex Network Load Balancer together with an ingress controller. Why are some of my cluster's nodes UNHEALTHY?
This is normal behavior for a load balancer with External Traffic Policy: Local enabled. Only the Managed Service for Kubernetes nodes whose pods are ready to handle user traffic get the HEALTHY status. All other nodes are labeled as UNHEALTHY.
To check the policy type of a load balancer created using a LoadBalancer service, run this command:
kubectl describe svc <LoadBalancer_service_name> \
| grep 'External Traffic Policy'
For more information, see Parameters of a LoadBalancer service.
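If the UNHEALTHY status itself is a concern and you do not need to preserve client source IP addresses, one option is to switch the service to the Cluster policy, at the cost of an extra network hop. A minimal sketch (the service name is a placeholder):
# change the traffic policy of an existing LoadBalancer service
kubectl patch svc <LoadBalancer_service_name> \
  --patch '{"spec": {"externalTrafficPolicy": "Cluster"}}'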
Why does the newly created PersistentVolumeClaim remain Pending?
This is normal for a PersistentVolumeClaim (PVC). The newly created PVC remains Pending until you create a pod that will use it.
To change the PVC status to Bound:
- View the PVC details:
  kubectl describe pvc <PVC_name> \
    --namespace=<namespace>
  Where --namespace is the namespace containing the PVC.
  The waiting for first consumer to be created before binding message means that the PVC is awaiting pod creation.
- Create a pod for this PVC (a sketch follows this list).
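A minimal sketch of such a pod; the pod name, image, and mount path are placeholders to adapt to your workload:
kubectl apply --namespace=<namespace> --filename=- <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pvc-consumer              # hypothetical pod name
spec:
  containers:
    - name: app
      image: nginx:1.25           # any long-running image will do
      volumeMounts:
        - name: data
          mountPath: /data        # where the volume appears inside the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: <PVC_name>     # the Pending claim
EOF
Once the pod is scheduled, the PVC status should change to Bound.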
Why does my Managed Service for Kubernetes cluster fail to start after I update its node configuration?
Make sure the new configuration of Managed Service for Kubernetes nodes is within the quota:
If you do not have the Yandex Cloud CLI installed yet, install and initialize it.
By default, the CLI uses the folder specified when creating the profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also set a different folder for any specific command using the --folder-name or --folder-id parameter.
To run diagnostics for your Managed Service for Kubernetes cluster nodes:
- Check the state of Managed Service for Kubernetes nodes:
  yc managed-kubernetes cluster list-nodes <cluster_ID>
  A message saying that the limit of Managed Service for Kubernetes cluster resources has been exceeded appears in the first column of the command output. Here is an example:
  +--------------------------------+-----------------+------------------+-------------+--------------+
  | CLOUD INSTANCE                 | KUBERNETES NODE | RESOURCES        | DISK        | STATUS       |
  +--------------------------------+-----------------+------------------+-------------+--------------+
  | fhmil14sdienhr5uh89no          |                 | 2 100% core(s),  | 64.0 GB hdd | PROVISIONING |
  | CREATING_INSTANCE              |                 | 4.0 GB of memory |             |              |
  | [RESOURCE_EXHAUSTED] The limit |                 |                  |             |              |
  | on total size of network-hdd   |                 |                  |             |              |
  | disks has exceeded.,           |                 |                  |             |              |
  | [RESOURCE_EXHAUSTED] The limit |                 |                  |             |              |
  | on total size of network-hdd   |                 |                  |             |              |
  | disks has exceeded.            |                 |                  |             |              |
  +--------------------------------+-----------------+------------------+-------------+--------------+
To start your Managed Service for Kubernetes cluster, increase the quotas.
After changing the node subnet mask in the cluster settings, the number of pods per node is not as expected
Solution: Recreate the node group.
Error updating ingress controller certificate
Error message:
ERROR controller-runtime.manager.controller.ingressgroup Reconciler error
{"name": "some-prod", "namespace": , "error": "rpc error: code = InvalidArgument
desc = Validation error:\nlistener_specs[1].tls.sni_handlers[2].handler.certificate_ids:
Number of elements must be less than or equal to 1"}
The error occurs if different certificates are specified for the same ingress controller handler.
Solution: Edit and apply the ingress controller specifications so that each handler has only one certificate.
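To locate the conflict, it can help to list the TLS settings of the Ingress resources that belong to the group and check which hosts reference more than one certificate (certificates may be referenced in spec.tls or in the controller annotations). A possible check:
# print hosts and TLS secrets of every Ingress; a host listed with different secrets points to the conflict
kubectl get ingress --all-namespaces \
  --output=custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,HOSTS:.spec.rules[*].host,TLS_SECRETS:.spec.tls[*].secretName'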
Why is DNS resolution not working in my cluster?
A Managed Service for Kubernetes cluster may fail to resolve internal and external DNS requests for several reasons. To fix the issue:
- Check the version of your Managed Service for Kubernetes cluster and node groups.
- Make sure CoreDNS is up and running.
- Make sure your Managed Service for Kubernetes cluster has enough CPU resources available.
- Set up autoscaling.
- Set up local DNS caching.
Check the version of your cluster and node groups
- Get the list of current Kubernetes versions:
  yc managed-kubernetes list-versions
- Get the Managed Service for Kubernetes cluster version:
  yc managed-kubernetes cluster get <cluster_name_or_ID> | grep version:
  You can get the Managed Service for Kubernetes cluster ID and name with the list of clusters in the folder.
- Get the Managed Service for Kubernetes node group version:
  yc managed-kubernetes node-group get <node_group_name_or_ID> | grep version:
  You can get the Managed Service for Kubernetes node group ID and name with the list of node groups in the cluster.
- If the versions of your Managed Service for Kubernetes cluster and node groups are not on the list of current Kubernetes versions, upgrade them.
Make sure CoreDNS is up and running
Get the list of CoreDNS pods and their statuses:
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
Make sure all pods have the Running status.
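You can also test resolution directly from a disposable pod, for example (busybox is an arbitrary small image):
# resolve an internal service name from inside the cluster
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup kubernetes.default.svc.cluster.local
# resolve an external name the same way
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup example.com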
Make sure your cluster has enough CPU resources available
- Navigate to the folder dashboard and select Managed Service for Kubernetes.
- Click the name of the Managed Service for Kubernetes cluster you need and select the Node manager tab.
- Go to the Nodes tab and click the name of any Managed Service for Kubernetes node.
- Navigate to the Monitoring tab.
- Make sure that, in the CPU, [cores] chart, the used CPU values have not reached the total available CPU values. Check this for each Managed Service for Kubernetes cluster node. A command-line alternative is shown after this list.
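As a command-line alternative to the charts, you can check current node load with kubectl, provided Metrics Server is available in the cluster:
# per-node CPU and memory usage; nodes at or near 100% CPU can starve CoreDNS
kubectl top nodes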
Set up autoscaling
Set up DNS autoscaling based on the Managed Service for Kubernetes cluster size.
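If you use the approach based on cluster-proportional-autoscaler from the DNS autoscaling guide, the replica count is controlled by a ConfigMap. A sketch, assuming the autoscaler is deployed as kube-dns-autoscaler in kube-system (the ConfigMap name and the scaling values are illustrative):
kubectl apply --filename=- <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns-autoscaler       # assumed name of the autoscaler ConfigMap
  namespace: kube-system
data:
  # one CoreDNS replica per 256 cores or per 16 nodes, whichever yields more replicas
  linear: '{"coresPerReplica":256,"nodesPerReplica":16,"preventSinglePointFailure":true}'
EOF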
Set up local DNS caching
Set up NodeLocal DNS Cache. For optimal settings, install NodeLocal DNS Cache from Yandex Cloud Marketplace.
Creating a node group with the CLI results in a parameter conflict. How do I fix it?
Check whether you are specifying the --location, --network-interface, and --public-ip parameters in the same command. Providing them together causes the following errors:
- For the --location and --public-ip or --location and --network-interface pairs:
  ERROR: rpc error: code = InvalidArgument desc = Validation error: allocation_policy.locations[0].subnet_id: can't use "allocation_policy.locations[0].subnet_id" together with "node_template.network_interface_specs"
- For the --network-interface and --public-ip pair:
  ERROR: flag --public-ip cannot be used together with --network-interface. Use '--network-interface' option 'nat' to get public address
Make sure you only provide one of the three parameters in a command. It is enough to specify the location of a Managed Service for Kubernetes node group either in --location or in --network-interface.
To grant internet access to Managed Service for Kubernetes cluster nodes, do one of the following:
- Assign a public IP address to the cluster nodes by specifying --network-interface ipv4-address=nat or --network-interface ipv6-address=nat (see the example after this list).
- Enable access to Managed Service for Kubernetes nodes from the internet after creating a node group.
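For reference, a sketch of a create command that relies on --network-interface alone for both subnet placement and NAT; names are placeholders, and other required parameters (resources, scale, and so on) are omitted here, so check yc managed-kubernetes node-group create --help for the full list:
yc managed-kubernetes node-group create \
  --name <node_group_name> \
  --cluster-name <cluster_name> \
  --network-interface subnets=<subnet_name>,ipv4-address=nat
# do not combine this with --location or --public-ip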
Error connecting to a cluster using kubectl
Error message:
ERROR: cluster has empty endpoint
This error occurs if you try to connect to a cluster with no public IP address and get kubectl credentials for a public IP address using this command:
yc managed-kubernetes cluster \
get-credentials <cluster_name_or_ID> \
--external
To connect to the cluster's private IP address from a VM in the same network, get kubectl credentials using this command:
yc managed-kubernetes cluster \
get-credentials <cluster_name_or_ID> \
--internal
If you need to connect to a cluster from the internet, recreate the cluster and assign it a public IP address.
Errors connecting to a node over SSH
Error messages:
Permission denied (publickey,password)
Too many authentication failures
The following situations cause errors when connecting to a Managed Service for Kubernetes node:
- No public SSH key is added to the Managed Service for Kubernetes node group metadata.
  Solution: Update the Managed Service for Kubernetes node group keys.
- An invalid public SSH key is added to the Managed Service for Kubernetes node group metadata.
  Solution: Change the format of the public key file to the appropriate one and update the Managed Service for Kubernetes node group keys.
- No private SSH key is added to an authentication agent (ssh-agent).
  Solution: Add a private key by running the ssh-add <path_to_private_key_file> command.
How do I provide internet access to my Managed Service for Kubernetes cluster nodes?
If Managed Service for Kubernetes cluster nodes have no internet access, the following error occurs when trying to connect to the internet:
Failed to pull image "cr.yandex/***": rpc error: code = Unknown desc = Error response from daemon: Get https://cr.yandex/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
You can provide internet access to your Managed Service for Kubernetes cluster nodes in several ways:
- Set up a NAT gateway or NAT instance. With static routing in place, traffic will go through a gateway or a separate NAT instance.
- Assign a public IP address to your Managed Service for Kubernetes node group.
Note
If you assigned public IP addresses to the cluster nodes and then configured the NAT gateway or NAT instance, internet access via the public IP addresses will be disabled. For more information, see our Yandex Virtual Private Cloud article.
Why can't I choose Docker as the container runtime?
Clusters running Kubernetes 1.24 or higher do not support the Docker container runtime; only containerd is available.
Error connecting a GitLab repository to Argo CD
Error message:
FATA[0000] rpc error: code = Unknown desc = error testing repository connectivity: authorization failed
This error occurs if access to GitLab over HTTP(S) is disabled.
Solution: Enable HTTP(S) access. To do this:
- In GitLab, in the left-hand panel, select Admin → Settings → General.
- Under Visibility and access controls, find the Enabled Git access protocols setting.
- In the list, select the item which allows access over HTTP(S).
For more information, see this GitLab guide.
Traffic loss when deploying app updates in a cluster with Yandex Application Load Balancer
When your app traffic is managed by an Application Load Balancer and the load balancer's ingress controller traffic policy is set to externalTrafficPolicy: Local, the app processes requests on the same node they were delivered to by the load balancer. There is no traffic flow between nodes.
The default health check monitors the status of the node, not the application. Therefore, Application Load Balancer traffic may go to a node where the application is not running. When you deploy a new app version in a cluster, the Application Load Balancer ingress controller requests the load balancer to update the backend group configuration. It takes at least 30 seconds to process the request, during which the app may not receive any user traffic.
To prevent this, we recommend setting up backend health checks on your Application Load Balancer. With health checks, the load balancer timely spots unavailable backends and reroutes traffic to healthy backends. Once the application is updated, traffic will again be distributed across all backends.
For more information, see Tips for configuring Yandex Application Load Balancer health checks and Annotations (metadata.annotations).
System time is displayed incorrectly on nodes, as well as in the logs of containers and Managed Service for Kubernetes cluster pods
Managed Service for Kubernetes cluster time may not match the time of other resources, such as VMs, if they use different time synchronization sources. For example, a Managed Service for Kubernetes cluster synchronizes with a time server (by default), whereas a VM synchronizes with a private or public NTP server.
Solution: Set up Managed Service for Kubernetes cluster time synchronization with your private NTP server. To do this:
- Specify the NTP server addresses in the DHCP settings of the master subnets.
  Management console
  - Navigate to the folder dashboard and select Managed Service for Kubernetes.
  - Click the name of the Kubernetes cluster.
  - Under Master configuration, click the subnet name.
  - Click Edit in the top-right corner.
  - In the window that opens, expand the DHCP settings section.
  - Click Add and specify the IP address of your NTP server.
  - Click Save changes.
  CLI
  If you do not have the Yandex Cloud CLI installed yet, install and initialize it.
  By default, the CLI uses the folder specified when creating the profile. To change the default folder, use the yc config set folder-id <folder_ID> command. You can also set a different folder for any specific command using the --folder-name or --folder-id parameter.
  - See the description of the CLI command for updating subnet settings:
    yc vpc subnet update --help
  - Run the subnet update command, specifying the NTP server IP address in the --ntp-server parameter:
    yc vpc subnet update <subnet_ID> --ntp-server <server_address>
Tip
To find out the IDs of the subnets containing the cluster, get detailed information about the cluster.
  Terraform
  - In the Terraform configuration file, change the cluster subnet description. Add the dhcp_options section (if missing) with the ntp_servers parameter specifying the IP address of your NTP server:
    ...
    resource "yandex_vpc_subnet" "lab-subnet-a" {
      ...
      v4_cidr_blocks = ["<IPv4_address>"]
      network_id     = "<network_ID>"
      ...
      dhcp_options {
        ntp_servers = ["<IPv4_address>"]
        ...
      }
    }
    ...
    For more information about the yandex_vpc_subnet settings, see this Terraform provider article.
  - Apply the changes:
    - In the terminal, go to the directory where you edited the configuration file.
    - Make sure the configuration file is correct using this command:
      terraform validate
      If the configuration is correct, you will get this message:
      Success! The configuration is valid.
    - Run this command:
      terraform plan
      You will see a detailed list of resources. No changes will be made at this step. If the configuration contains any errors, Terraform will show them.
    - Apply the changes:
      terraform apply
    - Type yes and press Enter to confirm the changes.
    Terraform will update all required resources. You can check the subnet update using the management console or this CLI command:
    yc vpc subnet get <subnet_name>
  API
  Use the update method for the Subnet resource and provide the following in the request:
  - The NTP server IP address in the dhcpOptions.ntpServers parameter.
  - The dhcpOptions.ntpServers parameter as the field to update, in the updateMask parameter.
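  A hedged sketch of such a request over the REST API; the endpoint below follows the usual Yandex Cloud URL pattern, so verify it against the VPC API reference before use:
  # PATCH the subnet: set the NTP server and name the field being changed in updateMask
  curl --request PATCH \
    --header "Authorization: Bearer <IAM_token>" \
    --header "Content-Type: application/json" \
    --data '{"updateMask": "dhcpOptions.ntpServers", "dhcpOptions": {"ntpServers": ["<NTP_server_IP_address>"]}}' \
    https://vpc.api.cloud.yandex.net/vpc/v1/subnets/<subnet_ID>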
Tip
To find out the IDs of the subnets containing the cluster, get detailed information about the cluster.
Warning
For a highly available master hosted across three availability zones, you need to update each of the three subnets.
- Enable connections from the cluster to NTP servers.
  Create a rule for outbound traffic in the cluster and node group security group:
  - Port range: 123. If your NTP server uses a port other than 123, specify that port.
  - Protocol: UDP.
  - Destination name: CIDR.
  - CIDR blocks: <NTP_server_IP_address>/32. For a master hosted across three availability zones, specify three sections: <NTP_server_IP_address_in_subnet1>/32, <NTP_server_IP_address_in_subnet2>/32, and <NTP_server_IP_address_in_subnet3>/32.
- Update the network settings in the cluster node group using one of the following methods:
  - Connect to each node in the group over SSH or via OS Login and run the sudo dhclient -v -r && sudo dhclient command.
  - Reboot the group nodes at any convenient time.
  Warning
  Updating network settings may cause the services within the cluster to become unavailable for a few minutes.
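Once the nodes have picked up the new DHCP options, you can check the synchronization source on a node over SSH. Which command applies depends on the time daemon shipped with the node image, which is an assumption to verify:
# shows whether the system clock is synchronized
timedatectl status
# for systemd-timesyncd: shows the NTP server currently in use
timedatectl timesync-status
# if the node uses chrony instead
chronyc sources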
What should I do if I deleted my Yandex Network Load Balancer or its target groups that were automatically created for a LoadBalancer service?
You cannot manually restore a Network Load Balancer or its target groups. Recreate your LoadBalancer service. This will automatically create a load balancer and target groups.