Troubleshooting in Managed Service for Kubernetes
This section describes typical problems you may encounter while using Managed Service for Kubernetes and gives troubleshooting recommendations.
Error creating a cluster in a different folder's cloud network
Error message:
Permission denied
The error occurs when the resource service account has no required roles in the folder whose cloud network is selected when creating a cluster.
To create a Managed Service for Kubernetes cluster in a cloud network of another folder, assign the resource service account the following roles in this folder:
To use a public IP address, also assign the vpc.publicAdmin role.
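If you work from the CLI, the role assignment can be scripted. A sketch assuming the yc CLI is configured; the folder ID and service account ID are placeholders, and k8s.clusters.agent is shown as an example role — substitute the roles your cluster actually requires:

```shell
# Grant a role to the cluster's service account in the folder that owns
# the network. <folder_ID> and <service_account_ID> are placeholders;
# k8s.clusters.agent is an example role.
yc resource-manager folder add-access-binding <folder_ID> \
  --role k8s.clusters.agent \
  --subject serviceAccount:<service_account_ID>

# To use a public IP address, also assign vpc.publicAdmin:
yc resource-manager folder add-access-binding <folder_ID> \
  --role vpc.publicAdmin \
  --subject serviceAccount:<service_account_ID>
```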
A namespace has been deleted but its status is Terminating and its deletion is not completed
This happens when a namespace has stuck resources that cannot be deleted by the namespace controller.
To fix the issue, delete the stuck resources manually.
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
- Get a list of resources that remain within the namespace:

  kubectl api-resources --verbs=list --namespaced --output=name \
    | xargs --max-args=1 kubectl get --show-kind \
    --ignore-not-found --namespace=<namespace>
- Delete the resources found:

  kubectl delete <resource_type> <resource_name> --namespace=<namespace>
If, after that, the namespace remains in the Terminating status and cannot be deleted, delete it forcibly by removing its finalizers:
- Enable Kubernetes API proxying to your local computer:

  kubectl proxy
- Delete the namespace:

  kubectl get namespace <namespace> --output=json \
    | jq '.spec = {"finalizers":[]}' > temp.json && \
  curl --insecure --header "Content-Type: application/json" \
    --request PUT --data-binary @temp.json \
    127.0.0.1:8001/api/v1/namespaces/<namespace>/finalize
We do not recommend forcibly deleting a namespace in the Terminating status by removing its finalizers right away, as this may leave the stuck resources in your Managed Service for Kubernetes cluster.
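Before resorting to the finalizer workaround, it can help to check which condition is blocking deletion; the namespace status usually names the resources that could not be removed:

```shell
# Show why the namespace deletion is stuck; <namespace> is a placeholder.
kubectl get namespace <namespace> \
  --output=jsonpath='{.status.conditions}'
```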
I am using Yandex Network Load Balancer alongside an Ingress controller. Why are some of my cluster's nodes UNHEALTHY?
This is normal behavior for a load balancer with the External Traffic Policy: Local setting. Only the Managed Service for Kubernetes nodes whose pods are ready to accept user traffic get the HEALTHY status; the remaining nodes are labeled UNHEALTHY.
To find out the policy type of a load balancer created using a LoadBalancer type service, run this command:
kubectl describe svc <LoadBalancer_type_service_name> \
| grep 'External Traffic Policy'
For more information, see Parameters of a LoadBalancer service.
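If having some nodes marked UNHEALTHY is undesirable and you do not need to preserve the client source IP, the service can be switched to the Cluster policy, under which every node passes the load balancer health check. A sketch; <service_name> is a placeholder:

```shell
# Switch the traffic policy. Note: with Cluster, pods no longer see
# the original client source IP, and traffic may take an extra hop.
kubectl patch svc <service_name> \
  --patch '{"spec": {"externalTrafficPolicy": "Cluster"}}'
```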
Why is a created PersistentVolumeClaim still pending?
This is normal for a PersistentVolumeClaim (PVC). The created PVC remains in the Pending status until you create a pod that must use it.
To change the PVC status to Bound:
- View details of the PVC:

  kubectl describe pvc <PVC_name> \
    --namespace=<namespace_PVC_resides_in>
  A message saying waiting for first consumer to be created before binding means that the PVC is waiting for a pod to be created.

- Create a pod for this PVC.
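A minimal pod that consumes the PVC and lets it bind might look as follows; the pod name, image, and mount path are illustrative:

```shell
kubectl apply --filename=- <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pvc-consumer        # illustrative name
spec:
  containers:
    - name: app
      image: nginx:1.25     # any image will do
      volumeMounts:
        - name: data
          mountPath: /data  # illustrative mount path
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: <PVC_name>
EOF
```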
Why does my Managed Service for Kubernetes cluster fail to run after I change its node configuration?
Make sure the new configuration of Managed Service for Kubernetes nodes is within the quota:
If you do not have the Yandex Cloud command line interface yet, install and initialize it.
The folder specified in the CLI profile is used by default. You can specify a different folder using the --folder-name or --folder-id parameter.
To run diagnostics for your Managed Service for Kubernetes cluster nodes:
- Check the health of Managed Service for Kubernetes nodes:

  yc managed-kubernetes cluster list-nodes <cluster_ID>
A message saying that the allowed amount of Managed Service for Kubernetes cluster resources has been exceeded is displayed in the first column of the command output. Example:
+--------------------------------+-----------------+------------------+-------------+--------------+
| CLOUD INSTANCE                 | KUBERNETES NODE | RESOURCES        | DISK        | STATUS       |
+--------------------------------+-----------------+------------------+-------------+--------------+
| fhmil14sdienhr5uh89no          |                 | 2 100% core(s),  | 64.0 GB hdd | PROVISIONING |
| CREATING_INSTANCE              |                 | 4.0 GB of memory |             |              |
| [RESOURCE_EXHAUSTED] The limit |                 |                  |             |              |
| on total size of network-hdd   |                 |                  |             |              |
| disks has exceeded.,           |                 |                  |             |              |
| [RESOURCE_EXHAUSTED] The limit |                 |                  |             |              |
| on total size of network-hdd   |                 |                  |             |              |
| disks has exceeded.            |                 |                  |             |              |
+--------------------------------+-----------------+------------------+-------------+--------------+
To run your Managed Service for Kubernetes cluster, increase the quotas.
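To estimate how close the folder is to the network-hdd size quota before resizing a node group, you can sum the existing disk sizes. A sketch assuming jq is installed and that yc reports disk sizes in bytes:

```shell
# Sum the sizes (in bytes) of all network-hdd disks in the current folder.
yc compute disk list --format json \
  | jq '[.[] | select(.type_id == "network-hdd") | (.size | tonumber)] | add'
```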
An error occurs when renewing an Ingress controller certificate
Error message:
ERROR controller-runtime.manager.controller.ingressgroup Reconciler error
{"name": "some-prod", "namespace": , "error": "rpc error: code = InvalidArgument
desc = Validation error:\nlistener_specs[1].tls.sni_handlers[2].handler.certificate_ids:
Number of elements must be less than or equal to 1"}
The error occurs if different certificates are specified for the same Ingress controller listener.
Solution: Edit the Ingress controller specifications so that only one certificate is specified in each listener's description, then apply them.
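To spot the duplicates, you can list the TLS sections of all Ingress resources and check which hosts appear more than once; this assumes jq is installed:

```shell
# List the hosts from the tls section of every Ingress in the cluster.
kubectl get ingress --all-namespaces --output json \
  | jq '.items[] | {ingress: .metadata.name, tls_hosts: [.spec.tls[]?.hosts[]?]}'
```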
Why is DNS name resolution not working in my cluster?
There may be no name resolution for internal and external DNS queries in a Managed Service for Kubernetes cluster for several reasons. To fix the issue:
- Check the version of your Managed Service for Kubernetes cluster and node groups.
- Make sure that CoreDNS is up and running.
- Make sure the Managed Service for Kubernetes cluster has enough CPU resources available.
- Set up autoscaling.
- Set up local DNS caching.
Check the version of your cluster and node groups
- Get a list of current Kubernetes versions:

  yc managed-kubernetes list-versions
- Find out the Managed Service for Kubernetes cluster version:

  yc managed-kubernetes cluster get <cluster_name_or_ID> | grep version:
You can get the Managed Service for Kubernetes cluster ID and name with a list of clusters in the folder.
- Find out the Managed Service for Kubernetes node group version:

  yc managed-kubernetes node-group get <node_group_name_or_ID> | grep version:
You can get the ID and name of the Managed Service for Kubernetes node group with a list of node groups in your cluster.
- If the versions of your Managed Service for Kubernetes cluster and node groups are not on the list of current Kubernetes versions, upgrade them.
Make sure that CoreDNS is up and running
Get a list of CoreDNS pods and their statuses:
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
Make sure all the pods have the Running status.
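You can also check resolution end to end by resolving an internal name from a disposable pod (busybox 1.28 is used here because nslookup is known to be broken in later busybox builds):

```shell
# Resolve the API server's internal name through the cluster DNS,
# then delete the pod automatically (--rm).
kubectl run dns-test --rm -it --restart=Never \
  --image=busybox:1.28 -- nslookup kubernetes.default
```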
Make sure the cluster has enough CPU resources available
- Go to the folder page and select Managed Service for Kubernetes.
- Click the name of the Managed Service for Kubernetes cluster you need and select the Nodes manager tab.
- Go to the Nodes tab and click the name of any Managed Service for Kubernetes node.
- Go to the Monitoring tab.
- Make sure that, in the CPU, [cores] chart, the used CPU values have not reached the total available CPU values. Check this for each Managed Service for Kubernetes cluster node.
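The same check can be made from the command line, provided metrics-server is running in the cluster:

```shell
# Current CPU and memory usage per node.
kubectl top nodes
```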
Set up autoscaling
Set up automatic DNS scaling by Managed Service for Kubernetes cluster size.
Set up local DNS caching
Set up NodeLocal DNS Cache. To make sure that the settings are optimal, install NodeLocal DNS Cache from Yandex Cloud Marketplace.
When creating a node group via the CLI, a parameter conflict occurs. How do I fix that?
Check whether the --location, --network-interface, and --public-ip parameters are specified in the same command. If you provide these parameters together, the following errors occur:
- For the --location and --public-ip or --location and --network-interface pairs:

  ERROR: rpc error: code = InvalidArgument desc = Validation error: allocation_policy.locations[0].subnet_id: can't use "allocation_policy.locations[0].subnet_id" together with "node_template.network_interface_specs"
- For the --network-interface and --public-ip pair:

  ERROR: flag --public-ip cannot be used together with --network-interface. Use '--network-interface' option 'nat' to get public address
Make sure you only provide one of the three parameters in a command. It is enough to specify the location of a Managed Service for Kubernetes node group in either --location or --network-interface.
To assign public IP addresses to Managed Service for Kubernetes nodes, do one of the following:
- Specify --network-interface ipv4-address=nat or --network-interface ipv6-address=nat.
- Enable access to Managed Service for Kubernetes nodes from the internet after creating a node group.
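For example, a create command that avoids the conflict specifies the subnet and NAT within a single --network-interface flag. A sketch; the names are placeholders, so verify the exact flag syntax against `yc managed-kubernetes node-group create --help`:

```shell
# Location and public IP are both expressed via --network-interface,
# so --location and --public-ip are not needed.
yc managed-kubernetes node-group create \
  --name <node_group_name> \
  --cluster-name <cluster_name> \
  --network-interface subnets=<subnet_name>,ipv4-address=nat
```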
Error connecting to a cluster using kubectl
Error message:
ERROR: cluster has empty endpoint
The error occurs if you try to connect to a cluster that has no public IP address while getting kubectl credentials for a public IP address using this command:
yc managed-kubernetes cluster \
get-credentials <cluster_name_or_ID> \
--external
To connect to the cluster's private IP address from a VM located in the same network, get kubectl credentials using this command:
yc managed-kubernetes cluster \
get-credentials <cluster_name_or_ID> \
--internal
If you need to connect to a cluster from the internet, recreate the cluster and assign it a public IP address.
Errors occur when connecting to a node over SSH
Error messages:
Permission denied (publickey,password)
Too many authentication failures
Errors occur when connecting to a Managed Service for Kubernetes node in the following cases:
- No public SSH key is added to the Managed Service for Kubernetes node group metadata.

  Solution: Update the Managed Service for Kubernetes node group keys.
- An invalid public SSH key is added to the Managed Service for Kubernetes node group metadata.

  Solution: Change the format of the public key file to the appropriate one and update the Managed Service for Kubernetes node group keys.
- No private SSH key is added to an authentication agent (ssh-agent).

  Solution: Add the private key by running the following command:

  ssh-add <path_to_private_key_file>
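If the agent holds several keys, SSH may offer all of them and trip the server's attempt limit, producing Too many authentication failures. Forcing a single identity avoids this; <username> and <node_public_IP> are placeholders:

```shell
# Start the agent if needed, load the key, and offer only that key.
eval "$(ssh-agent -s)"
ssh-add <path_to_private_key_file>
ssh -o IdentitiesOnly=yes -i <path_to_private_key_file> <username>@<node_public_IP>
```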
How do I grant internet access to Managed Service for Kubernetes cluster nodes?
If Managed Service for Kubernetes cluster nodes have no access to the internet, the following error occurs when trying to connect to the internet:
Failed to pull image "cr.yandex/***": rpc error: code = Unknown desc = Error response from daemon: Get https://cr.yandex/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
There are several ways to grant internet access to Managed Service for Kubernetes cluster nodes:
- Create and configure a NAT gateway or a NAT instance. Traffic will then be routed via static routes through the gateway or through a separate VM instance with NAT features.
- Assign a public IP address to a Managed Service for Kubernetes node group.
Note
If you assigned public IP addresses to the cluster nodes and then configured the NAT gateway or NAT instance, internet access via the public IP addresses will be disabled. For more information, see the Yandex Virtual Private Cloud documentation.
Why can't I choose Docker as the container runtime environment?
Docker is not supported as a container runtime environment in clusters running Kubernetes version 1.24 or higher. Only containerd is available.
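To confirm which runtime your nodes use, check the CONTAINER-RUNTIME column of the wide node listing:

```shell
# The CONTAINER-RUNTIME column shows the runtime, e.g. containerd://<version>.
kubectl get nodes --output wide
```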