Kubernetes Node Remediation
Kubernetes Node Remediation is a tool comprising several Kubernetes operators that provide automatic recovery of failed Managed Service for Kubernetes cluster nodes and high availability for stateful workloads.
The solution features two controllers:
- Node Healthcheck Controller, which tracks failures.
- Self Node Remediation Controller, which moves the workload away from unhealthy nodes and restores them.
- Install kubectl and configure it to work with your cluster.
- Create a node group for Kubernetes Node Remediation.
- Configure the application:
  - Namespace: Create a new namespace, e.g., `remediation-space`. If you leave the default namespace, Kubernetes Node Remediation may work incorrectly.
  - Application name: Specify the application name.
- Click Install.
- Wait for the application to change its status to `Deployed`.
- Create the `NodeHealthCheck` resource:
  - Create a file with the `NodeHealthCheck` description:

    ```yaml
    apiVersion: remediation.medik8s.io/v1alpha1
    kind: NodeHealthCheck
    metadata:
      name: nodehc-sample
    spec:
      minHealthy: 51%
      remediationTemplate:
        apiVersion: self-node-remediation.medik8s.io/v1alpha1
        kind: SelfNodeRemediationTemplate
        name: self-node-remediation-automatic-strategy-template
        namespace: <application_namespace>
      selector:
        matchLabels:
          beta.kubernetes.io/os: linux
      unhealthyConditions:
        - duration: 60s
          status: "False"
          type: Ready
        - duration: 60s
          status: Unknown
          type: Ready
    ```

    Where:
    - `spec.minHealthy`: Minimum percentage of healthy nodes required to initiate recovery.
    - `spec.unhealthyConditions`: List of node status conditions the controller uses to determine if the node is unhealthy.
      - `duration`: Time for a condition to persist before node recovery starts.
      - `type`: Condition type.
      - `status`: Expected status for recognizing a node as unhealthy.

    In the above example, the NodeHealthCheck controller will initiate recovery if a node's `Ready` condition stays `False` or `Unknown` for 60 seconds, i.e., the node is unavailable or down.
  - Navigate to the directory with the file and run this command:

    ```
    kubectl apply -f <file_name>
    ```
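If you prefer to generate the manifest rather than edit the `<application_namespace>` placeholder by hand, the following is a minimal sketch that builds the same `NodeHealthCheck` description as a Python dictionary and prints it as JSON, which `kubectl apply -f` also accepts. The function name `build_node_health_check` and its parameters are illustrative assumptions, not part of the product; the field values are copied from the example above.

```python
import json


def build_node_health_check(namespace, name="nodehc-sample",
                            min_healthy="51%", duration="60s"):
    """Return the NodeHealthCheck manifest from the example as a dict.

    `namespace` is the namespace chosen when installing the application
    (hypothetical helper; only the field values come from the documentation).
    """
    return {
        "apiVersion": "remediation.medik8s.io/v1alpha1",
        "kind": "NodeHealthCheck",
        "metadata": {"name": name},
        "spec": {
            "minHealthy": min_healthy,
            "remediationTemplate": {
                "apiVersion": "self-node-remediation.medik8s.io/v1alpha1",
                "kind": "SelfNodeRemediationTemplate",
                "name": "self-node-remediation-automatic-strategy-template",
                "namespace": namespace,
            },
            "selector": {"matchLabels": {"beta.kubernetes.io/os": "linux"}},
            "unhealthyConditions": [
                {"type": "Ready", "status": "False", "duration": duration},
                {"type": "Ready", "status": "Unknown", "duration": duration},
            ],
        },
    }


if __name__ == "__main__":
    # "remediation-space" is the example namespace used earlier on this page.
    print(json.dumps(build_node_health_check("remediation-space"), indent=2))
```

Redirect the output to a file, e.g., `nodehealthcheck.json`, and pass that file to the `kubectl apply -f` command from the step above.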
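To make the meaning of `minHealthy` and `unhealthyConditions` more concrete, here is a rough sketch of the kind of check they describe. It is not the controller's actual implementation: the class and function names are hypothetical, and two details are assumptions made for illustration, namely that a condition matches once it has lasted at least `duration` and that the percentage is rounded up to a whole number of nodes.

```python
from dataclasses import dataclass


@dataclass
class Condition:
    type: str               # e.g., "Ready"
    status: str             # "True", "False", or "Unknown"
    seconds_in_status: int  # how long the condition has held this status


@dataclass
class UnhealthyRule:
    type: str
    status: str
    duration_seconds: int


# Mirrors spec.unhealthyConditions from the example manifest.
RULES = [
    UnhealthyRule("Ready", "False", 60),
    UnhealthyRule("Ready", "Unknown", 60),
]


def is_unhealthy(conditions, rules=RULES):
    """A node is unhealthy if any condition matches a rule for long enough."""
    return any(
        c.type == r.type
        and c.status == r.status
        and c.seconds_in_status >= r.duration_seconds  # assumption: inclusive
        for c in conditions
        for r in rules
    )


def remediation_allowed(total_nodes, healthy_nodes, min_healthy_percent=51):
    """Recovery may start only while enough nodes are still healthy."""
    # Assumption: the percentage is rounded up to a whole node count.
    required = (total_nodes * min_healthy_percent + 99) // 100
    return healthy_nodes >= required


if __name__ == "__main__":
    node = [Condition("Ready", "Unknown", 75)]
    print(is_unhealthy(node))
    print(remediation_allowed(total_nodes=4, healthy_nodes=3))
```

Under these assumptions, a four-node group with `minHealthy: 51%` needs three healthy nodes before recovery of the fourth can start.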
Use cases:
- Node failure detection.
- Recovery of affected workload by the scheduler.
- Remediation of nodes after failures.
Yandex Cloud technical support is available 24/7. The types of requests you can submit and the response times depend on your pricing plan. You can switch to a paid support plan in the management console. For details, see the technical support terms and conditions.

| Helm chart | Version | Pull-command | Documentation |
|---|---|---|---|
| yandex-cloud/medik8s/kubernetes-node-remediation/chart/kubernetes-node-remediation | 1.0.1 | | Open |

| Docker image | Version | Pull-command |
|---|---|---|
| yandex-cloud/medik8s/kubernetes-node-remediation/self-node-remediation-operator1750332903325272502991409697328138047197170598960 | v0.10.0-y | |
| yandex-cloud/medik8s/kubernetes-node-remediation/node-healthcheck-operator1750332903325272502991409697328138047197170598960 | v0.7.0 | |
| yandex-cloud/medik8s/kubernetes-node-remediation/kube-rbac-proxy1750332903325272502991409697328138047197170598960 | v0.19.0 | |