Kubernetes Node Remediation

Updated December 10, 2025

Kubernetes Node Remediation is a tool comprising several Kubernetes operators that provide automatic recovery of failed Managed Service for Kubernetes cluster nodes and high availability for stateful workloads.

The solution features two controllers:

  • Node Healthcheck Controller, which tracks failures.
  • Self Node Remediation Controller, which moves workloads off unhealthy nodes and then restores those nodes.

Deployment instructions

  1. Install kubectl and configure it to work with your cluster.

  2. Create a node group for Kubernetes Node Remediation.

  3. Configure the application:

    • Namespace: Create a new namespace, e.g., remediation-space. If you keep the default namespace, Kubernetes Node Remediation may not work correctly.
    • Application name: Specify the application name.
  4. Click Install.

  5. Wait for the application to change its status to Deployed.

  6. Create the NodeHealthCheck resource:

    1. Create a file with the NodeHealthCheck description:

      apiVersion: remediation.medik8s.io/v1alpha1
      kind: NodeHealthCheck
      metadata:
        name: nodehc-sample
      spec:
        minHealthy: 51%
        remediationTemplate:
          apiVersion: self-node-remediation.medik8s.io/v1alpha1
          kind: SelfNodeRemediationTemplate
          name: self-node-remediation-automatic-strategy-template
          namespace: <application_namespace>
        selector:
          matchLabels:
            beta.kubernetes.io/os: linux
        unhealthyConditions:
        - duration: 60s
          status: "False"
          type: Ready
        - duration: 60s
          status: Unknown
          type: Ready
      

      Where:

      • spec.minHealthy: Minimum share of healthy nodes, among those matched by spec.selector, that must remain for recovery to start.

      • spec.unhealthyConditions: List of node status conditions the controller uses to determine if the node is unhealthy.

        • duration: Time for a condition to persist before node recovery starts.
        • type: Condition type.
        • status: Expected status for recognizing a node as unhealthy.

      In the example above, the NodeHealthCheck controller initiates recovery once a node's Ready condition has had the False or Unknown status for 60 seconds, that is, once the node has been unavailable or unreachable for that long.

      Learn more about resource fields.

    2. Navigate to the directory with the file and run this command:

      kubectl apply -f <file_name>
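
As a rough illustration of the spec.minHealthy: 51% setting in the manifest above, the following shell sketch computes how many nodes must stay healthy before recovery is allowed. It assumes the percentage is scaled against the number of nodes matched by spec.selector and rounded up; this rounding rule and the function names are assumptions made for illustration and are not part of Kubernetes Node Remediation.

```shell
#!/bin/sh
# Sketch only: the rounding rule below (round up) is an assumption,
# not a documented guarantee of the NodeHealthCheck controller.

# min_healthy_nodes TOTAL PERCENT
# Prints the number of nodes that must be healthy (hypothetical helper).
min_healthy_nodes() {
  total=$1
  percent=$2
  # Integer ceiling of total * percent / 100
  echo $(( (total * percent + 99) / 100 ))
}

# remediation_allowed TOTAL HEALTHY PERCENT
# Returns 0 if enough nodes are healthy for recovery to start (hypothetical helper).
remediation_allowed() {
  required=$(min_healthy_nodes "$1" "$3")
  [ "$2" -ge "$required" ]
}

# Example: a five-node group with minHealthy: 51%
min_healthy_nodes 5 51
if remediation_allowed 5 4 51; then
  echo "4 of 5 healthy: recovery proceeds"
fi
if ! remediation_allowed 5 2 51; then
  echo "2 of 5 healthy: recovery is skipped"
fi
```

With five nodes, the sketch prints 3, so under this assumption recovery proceeds while at least three nodes are healthy and is skipped when only two are.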
      
Billing type: Free
Type: Kubernetes® Application
Category: Admin tools and DevOps
Publisher: Yandex Cloud

Use cases
  • Node failure detection.
  • Rescheduling of affected workloads by the scheduler.
  • Remediation of nodes after failures.
Technical support

Yandex Cloud technical support is available 24/7. The types of requests you can submit and their response times depend on your pricing plan. You can switch to a paid support plan in the management console. For details, see the technical support terms and conditions.

Product composition

Helm chart
  • yandex-cloud/medik8s/kubernetes-node-remediation/chart/kubernetes-node-remediation, version 1.0.1

Docker images
  • yandex-cloud/medik8s/kubernetes-node-remediation/self-node-remediation-operator1750332903325272502991409697328138047197170598960, version v0.10.0-y
  • yandex-cloud/medik8s/kubernetes-node-remediation/node-healthcheck-operator1750332903325272502991409697328138047197170598960, version v0.7.0
  • yandex-cloud/medik8s/kubernetes-node-remediation/kube-rbac-proxy1750332903325272502991409697328138047197170598960, version v0.19.0

Terms
By using this product, you agree to the Yandex Cloud Marketplace Terms of Service.