Diagnosing a disk subsystem
This page describes typical issues of the Stackland disk subsystem and how to fix them.
Status diagnostics
Check the status of volume groups on all nodes:
kubectl get volumegroups -A -o wide
Check synchronization conditions:
kubectl get volumegroups -A -o custom-columns="NAME:.metadata.name,NODE:.spec.nodeName,REASON:.status.conditions[0].reason,STATUS:.status.conditions[0].status"
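The order of entries in status.conditions is not guaranteed, so conditions[0] may not be the synchronization condition on every cluster. To select the condition by its type instead of its index, you can use a jsonpath filter (assuming the condition type VolumeGroupSyncedOnNode, as used elsewhere on this page):
kubectl get volumegroups -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="VolumeGroupSyncedOnNode")].status}{"\n"}{end}'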
For detailed diagnostics, connect to the topovgm-operator pod on the relevant node. All LVM utilities are available in the pod:
kubectl -n stackland-volumes exec -it <topovgm_pod_name> -- sh
Run the following LVM commands inside the pod:
pvs # List of physical volumes
vgs # List of volume groups
lvs # List of logical volumes
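If the default output is not detailed enough, standard LVM reporting options let you pick additional columns, for example:
pvs -o pv_name,vg_name,pv_attr,pv_size,pv_free # Physical volumes with attributes and free space
vgs -o vg_name,pv_count,lv_count,vg_size,vg_free # Capacity summary per volume group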
Common issues
This section lists typical disk subsystem issues and their fixes.
VolumeGroupSyncedOnNode = False
The VolumeGroupSyncedOnNode condition in the VolumeGroup resource status is set to False.
- The disk is missing or unavailable. Check the physical connection of the disk. Run pvs in the topovgm-operator pod and make sure that all physical volumes are visible.
- Volume group initialization error. Check the operator logs:
kubectl -n stackland-volumes logs -l app.kubernetes.io/name=topovgm-operator --tail=100
- The disk contains data and was not auto-detected. Check the status.discoveredDevices field of the VolumeGroup resource: it states the reason why the disk was excluded (see the query below).
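For example, to print the discovered devices of a specific volume group together with the exclusion reasons:
kubectl get volumegroup <vg_name> -n stackland-volumes -o jsonpath='{.status.discoveredDevices}'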
PhantomDeviceDetected
The PhantomDeviceDetected condition is set on the node. It means that a disk which earlier belonged to the volume group is temporarily unavailable, but its metadata is preserved in LVM.
- Check the physical connection of the disk.
- If the disk was reconnected, the operator automatically updates the device-mapper tables and restores the volume group.
- If the disk was replaced, follow the corrupt disk replacement procedure.
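To check whether the condition is currently set on a volume group, you can query it by type (this assumes the condition type is named PhantomDeviceDetected, matching the heading above):
kubectl get volumegroup <vg_name> -n stackland-volumes -o jsonpath='{.status.conditions[?(@.type=="PhantomDeviceDetected")]}'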
PVC stuck in Pending status
A PersistentVolumeClaim remains in the Pending status for a long time.
- Not enough storage space in the volume group of the required type. Check for free space:
kubectl get volumegroups -A -o custom-columns="NAME:.metadata.name,NODE:.spec.nodeName,FREE:.status.free,SIZE:.status.size"
- A non-existent storage class is specified. Check the storage class name in the PVC manifest. The valid values are stackland-nvme, stackland-ssd, stackland-hdd, and stackland-other.
- No disks of the required type on the nodes. For example, if stackland-nvme is specified but there are no NVMe disks, no volume group is created for this type.
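In all of these cases, the events attached to the PVC usually name the exact cause. Check them with:
kubectl describe pvc <pvc_name> -n <namespace>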
Disk I/O errors
The DiskIOErrors alert triggers in hardware monitoring. Applications report write or read errors.
- Identify the defective disk based on the Hardware Monitoring dashboard in Grafana.
- If SMART monitoring reports the disk as defective, follow the corrupt disk replacement procedure (see the SMART check example below).
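To inspect SMART data manually, run smartctl against the suspect device on the node (this assumes smartctl is installed on the node; this page does not state that it is available in the topovgm-operator pod):
smartctl -H /dev/<disk_name> # Overall health self-assessment
smartctl -a /dev/<disk_name> # Full SMART report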
Corrupt disk replacement
Warning
Before replacing a disk, make sure its data is backed up or replicated. If the disk has logical volumes containing unreplicated data, this data will be lost.
Step 1: Identify the corrupt disk
Find the volume group with the defective disk:
kubectl get volumegroups -A -o wide
Get detailed status of the volume group:
kubectl describe volumegroup <vg_name> -n stackland-volumes
In the status.physicalVolumes field, find a disk whose attributes field contains the m (missing) flag.
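For example, to print the physical volume entries with their attributes:
kubectl get volumegroup <vg_name> -n stackland-volumes -o jsonpath='{.status.physicalVolumes}'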
Connect to the topovgm-operator pod on the relevant node and check the status of its physical volumes:
kubectl -n stackland-volumes exec -it <topovgm_pod_name> -- pvs
Step 2: Transfer data from the disk (if possible)
If the disk is still available and contains data, move the data to the other disks in the volume group:
kubectl -n stackland-volumes exec -it <topovgm_pod_name> -- pvmove /dev/<disk_name>
If the disk is completely unavailable, skip this step.
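pvmove needs enough free extents on the remaining disks of the group. Before moving the data, compare the used space on the failing disk with the free space in the volume group (assuming the LVM volume group name matches the VolumeGroup resource name; check the vgs output otherwise):
kubectl -n stackland-volumes exec -it <topovgm_pod_name> -- vgs -o vg_name,vg_size,vg_free <vg_name>
kubectl -n stackland-volumes exec -it <topovgm_pod_name> -- pvs -o pv_name,pv_size,pv_free /dev/<disk_name>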
Step 3: Delete the disk from the volume group
If the disk is unavailable, set the following deletion policy in the VolumeGroup resource to remove the missing device without preserving its data:
kubectl patch volumegroup <vg_name> -n stackland-volumes --type=merge \
-p '{"spec":{"deviceLossSynchronizationPolicy":"Remove"}}'
The operator will automatically run vgreduce --removemissing and delete the missing physical volume from the volume group.
When done, set the policy back to its original value:
kubectl patch volumegroup <vg_name> -n stackland-volumes --type=merge \
-p '{"spec":{"deviceLossSynchronizationPolicy":"Fail"}}'
Step 4: Replace the physical disk
Physically replace the disk in the server according to the manufacturer's documentation.
Step 5: Make sure the new disk has been detected
Once the disk is replaced, topovgm-operator will automatically detect the new disk during the next reconciliation cycle.
If auto-detection is on (physicalVolumeSelector not set), the new disk will be added to the volume group automatically.
If an explicit selector is used (physicalVolumeSelector with specific paths), update the VolumeGroup resource by specifying a path to the new disk:
kubectl edit volumegroup <vg_name> -n stackland-volumes
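As a rough sketch, the edited selector may look like the following; the exact schema of physicalVolumeSelector is not described on this page, so verify it against your CRD definition, and the device paths are purely illustrative:
spec:
  physicalVolumeSelector:
    paths:
      - /dev/sda # existing disk
      - /dev/sdb # new disk added after replacement (illustrative path)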
Step 6: Check the recovery
Make sure that the volume group is synchronized:
kubectl get volumegroup <vg_name> -n stackland-volumes -o jsonpath='{.status.conditions[0]}'
The VolumeGroupSyncedOnNode condition should be set to True.
Check the volume group's status in LVM:
kubectl -n stackland-volumes exec -it <topovgm_pod_name> -- vgs
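Instead of polling the condition manually, you can also wait for it with kubectl:
kubectl wait volumegroup/<vg_name> -n stackland-volumes --for=condition=VolumeGroupSyncedOnNode --timeout=5m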