Overview
A bug in older versions of DRBD and LINSTOR can cause a storage node to incorrectly treat a freshly provisioned replica as already up to date, skipping the data synchronization that should occur when a node is added or replaced. If this happens, your application could silently read corrupted or stale data without any error being reported.
The root cause is that DRBD tracks which replica holds the most recent data using internal version identifiers. Volumes created with older software may never have had those identifiers updated after the initial write, leaving them unable to tell a new empty replica apart from an existing one. The bug is fixed in DRBD 9.2.18 / 9.3.2 and LINSTOR 1.33.2 (or 1.34+): upgrading prevents any new volumes from being affected.
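To check whether a given version string is at or past the fixed releases named above, a comparison along these lines may help. This is a minimal sketch: the function names are hypothetical and not part of DRBD or LINSTOR, it assumes GNU `sort -V` is available, and it treats any DRBD series older than 9.2 as not fixed because the advisory names no fixed release for them.

```shell
# Sketch only: helper names are illustrative, not part of any LINBIT tool.

# version_ge A B: succeeds when version A >= version B (relies on GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# drbd_is_fixed VERSION: 9.2 series needs 9.2.18 or later,
# everything else needs 9.3.2 or later.
drbd_is_fixed() {
  case "$1" in
    9.2.*) version_ge "$1" 9.2.18 ;;
    *)     version_ge "$1" 9.3.2 ;;
  esac
}

# linstor_is_fixed VERSION: 1.33.2 or later, which also covers 1.34 and up.
linstor_is_fixed() {
  version_ge "$1" 1.33.2
}
```

For example, `drbd_is_fixed 9.2.17 || echo "upgrade needed"` prints the message. How you obtain the version string is up to you; the first line of /proc/drbd on a node is one likely place to find the loaded module version.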
Upgrading alone is not enough. Volumes that are already in the affected state need a one-time corrective action to permanently resolve the issue. The steps below walk you through identifying the affected volumes and fixing them.
The remediation procedure consists of four steps:
- Identify affected resources using a provided Kubernetes job.
- Review the labeled PersistentVolumes.
- Trigger a UUID bump for each affected volume.
- Verify and clean up.
Step 1: Identify Affected Resources
Apply the detection job below. It spawns a privileged pod on every cluster node, scans all DRBD volumes for the day-0 UUID condition (version identifiers that were never rotated after the volume was created), and labels any matching PersistentVolume with linstor.csi.linbit.com/unrotated-uuid=true.
kubectl apply -n linbit-sds -f https://charts.linstor.io/advisories/drbd-unrotated-uuid/find-unrotated-uuids.yaml
Wait for the control job to finish (it cleans up the per-node jobs automatically):
kubectl wait --for=condition=complete --timeout=600s \
job/find-unrotated-uuids-ctrl -n linbit-sds
Inspect the job log to see which resources were found and which PVs were labeled:
kubectl logs -n linbit-sds job/find-unrotated-uuids-ctrl
Step 2: Review Affected PersistentVolumes
List every PV that was labeled by the detection job:
kubectl get pv -l linstor.csi.linbit.com/unrotated-uuid=true
If no PVs are listed, your cluster is not affected and no further action is required.
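When PVs are listed, it can be convenient to see each one next to the DRBD resource name that Step 3 operates on. The sketch below wraps the same label query in a hypothetical helper, `list_affected_resources`, and uses kubectl's JSONPath output to print the PV name and its CSI volume handle, which Step 3 uses as the resource name.

```shell
# Sketch only: list_affected_resources is an illustrative name.
# Prints one "<pv-name><TAB><resource-name>" line per labeled PV.
list_affected_resources() {
  kubectl get pv -l linstor.csi.linbit.com/unrotated-uuid=true \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.csi.volumeHandle}{"\n"}{end}'
}
```

The output can be fed into a loop, for example `list_affected_resources | while read -r pv resource; do echo "$pv -> $resource"; done`, to work through the volumes one at a time.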
Step 3: Trigger a UUID Bump
Perform the following procedure for each affected PV. The right approach depends on how many diskful replicas the underlying LINSTOR resource has.
Volumes with two or more diskful replicas (most common)
Gracefully disconnect a secondary replica, wait for one write to reach the primary, then reconnect. The write causes DRBD to rotate the UUID. No application downtime is required.
- Get the DRBD resource name from the PV:
RESOURCE=$(kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeHandle}')
echo $RESOURCE
- List the replicas and identify a secondary (non-primary) diskful node:
kubectl exec -n linbit-sds deploy/linstor-controller -- \
linstor resource list -r $RESOURCE
- Disconnect the resource on that node via the LINSTOR satellite pod:
kubectl exec -n linbit-sds ds/linstor-satellite.<NODE> -- drbdadm disconnect $RESOURCE
- Ensure at least one write reaches the primary. If the PVC is actively used by a running workload, any application write is sufficient. Otherwise, exec into a pod that mounts the PVC and write a small amount of data, for example:
kubectl exec <pod> -- sh -c 'dd if=/dev/urandom bs=4k count=1 of=<mountpath>/.uuid-bump && sync && rm <mountpath>/.uuid-bump'
- Reconnect the replica:
kubectl exec -n linbit-sds ds/linstor-satellite.<NODE> -- drbdadm connect $RESOURCE
- Wait for the resync to complete. All replicas must return to UpToDate before proceeding to the next volume:
kubectl exec -n linbit-sds deploy/linstor-controller -- \
linstor resource list -r $RESOURCE
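If you have many volumes to process, the disconnect, write, and reconnect sequence can be wrapped so that the reconnect is always attempted, even when the write fails. The following is a sketch under assumptions: `bump_uuid_multi` is a hypothetical helper, the workload pod is assumed to live in the current namespace (as in the manual command), and you still need to pick the secondary node yourself and wait for the resync afterwards.

```shell
# Sketch only: bump_uuid_multi is an illustrative name.
# Usage: bump_uuid_multi RESOURCE NODE POD MOUNTPATH
bump_uuid_multi() {
  local resource=$1 node=$2 pod=$3 mountpath=$4
  local satellite="ds/linstor-satellite.$node"
  local rc=0

  # Disconnect the secondary replica; stop here if that fails.
  kubectl exec -n linbit-sds "$satellite" -- drbdadm disconnect "$resource" || return 1

  # Force one small write through a pod that mounts the PVC.
  kubectl exec "$pod" -- sh -c \
    "dd if=/dev/urandom bs=4k count=1 of='$mountpath/.uuid-bump' && sync && rm '$mountpath/.uuid-bump'" \
    || rc=1

  # Reconnect regardless of whether the write succeeded.
  kubectl exec -n linbit-sds "$satellite" -- drbdadm connect "$resource" || rc=1
  return "$rc"
}
```

A non-zero return value means either the write or the reconnect failed, so check the replica state by hand before moving on to the next volume.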
Volumes with a single diskful replica
When there is only one diskful replica, toggling to diskless would remove the only copy of the data.
Instead, use drbdadm new-current-uuid directly on the hosting node.
This requires briefly quiescing I/O on the resource.
- Scale down all workloads (Deployments, StatefulSets, etc.) that use the PVC.
- Identify the node hosting the diskful replica (from linstor resource list -r $RESOURCE) and run the following commands in the LINSTOR satellite pod on that node:
kubectl exec -n linbit-sds ds/linstor-satellite.<NODE> -- drbdadm disconnect $RESOURCE
kubectl exec -n linbit-sds ds/linstor-satellite.<NODE> -- drbdadm new-current-uuid $RESOURCE/0
kubectl exec -n linbit-sds ds/linstor-satellite.<NODE> -- drbdadm connect $RESOURCE
- Scale the workloads back up. LINSTOR will reconnect the resource automatically.
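Scaling a workload down and back up can be sketched as follows, remembering the original replica count so it can be restored. The helper names are hypothetical, the target is given in kubectl's KIND/NAME form (for example deployment/my-app), and the sketch does not wait for the pods to terminate, so confirm that no pod still mounts the PVC before running the drbdadm commands.

```shell
# Sketch only: helper names are illustrative.

# scale_down_saving KIND/NAME NAMESPACE
# Prints the current replica count on stdout, then scales the workload to zero.
scale_down_saving() {
  local target=$1 ns=$2 replicas
  replicas=$(kubectl get "$target" -n "$ns" -o jsonpath='{.spec.replicas}') || return 1
  kubectl scale "$target" -n "$ns" --replicas=0 >&2 || return 1
  printf '%s\n' "$replicas"
}

# scale_restore KIND/NAME NAMESPACE REPLICAS
# Scales the workload back to the saved replica count.
scale_restore() {
  kubectl scale "$1" -n "$2" --replicas="$3"
}
```

A typical use is `saved=$(scale_down_saving statefulset/db default)` before the UUID bump and `scale_restore statefulset/db default "$saved"` after it.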
Step 4: Verify and Clean Up
After remediating all affected volumes, follow these steps to confirm the cluster is clean.
- Remove the advisory labels so the detection job starts with a clean slate:
kubectl label pv -l linstor.csi.linbit.com/unrotated-uuid=true \
linstor.csi.linbit.com/unrotated-uuid-
- Delete and re-apply the detection job:
kubectl delete -n linbit-sds -f https://charts.linstor.io/advisories/drbd-unrotated-uuid/find-unrotated-uuids.yaml
kubectl apply -n linbit-sds -f https://charts.linstor.io/advisories/drbd-unrotated-uuid/find-unrotated-uuids.yaml
kubectl wait --for=condition=complete --timeout=600s \
job/find-unrotated-uuids-ctrl -n linbit-sds
- Confirm no PVs were labeled. If any appear, repeat Step 3 for the remaining volumes.
kubectl get pv -l linstor.csi.linbit.com/unrotated-uuid=true
- Remove the detection job and its RBAC resources:
kubectl delete -n linbit-sds -f https://charts.linstor.io/advisories/drbd-unrotated-uuid/find-unrotated-uuids.yaml
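The confirmation that no PVs remain labeled can also be expressed as a check that fails when any are left, which is useful in a script or runbook. This is a minimal sketch with hypothetical helper names; it relies only on the label query used throughout this advisory and on kubectl's `-o name` output, which prints one line per matching object.

```shell
# Sketch only: helper names are illustrative.

# count_labeled_pvs: print how many PVs still carry the advisory label.
count_labeled_pvs() {
  # grep -c prints 0 and exits non-zero on empty input, hence the "|| true".
  kubectl get pv -l linstor.csi.linbit.com/unrotated-uuid=true -o name | grep -c . || true
}

# assert_clean: succeed only when no labeled PV remains.
assert_clean() {
  local n
  n=$(count_labeled_pvs)
  if [ "$n" -eq 0 ]; then
    echo "no labeled PVs remain"
  else
    echo "$n PV(s) still labeled; repeat Step 3 for them" >&2
    return 1
  fi
}
```

The check is only meaningful after the detection job has been re-run and has completed, because the labels reflect the result of the most recent scan.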
Suspected Past Corruption
The detection job identifies volumes that are currently in the vulnerable state. It cannot determine whether corruption already occurred on a volume whose UUID has since been rotated by normal operation (for example, after a node was replaced). If you have reason to believe that a node was re-provisioned while running an affected DRBD and LINSTOR version, please contact LINBIT support for guidance.