Describe the problem/challenge you have
Currently, recovering from a disk failure with lvm-localpv (i.e. replacing a failed disk) is a very manual and error-prone task.
Imagine the following scenario:
- Clustered workloads such as OpenSearch, MySQL Galera, etc., deployed as StatefulSets on different Kubernetes nodes (for simplicity, consider a three-node cluster)
- Since there is HA/replication at the application level, there is no need to replicate the disks, e.g. with hardware RAID (or by using Mayastor)
- If one disk fails, the application continues to run; once a Pod that was backed by the failed disk comes back online, the data is rebuilt from the remaining cluster members.
But this won't work, because the Pod backed by the faulty disk is stuck: it won't come up since its PV (and its LV) is not available.
The current "solution" looks like this:
- Recreate all missing/required LVs manually on the affected node
- This requires finding out which LVs are needed, e.g. by running `kubectl get lvmvolumes.local.openebs.io`. There are automatic backups of volume-group metadata, but since they are taken BEFORE any operation is performed, they are not consistent with the latest running configuration and cannot be used for a metadata restore.
- There might be leftovers of the old LVs in the OS that have to be cleaned up
Once this is done, everything is fine: lvm-localpv uses the new LVs, creates a filesystem, and mounts the PV. From this point onward the application takes care of the data recovery.
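The "find out which LVs are needed" step above can be partially scripted. Below is a minimal sketch (not part of the driver) that turns the JSON output of `kubectl get lvmvolumes.local.openebs.io -o json` into the `lvcreate` commands needed on a given node. The field names (`spec.ownerNodeID`, `spec.volGroup`, `spec.capacity`) are assumptions based on the LVMVolume CR and should be checked against your driver version:

```python
import json

def lvcreate_commands(lvmvolumes_json: str, node: str) -> list[str]:
    """Emit the lvcreate commands for all LVs that should exist on `node`.

    `lvmvolumes_json` is the output of:
        kubectl get lvmvolumes.local.openebs.io -n openebs -o json

    Assumed CR fields (verify against your lvm-localpv version):
        spec.ownerNodeID  -- node that hosts the volume
        spec.volGroup     -- LVM volume group backing the volume
        spec.capacity     -- size in bytes, as a string
    """
    cmds = []
    for item in json.loads(lvmvolumes_json)["items"]:
        spec = item["spec"]
        if spec.get("ownerNodeID") != node:
            continue
        # The LV is named after the LVMVolume CR (i.e. the pvc-... UID).
        name = item["metadata"]["name"]
        # lvcreate accepts a size with a "b" (bytes) suffix.
        cmds.append(f"lvcreate -L {spec['capacity']}b -n {name} {spec['volGroup']}")
    return cmds
```

The commands are only printed/returned, not executed, so the list can be reviewed before running it on the affected node.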
Describe the solution you'd like
The localpv-lvm provisioner could be used to recreate missing LVs. The provisioner knows which LVs should exist on each node; if they don't, it could recreate the missing ones and make sure that leftovers of the old LVs are removed. This auto-recreate behaviour could be an option (disabled by default) in a StorageClass.
It would be pretty cool if the recovery operations could be supported by a kubectl plugin. Like the Mayastor plugin, there could be a localpv-lvm plugin that knows about all PVs/PVCs and the node/volume group backing each PV. By issuing a command like `kubectl localpv-lvm recreate <PV-ID>`, or a combination such as `kubectl localpv-lvm get volume -l openebs.io/nodename=node1,openebs.io/volgroup=openebs_vg0 | kubectl localpv-lvm recreate`, the recovery process could be very simple.
Environment:
- LVM driver version: v2.8.0
- Kubernetes version (use `kubectl version`): v1.31.1
- Kubernetes installer & version: kubeadm v1.31.1
- Cloud provider or hardware configuration: Bare-Metal (and VMs for testing)
- OS (e.g. from `/etc/os-release`): AlmaLinux 9.3 / AlmaLinux 9.5