
NodeUnpublishVolume reports success after EBUSY, leaving stale mount and breaking next NodePublishVolume #461

@Liamlu28

Description


Bug report

We hit a stale mount issue with openebs/lvm-driver:1.8.0 (local.csi.openebs.io) on a Kubernetes production cluster.

The main problem is:

  • NodeUnpublishVolume fails with device or resource busy (EBUSY)
  • but the driver still logs umount done / has been unmounted
  • the old mount actually remains on the host
  • the next NodePublishVolume for the same volume then fails with:
    verifyMount: device already mounted at ...

This leaves StatefulSet rollouts stuck until we manually clean the stale mount from the host mount namespace.

Environment

  • Driver: docker.io/openebs/lvm-driver:1.8.0
  • CSI driver: local.csi.openebs.io
  • StorageClass: localssd
  • Access mode: RWOP
  • Node-local LVM volumes
  • Kubernetes: production cluster v1.33.7
  • Workload type: StatefulSet rolling update

What happened

During a rolling update, old pods were deleted, and replacement pods were scheduled on the same node.

The driver received NodeUnpublishVolume for the old pod path, but the unmount/cleanup failed with EBUSY. The driver nevertheless proceeded as if unpublish had succeeded.

Immediately after that, NodePublishVolume for the replacement pod failed because the device was still mounted at the old pod path.

This blocked the new pod from mounting the same PVC.

Relevant logs

NodeUnpublishVolume

GRPC call: /csi.v1.Node/NodeUnpublishVolume requests {"target_path":"/var/lib/kubelet/pods/9c76d145-5a55-45ac-b13c-b2323e62c048/volumes/kubernetes.io~csi/pvc-96e56831-c0c1-4c70-b06c-709dd5806565/mount","volume_id":"pvc-96e56831-c0c1-4c70-b06c-709dd5806565"}

E... mount.go:136] lvm: failed to remove mount path vol pvc-96e56831-c0c1-4c70-b06c-709dd5806565 err : remove /var/lib/kubelet/pods/9c76d145-5a55-45ac-b13c-b2323e62c048/volumes/kubernetes.io~csi/pvc-96e56831-c0c1-4c70-b06c-709dd5806565/mount: device or resource busy

I... mount.go:139] umount done pvc-96e56831-c0c1-4c70-b06c-709dd5806565 path /var/lib/kubelet/pods/9c76d145-5a55-45ac-b13c-b2323e62c048/volumes/kubernetes.io~csi/pvc-96e56831-c0c1-4c70-b06c-709dd5806565/mount

I... agent.go:318] hostpath: volume pvc-96e56831-c0c1-4c70-b06c-709dd5806565 path: /var/lib/kubelet/pods/9c76d145-5a55-45ac-b13c-b2323e62c048/volumes/kubernetes.io~csi/pvc-96e56831-c0c1-4c70-b06c-709dd5806565/mount has been unmounted.

Next NodePublishVolume

GRPC call: /csi.v1.Node/NodePublishVolume requests {"target_path":"/var/lib/kubelet/pods/270b71b0-cd17-4719-a2cc-f6334faec8be/volumes/kubernetes.io~csi/pvc-96e56831-c0c1-4c70-b06c-709dd5806565/mount","volume_id":"pvc-96e56831-c0c1-4c70-b06c-709dd5806565"}

E... mount.go:186] can not mount, volume:pvc-96e56831-c0c1-4c70-b06c-709dd5806565 already mounted dev /dev/mapper/vg_data-pvc--96e56831--c0c1--4c70--b06c--709dd5806565 mounts: [/var/lib/kubelet/pods/9c76d145-5a55-45ac-b13c-b2323e62c048/volumes/kubernetes.io~csi/pvc-96e56831-c0c1-4c70-b06c-709dd5806565/mount ...]

E... grpc.go:79] GRPC error: rpc error: code = Internal desc = verifyMount: device already mounted at [/var/lib/kubelet/pods/9c76d145-5a55-45ac-b13c-b2323e62c048/volumes/kubernetes.io~csi/pvc-96e56831-c0c1-4c70-b06c-709dd5806565/mount ...]


## Why we believe this is a driver bug
The driver logs an unmount/remove error (device or resource busy) but still proceeds to log success (umount done, has been unmounted).

In reality, the mount remained present in the host mount namespace, and we had to manually clean it with host-level umount -l.

That suggests NodeUnpublishVolume may return success even though cleanup is incomplete.
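For illustration only, here is a minimal Go sketch (hypothetical names, not the actual driver source) of the control flow that would produce exactly this log pattern: the error from removing the mount path is logged but then dropped, so the caller goes on to log success:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errBusy stands in for the EBUSY the kernel returns while the
// mount point is still in use.
var errBusy = errors.New("device or resource busy")

// removeMountPath simulates the suspected buggy cleanup: the error
// from the remove step is logged and then silently discarded.
func removeMountPath(remove func(string) error, path string) error {
	if err := remove(path); err != nil {
		log.Printf("lvm: failed to remove mount path %s err : %v", path, err)
		// BUG (suspected): no `return err` here, so the caller
		// proceeds as if cleanup succeeded.
	}
	return nil
}

func main() {
	path := "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pvc>/mount"
	if err := removeMountPath(func(string) error { return errBusy }, path); err == nil {
		// Mirrors the contradictory "umount done" line in the logs.
		fmt.Println("umount done, path", path)
	}
}
```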

## Workaround used
We recovered by manually entering the host mount namespace and lazily unmounting the stale old pod mount path, then retrying the pod rollout.

Manually cleaning mounts on the host is neither safe nor acceptable as a routine operational requirement in production.
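Before resorting to the host-level lazy unmount, we confirmed the stale mount by checking the mounts table. A small Go sketch (assuming Linux and /proc/self/mounts; the target path is a placeholder) of that check, which is essentially the verification NodeUnpublishVolume could perform before reporting success:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// isMounted reports whether target appears as a mount point in the
// given mounts table (format of /proc/self/mounts: "dev target fstype ...").
func isMounted(mountsTable, target string) bool {
	sc := bufio.NewScanner(strings.NewReader(mountsTable))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[1] == target {
			return true
		}
	}
	return false
}

func main() {
	data, err := os.ReadFile("/proc/self/mounts")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read mounts:", err)
		os.Exit(1)
	}
	// Placeholder target path; substitute the real pod volume path.
	target := "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pvc>/mount"
	fmt.Println("still mounted:", isMounted(string(data), target))
}
```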

## Expected behavior
If unpublish cleanup fails with EBUSY / the stale mount is still present, the driver should:

  • return an error from NodeUnpublishVolume
  • not log or report unmount success
  • thereby prevent the next NodePublishVolume from proceeding into a stale-mount state

## Questions

  • Is this a known issue in lvm-driver 1.8.0?
  • Has it been fixed in a newer release?
  • Should NodeUnpublishVolume fail hard on device or resource busy instead of logging success?
