
PVC stuck in Pending when CSINode is deleted during provisioning phase #1437


Description

@Fricounet

What happened:

I got a PVC stuck in an infinite retry loop during provisioning, which I think is caused by the changes introduced in #1413.

The PVC had the following events:

Events:
  Type     Reason                Age                    From                                                                                          Message
  ----     ------                ----                   ----                                                                                          -------
  Normal   WaitForFirstConsumer  4h3m (x2 over 4h3m)    persistentvolume-controller                                                                   waiting for first consumer to be created before binding
  Normal   ExternalProvisioning  177m (x263 over 4h3m)  persistentvolume-controller                                                                   Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Normal   ExternalProvisioning  68m (x421 over 173m)   persistentvolume-controller                                                                   Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Warning  ProvisioningFailed    64m (x56 over 4h3m)    ebs.csi.aws.com_aws-ebs-csi-controller-7b7f86db84-9t2m6_11663768-627d-419a-93ad-24f794b77af6  error generating accessibility requirements: no topology key found for node ip-10-150-80-32.us-west-2.compute.internal
  Normal   Provisioning          4m34s (x72 over 4h3m)  ebs.csi.aws.com_aws-ebs-csi-controller-7b7f86db84-9t2m6_11663768-627d-419a-93ad-24f794b77af6  External provisioner is provisioning volume for claim "cass-ukv-xpq-multi-step-execution/server-data-ukv-xpq-multi-step-execution-dc1-1c19-ra-sts-0"
  Warning  ProvisioningFailed    4m34s (x16 over 59m)   ebs.csi.aws.com_aws-ebs-csi-controller-7b7f86db84-9t2m6_11663768-627d-419a-93ad-24f794b77af6  error generating accessibility requirements: failed to get selected CSINode ip-10-150-80-32.us-west-2.compute.internal: csinode.storage.k8s.io "ip-10-150-80-32.us-west-2.compute.internal" not found
  Normal   ExternalProvisioning  63s (x261 over 66m)    persistentvolume-controller                                                                   Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.

From the audit logs I gathered the following timeline:

  1. A CSI node pod is restarted on a node and the new pod fails to start (for a reason unrelated to this bug). (audit log screenshot)
  2. At the same time, the CSINode object gets cleared of the driver entry because the CSI node pod is unable to register itself (expected). (audit log screenshot)
  3. Multiple hours later, a pod gets scheduled onto the node and the CSI provisioner attempts to provision the PV, but fails because of the still-missing topology (expected). (audit log screenshot)
  4. A few hours later, the Node object gets deleted alongside the CSINode (expected). (audit log screenshot)
  5. The PVC provisioning events switch to error generating accessibility requirements: failed to get selected CSINode ip-10-150-80-32.us-west-2.compute.internal: csinode.storage.k8s.io "ip-10-150-80-32.us-west-2.compute.internal" not found because of the now-missing CSINode, and they never change (unexpected).
  6. I recovered the situation by manually deleting the selected-node annotation so that the Pod+PVC could be rescheduled somewhere else (see the command sketch after this list).
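For reference, the manual recovery in step 6 amounts to removing the selected-node annotation from the stuck PVC. A minimal sketch of that workaround, assuming the standard volume.kubernetes.io/selected-node annotation (substitute the namespace and PVC name from the events above):

# Drop the selected-node annotation so the scheduler/provisioner can pick a new node
kubectl -n <namespace> annotate pvc <pvc-name> volume.kubernetes.io/selected-node-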

What you expected to happen:

Once the CSINode and Node objects have been deleted from the cluster, I expect the CSI provisioner to recover the situation and delete the selected-node annotation by itself, as it did before this change.
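For completeness, the stuck state can be confirmed by checking that the Node and CSINode are gone while the PVC still points at them; a rough sketch (node name taken from the events above, namespace and PVC name to be substituted):

# Both lookups should return NotFound once the node has been removed from the cluster
kubectl get node ip-10-150-80-32.us-west-2.compute.internal
kubectl get csinode ip-10-150-80-32.us-west-2.compute.internal

# ...yet the PVC still carries the selected-node annotation for the deleted node
kubectl -n <namespace> get pvc <pvc-name> -o yaml | grep selected-node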

How to reproduce it:

I don't have an exact reproducer, but I think it can be reproduced as follows (see the sketch after the steps):

  1. Force a long volume provisioning (for instance, remove the CSI node pod so that the driver gets deregistered by the kubelet and there are no topology keys left)
  2. Schedule a pod with a PVC on the node with the missing topology
  3. Delete Node and CSINode objects
  4. The PVC should now be stuck until something removes the selected-node annotation
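A rough kubectl sketch of these steps; the DaemonSet name and namespace assume a default aws-ebs-csi-driver install and are only illustrative, and any other way of keeping the CSI node pod down works just as well:

NODE=<target-node-name>

# 1. Keep the CSI node pods from running so the driver gets deregistered from the
#    CSINode object (unsatisfiable nodeSelector; DaemonSet name is an assumption).
kubectl -n kube-system patch daemonset ebs-csi-node --type merge \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"ebs-csi-node-disabled":"true"}}}}}'

# 2. Create a pod with a WaitForFirstConsumer PVC pinned to $NODE and wait for the
#    "no topology key found" ProvisioningFailed events.

# 3. Delete the Node and CSINode objects.
kubectl delete node "$NODE"
kubectl delete csinode "$NODE" --ignore-not-found

# 4. The PVC should stay Pending, still carrying the
#    volume.kubernetes.io/selected-node annotation for the deleted node.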

Anything else we need to know?:

Environment:

  • Driver version: aws-ebs-csi v1.52.0 with csi-provisioner v6.0.0
  • Kubernetes version (use kubectl version): v1.33
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
