
PVC stuck in Pending when CSINode is deleted during provisioning phase #1437


Description

@Fricounet

What happened:

I got a PVC stuck in an infinite retry loop during provisioning, which I think is caused by the changes introduced in #1413.

The PVC had the following events:

Events:
  Type     Reason                Age                    From                                                                                          Message
  ----     ------                ----                   ----                                                                                          -------
  Normal   WaitForFirstConsumer  4h3m (x2 over 4h3m)    persistentvolume-controller                                                                   waiting for first consumer to be created before binding
  Normal   ExternalProvisioning  177m (x263 over 4h3m)  persistentvolume-controller                                                                   Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Normal   ExternalProvisioning  68m (x421 over 173m)   persistentvolume-controller                                                                   Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
  Warning  ProvisioningFailed    64m (x56 over 4h3m)    ebs.csi.aws.com_aws-ebs-csi-controller-7b7f86db84-9t2m6_11663768-627d-419a-93ad-24f794b77af6  error generating accessibility requirements: no topology key found for node ip-10-150-80-32.us-west-2.compute.internal
  Normal   Provisioning          4m34s (x72 over 4h3m)  ebs.csi.aws.com_aws-ebs-csi-controller-7b7f86db84-9t2m6_11663768-627d-419a-93ad-24f794b77af6  External provisioner is provisioning volume for claim "cass-ukv-xpq-multi-step-execution/server-data-ukv-xpq-multi-step-execution-dc1-1c19-ra-sts-0"
  Warning  ProvisioningFailed    4m34s (x16 over 59m)   ebs.csi.aws.com_aws-ebs-csi-controller-7b7f86db84-9t2m6_11663768-627d-419a-93ad-24f794b77af6  error generating accessibility requirements: failed to get selected CSINode ip-10-150-80-32.us-west-2.compute.internal: csinode.storage.k8s.io "ip-10-150-80-32.us-west-2.compute.internal" not found
  Normal   ExternalProvisioning  63s (x261 over 66m)    persistentvolume-controller                                                                   Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.

From the audit logs I gathered the following timeline:

  1. A CSI node pod is restarted on a node and the new pod fails to start (for a reason unrelated to this bug). (audit log screenshot)
  2. At the same time, the CSINode object gets cleared of the driver entry because the CSI node pod is unable to register itself (expected). (audit log screenshot)
  3. Multiple hours later, a pod gets scheduled onto the node and the CSI provisioner attempts to provision the PV, but fails because of the still-missing topology (expected). (audit log screenshot)
  4. A few hours later, the Node object gets deleted alongside the CSINode (expected). (audit log screenshot)
  5. The PVC provisioning events switch to error generating accessibility requirements: failed to get selected CSINode ip-10-150-80-32.us-west-2.compute.internal: csinode.storage.k8s.io "ip-10-150-80-32.us-west-2.compute.internal" not found because of the now-missing CSINode, and they never change (unexpected).
  6. I recovered the situation by manually deleting the selected-node annotation so that the Pod+PVC could be rescheduled somewhere else (see the command sketch after this list).
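For reference, the manual recovery in step 6 amounts to removing the selected-node annotation from the stuck PVC. A minimal sketch of that workaround, assuming the standard volume.kubernetes.io/selected-node annotation (substitute the namespace and PVC name from the events above):

# Drop the selected-node annotation so the scheduler/provisioner can pick a new node
kubectl -n <namespace> annotate pvc <pvc-name> volume.kubernetes.io/selected-node-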

What you expected to happen:

Once the CSINode and Node objects have been deleted from the cluster, I expect the CSI provisioner to recover the situation and delete the selected-node annotation by itself, as it did before this change.
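For completeness, the stuck state can be confirmed by checking that the Node and CSINode are gone while the PVC still points at them; a rough sketch (node name taken from the events above, namespace and PVC name to be substituted):

# Both lookups should return NotFound once the node has been removed from the cluster
kubectl get node ip-10-150-80-32.us-west-2.compute.internal
kubectl get csinode ip-10-150-80-32.us-west-2.compute.internal

# ...yet the PVC still carries the selected-node annotation for the deleted node
kubectl -n <namespace> get pvc <pvc-name> -o yaml | grep selected-node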

How to reproduce it:

I don't have an exact reproducer, but I think it can be reproduced as follows (see the sketch after the steps):

  1. Force a long volume provisioning (for instance, remove the CSI node pod so that the driver gets deregistered by the kubelet and there are no topology keys left)
  2. Schedule a pod with a PVC on the node with the missing topology
  3. Delete Node and CSINode objects
  4. The PVC should now be stuck until something removes the selected-node annotation
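A rough kubectl sketch of these steps; the DaemonSet name and namespace assume a default aws-ebs-csi-driver install and are only illustrative, and any other way of keeping the CSI node pod down works just as well:

NODE=<target-node-name>

# 1. Keep the CSI node pods from running so the driver gets deregistered from the
#    CSINode object (unsatisfiable nodeSelector; DaemonSet name is an assumption).
kubectl -n kube-system patch daemonset ebs-csi-node --type merge \
  -p '{"spec":{"template":{"spec":{"nodeSelector":{"ebs-csi-node-disabled":"true"}}}}}'

# 2. Create a pod with a WaitForFirstConsumer PVC pinned to $NODE and wait for the
#    "no topology key found" ProvisioningFailed events.

# 3. Delete the Node and CSINode objects.
kubectl delete node "$NODE"
kubectl delete csinode "$NODE" --ignore-not-found

# 4. The PVC should stay Pending, still carrying the
#    volume.kubernetes.io/selected-node annotation for the deleted node.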

Anything else we need to know?:

Environment:

  • Driver version: aws-ebs-csi v1.52.0 with csi-provisioner v6.0.0
  • Kubernetes version (use kubectl version): v1.33
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
