Description
What happened:
A PVC got stuck in an infinite retry loop during provisioning, which I believe is caused by the changes introduced in #1413.
The PVC had the following events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal WaitForFirstConsumer 4h3m (x2 over 4h3m) persistentvolume-controller waiting for first consumer to be created before binding
Normal ExternalProvisioning 177m (x263 over 4h3m) persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
Normal ExternalProvisioning 68m (x421 over 173m) persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
Warning ProvisioningFailed 64m (x56 over 4h3m) ebs.csi.aws.com_aws-ebs-csi-controller-7b7f86db84-9t2m6_11663768-627d-419a-93ad-24f794b77af6 error generating accessibility requirements: no topology key found for node ip-10-150-80-32.us-west-2.compute.internal
Normal Provisioning 4m34s (x72 over 4h3m) ebs.csi.aws.com_aws-ebs-csi-controller-7b7f86db84-9t2m6_11663768-627d-419a-93ad-24f794b77af6 External provisioner is provisioning volume for claim "cass-ukv-xpq-multi-step-execution/server-data-ukv-xpq-multi-step-execution-dc1-1c19-ra-sts-0"
Warning ProvisioningFailed 4m34s (x16 over 59m) ebs.csi.aws.com_aws-ebs-csi-controller-7b7f86db84-9t2m6_11663768-627d-419a-93ad-24f794b77af6 error generating accessibility requirements: failed to get selected CSINode ip-10-150-80-32.us-west-2.compute.internal: csinode.storage.k8s.io "ip-10-150-80-32.us-west-2.compute.internal" not found
Normal ExternalProvisioning 63s (x261 over 66m) persistentvolume-controller Waiting for a volume to be created either by the external provisioner 'ebs.csi.aws.com' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
From the audit logs I gathered the following timeline:
- A CSI node pod is restarted on a node and the new pod fails to start (reason unrelated to the bug)
- At the same time, the driver is removed from the CSINode object because the CSI node pod is unable to register itself (expected)
- Multiple hours later, a pod gets scheduled onto the node and the CSI provisioner attempts to provision the PV but fails because the topology is still missing (expected)
- A few hours later, the Node object gets deleted alongside the CSINode (expected)
- The PVC provisioning events then become the following, due to the missing CSINode, and they never change (unexpected):
  ebs.csi.aws.com_aws-ebs-csi-controller-7b7f86db84-9t2m6_11663768-627d-419a-93ad-24f794b77af6 error generating accessibility requirements: failed to get selected CSINode ip-10-150-80-32.us-west-2.compute.internal: csinode.storage.k8s.io "ip-10-150-80-32.us-west-2.compute.internal" not found
- I recovered the situation by manually deleting the selected-node annotation to let the Pod+PVC be rescheduled somewhere else
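For anyone hitting the same state, one way to perform this kind of manual recovery is sketched below. It assumes the annotation key is volume.kubernetes.io/selected-node and removes it with a JSON merge patch, where a null value deletes the key. The helper names are hypothetical, and the script only builds and prints the kubectl command rather than running it.

```python
import json
import shlex

# Annotation set on a PVC for delayed (WaitForFirstConsumer) binding.
# Assumed key; check your PVC's metadata before relying on it.
SELECTED_NODE_ANNOTATION = "volume.kubernetes.io/selected-node"


def selected_node_removal_patch() -> str:
    """Return a JSON merge patch that deletes the selected-node annotation."""
    # In a JSON merge patch, setting a key to null removes it.
    patch = {"metadata": {"annotations": {SELECTED_NODE_ANNOTATION: None}}}
    return json.dumps(patch)


def kubectl_patch_command(namespace: str, pvc: str) -> str:
    """Build (but do not execute) the kubectl invocation as a shell string."""
    args = [
        "kubectl", "patch", "pvc", pvc,
        "-n", namespace,
        "--type", "merge",
        "-p", selected_node_removal_patch(),
    ]
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # Namespace and claim name taken from the Provisioning event above.
    print(kubectl_patch_command(
        "cass-ukv-xpq-multi-step-execution",
        "server-data-ukv-xpq-multi-step-execution-dc1-1c19-ra-sts-0",
    ))
```

Once the annotation is gone, the PVC goes back to waiting for a first consumer, so the pod can be scheduled onto a node that still has the driver registered.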
What you expected to happen:
Once the CSINode and Node objects have been deleted from the cluster, I expect the CSI provisioner to recover by removing the selected-node annotation itself, as it did before this change.
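To make the expectation concrete, here is a minimal sketch of the decision I have in mind. It is not the external-provisioner's code; the function and parameter names are hypothetical, and it only encodes the two cases from the timeline above: keep retrying while the objects exist but the topology is missing, and give up on the selected node once the Node or CSINode is gone.

```python
def should_clear_selected_node(node_exists: bool,
                               csinode_exists: bool,
                               driver_registered: bool) -> bool:
    """Return True when provisioning on the selected node can no longer succeed.

    Hypothetical helper: the inputs stand for the results of looking up the
    Node object, the CSINode object, and the driver entry in the CSINode.
    """
    # Node or CSINode deleted: the topology cannot reappear, so the
    # selected-node annotation should be removed to trigger rescheduling.
    if not node_exists or not csinode_exists:
        return True
    # Objects exist but the driver is not registered (or it is registered):
    # the situation may still resolve, so retrying is the expected behaviour.
    return False


if __name__ == "__main__":
    # Topology missing, objects still present: keep retrying.
    print(should_clear_selected_node(True, True, False))
    # Node and CSINode deleted: remove the annotation and reschedule.
    print(should_clear_selected_node(False, False, False))
```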
How to reproduce it:
I don't have an exact reproducer, but I think the following steps should reproduce it:
- Force a long-running volume provisioning (for instance by removing the CSI node pod so that the kubelet deregisters the driver and there are no more topology keys)
- Schedule a pod with a PVC onto the node with the missing topology
- Delete the Node and CSINode objects
- The PVC should now be stuck until something removes the selected-node annotation
Anything else we need to know?:
Environment:
- Driver version: aws-ebs-csi v1.52.0 with csi-provisioner v6.0.0
- Kubernetes version (use kubectl version): v1.33
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others: