Fix webhook denial when ancestor owner is not found during GC teardown #10009
iaalm wants to merge 1 commit into kubernetes-sigs:main from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: iaalm. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.

Welcome @iaalm!

Hi @iaalm. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.
/ok-to-test
From PR description:
I'm not sure this is accurate. I think another possibility is that the informer in Kueue's webhook does not yet know the parent has been created. This may happen because there is no guarantee on the order of notifications about objects of different Kinds. So, the chain might still be in the process of being created, and we would incorrectly say "don't suspend", letting the Pod bypass Kueue's quota checks. Let me know if I'm missing something, but I'm thinking we could skip checking the parent if we already know we are managed by Kueue, for example by the presence of a Kueue-specific annotation like `kueue.x-k8s.io/pod-suspending-parent`.
That makes sense.
In my case, this LWS is not managed by Kueue, which is why #8862 alone doesn't work for me. I'm thinking maybe I can use the DeletionTimestamp check instead. Let me try to refine my fix.
Interesting, tell me more to better understand the setup, so I can advise on the fix. So, you use LWS which is not managed by Kueue, yet its Pods are managed by Kueue? Or do you have another custom CRD managing the Pods, or maybe Pods managed by a controller outside of Kubernetes entirely?
Actually we're doing some migration and have just started to use Kueue. Currently we use the KAI scheduler and Kueue at the same time. Some jobs are managed by Kueue and some are not (in the same namespace 😂). It's kind of a mess, but it's a situation I have to face. The LWS instances having this issue are created by some Helm charts and do not have the queue-name label on their LWS, StatefulSets, or Pods, so I thought Kueue should not affect them without that label.
I see, so the Pods live in a namespace managed by Kueue, yet they are not managed by Kueue because there is no "queue-name" label. As a result, Kueue repeatedly tries to walk the ownership chain to determine whether they are managed by Kueue or not. I will think about how to support this, but in the meanwhile maybe @mbobrovskyi, who worked on the fixes for LWS and STS when managed by Kueue, has some good ideas.
What if you use the `kueue.x-k8s.io/pod-suspending-parent` annotation? |
This is a valid approach and would work if LWS sets the annotation. However, the root cause of this issue is precisely that LWS-created StatefulSets never receive kueue.x-k8s.io/pod-suspending-parent: they're not created through Kueue's webhook path, so no annotation is injected. That's also why #8862's fix didn't cover this case. The DeletionTimestamp check is a Kueue-side fix that works without requiring changes in LWS or any other external framework. It's also semantically precise: a missing owner is only silently ignored when the object itself is already being deleted (i.e., genuinely in GC teardown), which avoids the cache-lag concern you raised earlier. That said, if LWS can be updated to set the annotation, that would also be a correct fix, and these two approaches are complementary, not mutually exclusive.
Yeah, but as a quick fix did you consider having a webhook which injects the `kueue.x-k8s.io/pod-suspending-parent` annotation?
Indeed, there is only one potential race: maybe the Kueue informer already knows about the DeletionTimestamp, so it doesn't "suspend", while the other controller, say the k8s Job controller, hasn't yet seen the event about the DeletionTimestamp, so it may still start creating Pods. I'm wondering how we could close this gap.
Totally agree, we should improve Kueue's handling of such cases without the need to explicitly opt out. I'm also thinking about a mitigation you could apply even now.
Will try to fix the tests tomorrow, if you agree with the solution.
I think this unfortunately remains a fundamental flaw in this approach: "Indeed, there is only one potential race: maybe the Kueue informer already knows about the DeletionTimestamp, so it doesn't 'suspend', while the other controller, say the k8s Job controller, hasn't yet seen the event about the DeletionTimestamp, so it may still start creating Pods." Yes, this is unlikely, but when it happens it could degrade the user experience for some users with a "regular setup" whose Job is immediately deleted for some reason. I'm not sure how to solve it properly yet.
When foreground and background deletion are mixed in the same ownership chain (e.g. LWS → Parent STS → Leader Pod → Child STS → Worker Pods), child objects can get permanently stuck in Terminating state. The GC sets foregroundDeletion finalizers on children during foreground deletion, but if a parent is concurrently removed via background deletion (e.g. helm uninstall), the GC's PATCH to remove the finalizer from a child is denied by Kueue's mutating webhook.

The denial happens because the webhook calls WorkloadShouldBeSuspended → FindAncestorJobManagedByKueue, which walks the ownerReference chain. When it tries to fetch the deleted parent, c.Get returns NotFound, which previously propagated as ErrWorkloadOwnerNotFound and caused the webhook to reject the PATCH.

Fix at two layers:

1. Webhook early-exit: in each webhook's Default() and ValidateUpdate(), skip the WorkloadShouldBeSuspended call entirely when the object being admitted already has a DeletionTimestamp. An object in Terminating state should not be re-evaluated for suspension or have new Kueue annotations applied.
2. Safety net in FindAncestorJobManagedByKueue: if the parent lookup returns NotFound AND the object being processed itself has a DeletionTimestamp, treat this as normal GC teardown rather than an error. This covers call sites that do not go through the webhook early-exit (e.g. the reconciler path).

The DeletionTimestamp of the object being processed, not the parent's, is the discriminator. This avoids the informer-lag race where a creating controller hasn't yet seen a parent's DeletionTimestamp and still submits new children: those new children have no DeletionTimestamp themselves, so neither guard fires and ErrWorkloadOwnerNotFound is still returned, causing the admission to be retried until the cache catches up.
I think this concern may not apply to the current implementation, because both guards check the DeletionTimestamp of the admitted object itself, not the parent's.

A newly created Pod (from a controller that hasn't yet seen the parent's deletion event) has DeletionTimestamp = nil. Neither guard fires for it; it goes through the normal webhook path. If the parent is already gone (NotFound) and the new Pod has no DeletionTimestamp, FindAncestorJobManagedByKueue still returns ErrWorkloadOwnerNotFound, the webhook rejects the admission, and the controller retries until its cache catches up.

The guards only fire when the admitted object itself is already in Terminating state (DeletionTimestamp set by the API server upon DELETE). That can only happen via an explicit delete, not because a creating controller hasn't yet seen a parent's deletion. The two cases are mutually exclusive: an object is either being newly created (DeletionTimestamp = nil) or already in teardown (DeletionTimestamp ≠ nil).

Is there a specific scenario you have in mind where a newly created object would reach our guards? Happy to trace through it.
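The mutual exclusivity argued above can be made explicit as a small truth table over (object terminating?, owner found in cache?). A hypothetical simplification; the `outcome` function and its labels are illustrative, not Kueue code:

```go
package main

import "fmt"

// outcome enumerates what happens for each combination of
// (object terminating?, owner found in cache?), mirroring the
// two guards described in the discussion above.
func outcome(terminating, ownerFound bool) string {
	switch {
	case ownerFound:
		return "normal path: evaluate suspension"
	case terminating:
		return "GC teardown: allow (guards fire)"
	default:
		return "cache lag: reject, controller retries"
	}
}

func main() {
	for _, terminating := range []bool{false, true} {
		for _, ownerFound := range []bool{false, true} {
			fmt.Printf("terminating=%-5v ownerFound=%-5v -> %s\n",
				terminating, ownerFound, outcome(terminating, ownerFound))
		}
	}
}
```

Note that the race case raised earlier (new child, parent missing from the cache) lands in the "cache lag" row, where admission is still rejected and retried.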
What type of PR is this?
/kind bug
/area integrations
What this PR does / why we need it:
When the Garbage Collector PATCHes a child object to remove the foregroundDeletion finalizer, Kueue's mutating webhooks invoke WorkloadShouldBeSuspended → FindAncestorJobManagedByKueue to walk the ownerReference chain. If any owner in that chain has already been deleted (e.g. via background/helm uninstall while a foreground deletion is in progress on a child), c.Get returns NotFound and the function returns ErrWorkloadOwnerNotFound. This error propagates up and causes the webhook to deny the PATCH, permanently blocking the GC from removing the finalizer and leaving objects stuck in Terminating state.
This is a generalisation of the problem partially addressed in #8862. That fix added a managedByAnotherFramework() guard based on the kueue.x-k8s.io/pod-suspending-parent annotation, but objects created directly by controllers (e.g. StatefulSets created by the LWS controller without going through Kueue's webhook path) never receive this annotation, so the guard does not help them.
Fix: treat ErrWorkloadOwnerNotFound as "no Kueue-managed ancestor" (return false, nil) inside WorkloadShouldBeSuspended. A missing owner during a webhook admission call means the ownership chain is being garbage collected — there is nothing to suspend, and the operation should be allowed.
The fix is applied at two call sites:
The main reconciler's direct call to FindAncestorJobManagedByKueue in reconciler.go is intentionally left unchanged: requeuing on ErrWorkloadOwnerNotFound there is correct, since during normal operation a parent may transiently be absent due to creation ordering.
Which issue(s) this PR fixes:
Fixes # n/a
Special notes for your reviewer:
The scenario that triggers this bug requires a mix of foreground and background deletion in the same ownership chain. For example, in LWS → Parent STS → Leader Pod → Child STS → Worker Pods, a pod restart causes foreground deletion of the leader Pod, cascading foregroundDeletion finalizers down to the child STS, while helm uninstall concurrently removes the parent STS via background deletion. With a purely foreground or purely background chain the deadlock does not occur.
Reproducible evidence from a live cluster:
Does this PR introduce a user-facing change?
no