Fix webhook denial when ancestor owner is not found during GC teardown#10009

Open
iaalm wants to merge 1 commit intokubernetes-sigs:mainfrom
iaalm:main

Conversation


@iaalm iaalm commented Mar 19, 2026

What type of PR is this?

/kind bug
/area integrations

What this PR does / why we need it:

When the Garbage Collector PATCHes a child object to remove the foregroundDeletion finalizer, Kueue's mutating webhooks invoke WorkloadShouldBeSuspended → FindAncestorJobManagedByKueue to walk the ownerReference chain. If any owner in that chain has already been deleted (e.g. via background/helm uninstall while a foreground deletion is in progress on a child), c.Get returns NotFound and the function returns ErrWorkloadOwnerNotFound. This error propagates up and causes the webhook to deny the PATCH, permanently blocking the GC from removing the finalizer and leaving objects stuck in Terminating state.

This is a generalisation of the problem partially addressed in #8862. That fix added a managedByAnotherFramework() guard based on the kueue.x-k8s.io/pod-suspending-parent annotation, but objects created directly by controllers (e.g. StatefulSets created by the LWS controller without going through Kueue's webhook path) never receive this annotation, so the guard does not help them.

Fix: treat ErrWorkloadOwnerNotFound as "no Kueue-managed ancestor" (return false, nil) inside WorkloadShouldBeSuspended. A missing owner during a webhook admission call means the ownership chain is being garbage collected — there is nothing to suspend, and the operation should be allowed.

The fix is applied at two call sites:

  • WorkloadShouldBeSuspended in pkg/controller/jobframework/defaults.go — covers all webhook Default() / ValidateUpdate() paths, plus the StatefulSet and LeaderWorkerSet reconcilers' use of this function.
  • The pod webhook's direct call to FindAncestorJobManagedByKueue in pkg/controller/jobs/pod/pod_webhook.go.

The main reconciler's direct call to FindAncestorJobManagedByKueue in reconciler.go is intentionally left unchanged: requeuing on ErrWorkloadOwnerNotFound there is correct, since during normal operation a parent may transiently be absent due to creation ordering.
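A minimal sketch of the proposed handling follows. The names `errWorkloadOwnerNotFound`, `findAncestorManagedByKueue`, and `workloadShouldBeSuspended` are simplified stand-ins for the real Kueue functions, and a boolean replaces the controller-runtime client lookup:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-in for Kueue's ErrWorkloadOwnerNotFound sentinel.
var errWorkloadOwnerNotFound = errors.New("workload owner not found")

// findAncestorManagedByKueue stands in for walking the ownerReference chain
// with a cached client; parentExists models whether c.Get finds the owner.
func findAncestorManagedByKueue(parentExists bool) (bool, error) {
	if !parentExists {
		return false, errWorkloadOwnerNotFound
	}
	return true, nil
}

// workloadShouldBeSuspended applies the proposed fix: a missing owner is
// treated as "no Kueue-managed ancestor" instead of an error that would
// make the webhook deny the GC's PATCH.
func workloadShouldBeSuspended(parentExists bool) (bool, error) {
	managed, err := findAncestorManagedByKueue(parentExists)
	if errors.Is(err, errWorkloadOwnerNotFound) {
		// The ownership chain is being garbage collected: nothing to suspend.
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return managed, nil
}

func main() {
	suspend, err := workloadShouldBeSuspended(false) // owner already deleted
	fmt.Println(suspend, err)
}
```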

Which issue(s) this PR fixes:

Fixes # n/a

Special notes for your reviewer:

The scenario that triggers this bug requires a mix of foreground and background deletion in the same ownership chain, e.g. LWS → Parent STS → Leader Pod → Child STS → Worker Pods, where a pod restart causes foreground deletion of the leader pod, cascading foregroundDeletion finalizers down to the child STS, while a concurrent helm uninstall removes the parent STS via background deletion. With a purely foreground or purely background chain the deadlock does not occur.

Reproducible evidence from a live cluster:

$ kubectl patch sts <name> -n <ns> --type=merge -p '{"metadata":{"finalizers":null}}' --dry-run=server
Error: admission webhook "mstatefulset.kb.io" denied the request: workload owner not found

Does this PR introduce a user-facing change?

no

None

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. area/integrations Workload integrations labels Mar 19, 2026

netlify bot commented Mar 19, 2026

Deploy Preview for kubernetes-sigs-kueue canceled.

🔨 Latest commit: 22fe5d4
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-kueue/deploys/69bd46242e8c2c0008cb69aa


linux-foundation-easycla bot commented Mar 19, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: iaalm / name: Kaisheng Xu (22fe5d4)

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: iaalm
Once this PR has been reviewed and has the lgtm label, please assign mimowo, pbundyra for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Welcome @iaalm!

It looks like this is your first PR to kubernetes-sigs/kueue 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kueue has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot

Hi @iaalm. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 19, 2026
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 19, 2026
@iaalm iaalm marked this pull request as ready for review March 19, 2026 13:16
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2026
@k8s-ci-robot k8s-ci-robot requested a review from olekzabl March 19, 2026 13:16
@iaalm iaalm closed this Mar 19, 2026
@iaalm iaalm reopened this Mar 19, 2026

mimowo commented Mar 19, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 19, 2026

mimowo commented Mar 19, 2026

From PR description:

A missing owner during a webhook admission call means the ownership chain is being garbage collected

I'm not sure this is accurate. I think another possibility is that the informer in Kueue's webhook does not yet know the parent has been created. This can happen because there is no guarantee on the order of notifications about objects of different Kinds. So it might be that the chain is still being created, and we would incorrectly say "don't suspend", letting the Pod bypass Kueue's quota checks.

Let me know if I'm missing something, but I'm thinking we could skip checking the parent if we already know we are managed by Kueue, based on the presence of a Kueue-specific annotation like suspended-by-parent.


iaalm commented Mar 19, 2026

I think another possibility is that the informer in the Kueue's webhook does not yet know the parent is already created.

That makes sense.

but I'm thinking we could skip checking the parent if we already know we are managed by Kueue - by the presence of some Kueue specific annotations like suspended-by-parent

In my case, this LWS is not managed by Kueue, which is why the fix in #8862 alone doesn't work for me. I'm thinking maybe I can use DeletionTimestamp to distinguish the two cases. The GC only patches a child object to remove foregroundDeletion when the object is already in Terminating state, so its DeletionTimestamp is always non-zero. This should never happen during creation.

Let me try to refine my fix.


mimowo commented Mar 19, 2026

In my case, this LWS is not managed by kueue,

Interesting, tell me more so I can better understand the setup and advise on a fix. So you use an LWS which is not managed by Kueue, yet its Pods are managed by Kueue? Or do you have another custom CRD managing the Pods, or maybe Pods managed by a controller outside of K8s entirely?


iaalm commented Mar 19, 2026

Actually we're in the middle of a migration and have just started to use Kueue. Currently we use the KAI scheduler and Kueue at the same time. Some jobs are managed by Kueue and some are not (in the same namespace 😂). It's kind of a mess, but it's a situation I have to face. The LWS objects hitting this issue are created by a Helm chart and do not have the queue-name label on their LWS, STS, or Pods; I assumed Kueue would not affect them without that label.


mimowo commented Mar 19, 2026

The LWS objects hitting this issue are created by a Helm chart and do not have the queue-name label on their LWS, STS, or Pods,

I see, so the Pods live in a namespace managed by Kueue, yet they are not managed by Kueue because there is no "queue-name" label. As a result, Kueue repeatedly walks the ownership chain to determine whether they are managed by Kueue or not. I will think about how to support this; in the meantime, maybe @mbobrovskyi, who worked on the fixes for LWS and STS when managed by Kueue, has some good ideas.


mimowo commented Mar 19, 2026

What if you use managedByAnotherFramework and set kueue.x-k8s.io/suspended-by-parent: custom-framework on such Pods? I think Kueue should then skip processing them, and skip the ownership traversal. Have you tried that? IIRC one of the teams we collaborate with does something like this.


iaalm commented Mar 19, 2026

What if you use managedByAnotherFramework and set on such Pods kueue.x-k8s.io/suspended-by-parent: custom-framework.

This is a valid approach and would work if LWS sets the annotation. However, the root cause of this issue is precisely that LWS-created StatefulSets never receive kueue.x-k8s.io/pod-suspending-parent — they're not created through Kueue's webhook path, so no annotation is injected. That's also why #8862's fix didn't cover this case.

The DeletionTimestamp check is a Kueue-side fix that works without requiring changes in LWS or any other external framework. It's also semantically precise: a missing owner is only silently ignored when the object itself is already being deleted (i.e., genuinely in GC teardown), which avoids the cache-lag concern you raised earlier.

That said, if LWS can be updated to set the annotation, that would also be a correct fix — and these two approaches are complementary, not mutually exclusive.


mimowo commented Mar 19, 2026

This is a valid approach and would work if LWS sets the annotation. However, the root cause of this issue is precisely that LWS-created StatefulSets never receive kueue.x-k8s.io/pod-suspending-parent — they're not created through Kueue's webhook path, so no annotation is injected. That's also why #8862 fix didn't cover this case.

Yeah, but as a quick fix, did you consider having a webhook which injects the kueue.x-k8s.io/pod-suspending-parent annotation into the PodTemplate of the LWS or StatefulSet on creation, for those LWS or StatefulSet objects which are not managed by Kueue?

The DeletionTimestamp check is a Kueue-side fix that works without requiring changes in LWS or any other external framework. It's also semantically precise: a missing owner is only silently ignored when the object itself is already being deleted (i.e., genuinely in GC teardown), which avoids the cache-lag concern you raised earlier.

Indeed, there is only one potential race: maybe Kueue's informer already knows about the DeletionTimestamp, so it doesn't "suspend", while the other controller, say the k8s Job controller, hasn't yet seen the DeletionTimestamp event and may still create new Pods. I'm wondering how we could close this gap.

That said, if LWS can be updated to set the annotation, that would also be a correct fix — and these two approaches are complementary, not mutually exclusive.

Totally agree, we should improve Kueue's handling of such cases without the need to explicitly opt-out. I'm thinking about mitigation you could apply even already.


iaalm commented Mar 19, 2026

Will try to fix the tests tomorrow, if you agree with the solution.


mimowo commented Mar 20, 2026

Will try to fix the tests tomorrow, if you agree with the solution

I think this unfortunately remains a fundamental flaw in this approach: "Indeed, there is only one potential race, that maybe Kueue's informer already knows about the DeletionTimestamp - so doesn't "suspend", while the other controller, say the k8s Job controller, hasn't yet seen the event about the DeletionTimestamp, so it may still create Pods."

Yes, this is unlikely, but when it happens it could degrade the user experience for some users with a "regular setup" where their Job is immediately deleted for some reason.

I'm not sure how to solve it properly yet.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 20, 2026
When foreground and background deletion are mixed in the same ownership
chain (e.g. LWS → Parent STS → Leader Pod → Child STS → Worker Pods),
child objects can get permanently stuck in Terminating state. The GC
sets foregroundDeletion finalizers on children during foreground
deletion, but if a parent is concurrently removed via background
deletion (e.g. helm uninstall), the GC's PATCH to remove the finalizer
from a child is denied by Kueue's mutating webhook.

The denial happens because the webhook calls WorkloadShouldBeSuspended
→ FindAncestorJobManagedByKueue, which walks the ownerReference chain.
When it tries to fetch the deleted parent, c.Get returns NotFound,
which previously propagated as ErrWorkloadOwnerNotFound and caused the
webhook to reject the PATCH.

Fix at two layers:

1. Webhook early-exit: in each webhook's Default() and ValidateUpdate(),
   skip the WorkloadShouldBeSuspended call entirely when the object
   being admitted already has a DeletionTimestamp. An object in
   Terminating state should not be re-evaluated for suspension or have
   new Kueue annotations applied.

2. Safety net in FindAncestorJobManagedByKueue: if the parent lookup
   returns NotFound AND the object being processed itself has a
   DeletionTimestamp, treat this as normal GC teardown rather than an
   error. This covers call sites that do not go through the webhook
   early-exit (e.g. the reconciler path).

The DeletionTimestamp of the object being processed — not the parent's
— is the discriminator. This avoids the informer-lag race where a
creating controller hasn't yet seen a parent's DeletionTimestamp and
still submits new children: those new children have no DeletionTimestamp
themselves, so neither guard fires and ErrWorkloadOwnerNotFound is still
returned, causing the admission to be retried until the cache catches up.

iaalm commented Mar 20, 2026

Indeed, there is only one potential race: maybe Kueue's informer already knows about the DeletionTimestamp, so it doesn't "suspend", while the other controller, say the k8s Job controller, hasn't yet seen the DeletionTimestamp event and may still create Pods.

I think this concern may not apply to the current implementation, because both guards check the DeletionTimestamp of the admitted object itself, not the parent's.

A newly created Pod (from a controller that hasn't yet seen the parent's deletion event) has DeletionTimestamp = nil. Neither guard fires for it — it goes through the normal webhook path. If the parent is already gone (NotFound) and the new Pod has no DeletionTimestamp, FindAncestorJobManagedByKueue still returns ErrWorkloadOwnerNotFound, the webhook rejects the admission, and the controller retries until its cache catches up.

The guards only fire when the admitted object itself is already in Terminating state (DeletionTimestamp set by the API server upon DELETE). That can only happen via an explicit delete, not because a creating controller hasn't yet seen a parent's deletion. The two cases are mutually exclusive: an object is either being newly created (DeletionTimestamp = nil) or already in teardown (DeletionTimestamp ≠ nil).

Is there a specific scenario you have in mind where a newly created object would reach our guards? Happy to trace through it.
