Fix webhook denial when ancestor owner is not found during GC teardown #10009
iaalm wants to merge 1 commit into kubernetes-sigs:main from
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: iaalm. Needs approval from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.

Welcome @iaalm!

Hi @iaalm. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.
/ok-to-test
From PR description:
I'm not sure this is accurate. I think another possibility is that the informer in Kueue's webhook does not yet know the parent has been created. This may happen because there is no guarantee on the order of notifications about objects of different Kinds. So, the chain might still be in the process of being created, and we would incorrectly say "don't suspend", letting the Pod bypass Kueue's quota checks. Let me know if I'm missing something, but I'm thinking we could skip checking the parent if we already know we are managed by Kueue, for example by the presence of a Kueue-specific annotation like `kueue.x-k8s.io/pod-suspending-parent`.
That makes sense.
In my case, this LWS is not managed by Kueue, which is why #8862 alone doesn't work for me. I'm thinking maybe I can use the DeletionTimestamp check instead. Let me try to refine my fix.
Interesting, tell me more to better understand the setup, so I can advise on the fix. So, you use LWS which is not managed by Kueue, yet its Pods are managed by Kueue? Or do you have another custom CRD managing the Pods, or maybe Pods managed by a controller outside of Kubernetes entirely?
Actually we're doing some migration and have just started to use Kueue. Currently we use the KAI scheduler and Kueue at the same time. Some jobs are managed by Kueue and some are not (in the same namespace 😂). It's kind of a mess, but it's a situation I have to face. The LWS instances having this issue are created by some Helm charts and do not have the queue-name label on their LWS, StatefulSets, or Pods, so I thought Kueue should not affect them without that label.
I see, so the Pods live in a namespace managed by Kueue, yet they are not managed by Kueue because there is no "queue-name" label. As a result, Kueue repeatedly tries to walk the ownership chain to determine whether they are managed by Kueue or not. I will think about how to support this, but in the meanwhile maybe @mbobrovskyi, who worked on the fixes for LWS and STS when managed by Kueue, has some good ideas.
What if you use the `kueue.x-k8s.io/pod-suspending-parent` annotation? |
This is a valid approach and would work if LWS sets the annotation. However, the root cause of this issue is precisely that LWS-created StatefulSets never receive kueue.x-k8s.io/pod-suspending-parent: they're not created through Kueue's webhook path, so no annotation is injected. That's also why #8862's fix didn't cover this case. The DeletionTimestamp check is a Kueue-side fix that works without requiring changes in LWS or any other external framework. It's also semantically precise: a missing owner is only silently ignored when the object itself is already being deleted (i.e., genuinely in GC teardown), which avoids the cache-lag concern you raised earlier. That said, if LWS can be updated to set the annotation, that would also be a correct fix, and these two approaches are complementary, not mutually exclusive.
Yeah, but as a quick fix did you consider having a webhook which injects the `kueue.x-k8s.io/pod-suspending-parent` annotation?
Indeed, there is only one potential race: maybe the Kueue informer already knows about the DeletionTimestamp, so it doesn't "suspend", while the other controller, say the k8s Job controller, hasn't yet seen the event about the DeletionTimestamp, so it may still start creating Pods. I'm wondering how we could close this gap.
Totally agree, we should improve Kueue's handling of such cases without the need to explicitly opt out. I'm also thinking about a mitigation you could apply even now.
Will try to fix the tests tomorrow, if you agree with the solution.
I think this unfortunately remains a fundamental flaw in this approach: "Indeed, there is only one potential race: maybe the Kueue informer already knows about the DeletionTimestamp, so it doesn't 'suspend', while the other controller, say the k8s Job controller, hasn't yet seen the event about the DeletionTimestamp, so it may still start creating Pods." Yes, this is unlikely, but when it happens it could degrade the user experience for some users with a "regular setup" whose Job is immediately deleted for some reason. I'm not sure how to solve it properly yet.
When foreground and background deletion are mixed in the same ownership chain (e.g. LWS → Parent STS → Leader Pod → Child STS → Worker Pods), child objects can get permanently stuck in Terminating state. The GC sets foregroundDeletion finalizers on children during foreground deletion, but if a parent is concurrently removed via background deletion (e.g. helm uninstall), the GC's PATCH to remove the finalizer from a child is denied by Kueue's mutating webhook.

The denial happens because the webhook calls WorkloadShouldBeSuspended → FindAncestorJobManagedByKueue, which walks the ownerReference chain. When it tries to fetch the deleted parent, c.Get returns NotFound, which previously propagated as ErrWorkloadOwnerNotFound and caused the webhook to reject the PATCH.

Fix at two layers:

1. Webhook early-exit: in each webhook's Default() and ValidateUpdate(), skip the WorkloadShouldBeSuspended call entirely when the object being admitted already has a DeletionTimestamp. An object in Terminating state should not be re-evaluated for suspension or have new Kueue annotations applied.
2. Safety net in FindAncestorJobManagedByKueue: if the parent lookup returns NotFound AND the object being processed itself has a DeletionTimestamp, treat this as normal GC teardown rather than an error. This covers call sites that do not go through the webhook early-exit (e.g. the reconciler path).

The DeletionTimestamp of the object being processed, not the parent's, is the discriminator. This avoids the informer-lag race where a creating controller hasn't yet seen a parent's DeletionTimestamp and still submits new children: those new children have no DeletionTimestamp themselves, so neither guard fires and ErrWorkloadOwnerNotFound is still returned, causing the admission to be retried until the cache catches up.
I think this concern may not apply to the current implementation, because both guards check the DeletionTimestamp of the admitted object itself, not the parent's.

A newly created Pod (from a controller that hasn't yet seen the parent's deletion event) has DeletionTimestamp = nil. Neither guard fires for it; it goes through the normal webhook path. If the parent is already gone (NotFound) and the new Pod has no DeletionTimestamp, FindAncestorJobManagedByKueue still returns ErrWorkloadOwnerNotFound, the webhook rejects the admission, and the controller retries until its cache catches up.

The guards only fire when the admitted object itself is already in Terminating state (DeletionTimestamp set by the API server upon DELETE). That can only happen via an explicit delete, not because a creating controller hasn't yet seen a parent's deletion. The two cases are mutually exclusive: an object is either being newly created (DeletionTimestamp = nil) or already in teardown (DeletionTimestamp ≠ nil).

Is there a specific scenario you have in mind where a newly created object would reach our guards? Happy to trace through it.
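The mutual exclusivity argued above can be made explicit as a small truth table over (object terminating?, owner found in cache?). A hypothetical simplification; the `outcome` function and its labels are illustrative, not Kueue code:

```go
package main

import "fmt"

// outcome enumerates what happens for each combination of
// (object terminating?, owner found in cache?), mirroring the
// two guards described in the discussion above.
func outcome(terminating, ownerFound bool) string {
	switch {
	case ownerFound:
		return "normal path: evaluate suspension"
	case terminating:
		return "GC teardown: allow (guards fire)"
	default:
		return "cache lag: reject, controller retries"
	}
}

func main() {
	for _, terminating := range []bool{false, true} {
		for _, ownerFound := range []bool{false, true} {
			fmt.Printf("terminating=%-5v ownerFound=%-5v -> %s\n",
				terminating, ownerFound, outcome(terminating, ownerFound))
		}
	}
}
```

Note that the race case raised earlier (new child, parent missing from the cache) lands in the "cache lag" row, where admission is still rejected and retried.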
What type of PR is this?
/kind bug
/area integrations
What this PR does / why we need it:
When the Garbage Collector PATCHes a child object to remove the foregroundDeletion finalizer, Kueue's mutating webhooks invoke WorkloadShouldBeSuspended → FindAncestorJobManagedByKueue to walk the ownerReference chain. If any owner in that chain has already been deleted (e.g. via background/helm uninstall while a foreground deletion is in progress on a child), c.Get returns NotFound and the function returns ErrWorkloadOwnerNotFound. This error propagates up and causes the webhook to deny the PATCH, permanently blocking the GC from removing the finalizer and leaving objects stuck in Terminating state.
This is a generalisation of the problem partially addressed in #8862. That fix added a managedByAnotherFramework() guard based on the kueue.x-k8s.io/pod-suspending-parent annotation, but objects created directly by controllers (e.g. StatefulSets created by the LWS controller without going through Kueue's webhook path) never receive this annotation, so the guard does not help them.
Fix: treat ErrWorkloadOwnerNotFound as "no Kueue-managed ancestor" (return false, nil) inside WorkloadShouldBeSuspended. A missing owner during a webhook admission call means the ownership chain is being garbage collected — there is nothing to suspend, and the operation should be allowed.
The fix is applied at two call sites:
The main reconciler's direct call to FindAncestorJobManagedByKueue in reconciler.go is intentionally left unchanged: requeuing on ErrWorkloadOwnerNotFound there is correct, since during normal operation a parent may transiently be absent due to creation ordering.
Which issue(s) this PR fixes:
Fixes # n/a
Special notes for your reviewer:
The scenario that triggers this bug requires a mix of foreground and background deletion in the same ownership chain. For example, in LWS → Parent STS → Leader Pod → Child STS → Worker Pods, a pod restart causes foreground deletion of the leader Pod, cascading foregroundDeletion finalizers down to the child STS, while helm uninstall concurrently removes the parent STS via background deletion. With a purely foreground or purely background chain the deadlock does not occur.
Reproducible evidence from a live cluster:
Does this PR introduce a user-facing change?
no