
Conversation

@Aman-Cool

Fix: Graceful Eviction Race Condition

This change fixes a race condition between the graceful eviction controller and the status controller that could cause Work objects to be deleted too early during failover.

Previously, graceful eviction relied on AggregatedStatus without ensuring it had been updated for the current binding generation. Under certain controller execution orders, this could result in eviction tasks being removed before the new target cluster was ready, causing temporary workload unavailability.

The fix ensures eviction decisions are only made when the status is confirmed fresh for the current scheduling decision, restoring safe and deterministic failover behavior.
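
To make the gate concrete, here is a minimal sketch of the idea using simplified stand-in types, not Karmada's actual API. The names (`isStatusUpToDate`, the freshness gate) follow this PR's description, but the freshness criterion shown here — every scheduled cluster has a reported status entry — is one plausible check; the merged code may compare observed generations instead:

```go
package eviction

// Simplified stand-in types; Karmada's real ResourceBinding is richer.
type targetCluster struct {
	Name string
}

type aggregatedStatusItem struct {
	ClusterName string
}

type binding struct {
	Clusters         []targetCluster        // current scheduling decision
	AggregatedStatus []aggregatedStatusItem // last collected per-cluster status
}

// isStatusUpToDate reports whether the collected status covers every cluster
// in the current scheduling decision. If any target cluster has no entry,
// the status predates the latest reschedule and must not drive eviction.
func isStatusUpToDate(b *binding) bool {
	reported := make(map[string]bool, len(b.AggregatedStatus))
	for _, item := range b.AggregatedStatus {
		reported[item.ClusterName] = true
	}
	for _, c := range b.Clusters {
		if !reported[c.Name] {
			return false
		}
	}
	return true
}

// processEvictionTasks gates task removal on fresh status: when the status
// is stale, the controller requeues instead of deleting Works prematurely.
func processEvictionTasks(b *binding) (requeue bool) {
	if !isStatusUpToDate(b) {
		return true // wait for the status controller to catch up
	}
	// ...safe to evaluate and remove graceful-eviction tasks here...
	return false
}
```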

@gemini-code-assist

Summary of Changes

Hello @Aman-Cool, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical race condition within the graceful eviction controller that could result in Work objects being deleted prematurely during failover scenarios. The fix ensures that the system's aggregated status is always current and synchronized with the latest scheduling decisions before any eviction tasks are processed. This enhancement significantly improves the robustness and reliability of workload failover by preventing actions based on outdated cluster status, thereby maintaining service availability.

Highlights

  • Race Condition Fix: Addressed a race condition in the graceful eviction controller that could lead to premature deletion of Work objects during failover.
  • Status Freshness Check: Introduced a statusUpToDate flag and a new isStatusUpToDate function to ensure eviction decisions are only made when AggregatedStatus accurately reflects the current binding generation.
  • Enhanced Status Aggregation Trigger: Modified the status controller's predicate to trigger status updates not only on resource changes but also when the list of target clusters changes, preventing stale status after rescheduling.
  • New Utility Function: Added targetClustersEqual to reliably compare slices of target clusters (see the sketch after this list).
  • Comprehensive Testing: Included new test cases to validate the race condition fix and the functionality of the new status freshness checks.
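
For reference, the comparison helper might look roughly like the sketch below, written against Karmada's workv1alpha2.TargetCluster type (a Name plus a Replicas count); the actual implementation merged in this PR may differ:

```go
package status

import (
	workv1alpha2 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha2"
)

// targetClustersEqual reports whether two target-cluster lists contain the
// same clusters with the same replica counts, ignoring order. TargetCluster
// has only comparable fields, so it can be used directly as a map key.
func targetClustersEqual(a, b []workv1alpha2.TargetCluster) bool {
	if len(a) != len(b) {
		return false
	}
	counts := make(map[workv1alpha2.TargetCluster]int, len(a))
	for _, c := range a {
		counts[c]++
	}
	for _, c := range b {
		counts[c]--
		if counts[c] < 0 {
			return false
		}
	}
	return true
}
```

Ignoring order matters here: a naive element-wise comparison would treat a mere reordering of Spec.Clusters as a placement change and trigger needless status work.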

@gemini-code-assist (bot) left a comment

Code Review

This pull request effectively addresses a critical race condition in the graceful eviction process. The solution is well-designed, ensuring the status controller updates AggregatedStatus upon scheduling changes and that the eviction controller verifies status freshness before acting. This prevents premature Work deletion during failover. The code changes are clear, well-commented, and include thorough testing for the new logic. I've identified one potential issue in a new helper function and provided a suggestion for a more robust implementation.

@Aman-Cool force-pushed the fix/graceful-eviction-race-condition branch from d43acb0 to dcf4e60 on January 25, 2026 at 09:13
@Aman-Cool
Author

This PR fixes a race where graceful eviction can act on stale status during failover, causing workloads to be deleted before the new cluster is ready. The change makes eviction wait for up-to-date status, preventing accidental downtime.

@codecov-commenter

codecov-commenter commented Jan 25, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 74.19355% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.54%. Comparing base (437e37e) to head (20c106c).

Files with missing lines             Patch %   Lines
pkg/controllers/status/common.go     55.55%    8 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7129   +/-   ##
=======================================
  Coverage   46.54%   46.54%           
=======================================
  Files         700      700           
  Lines       48134    48163   +29     
=======================================
+ Hits        22403    22419   +16     
- Misses      24045    24056   +11     
- Partials     1686     1688    +2     
Flag        Coverage Δ
unittests   46.54% <74.19%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

The graceful eviction controller could make eviction decisions based on
stale AggregatedStatus, causing premature deletion of Works during cluster
failover. This fix:

- Triggers status controller when Spec.Clusters changes (see the predicate sketch below)
- Adds statusUpToDate check before allowing eviction task removal

Signed-off-by: Aman-Cool <[email protected]>
@Aman-Cool force-pushed the fix/graceful-eviction-race-condition branch from 3496e8b to 20c106c on January 25, 2026 at 11:46
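
Continuing the illustrative package from the helper sketch above (reusing its targetClustersEqual, and assuming imports of sigs.k8s.io/controller-runtime/pkg/event and sigs.k8s.io/controller-runtime/pkg/predicate), the first commit bullet might translate into a controller-runtime predicate along these lines; the exact wiring in the status controller may differ:

```go
// bindingPredicate widens the status controller's trigger: an update event
// now also fires a status refresh when the scheduled target clusters change
// (i.e. after rescheduling), not only when the resource itself changes.
func bindingPredicate() predicate.Funcs {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldObj, okOld := e.ObjectOld.(*workv1alpha2.ResourceBinding)
			newObj, okNew := e.ObjectNew.(*workv1alpha2.ResourceBinding)
			if !okOld || !okNew {
				return false
			}
			// Placement changed: rebuild AggregatedStatus so eviction never
			// reasons about the previous scheduling decision.
			if !targetClustersEqual(oldObj.Spec.Clusters, newObj.Spec.Clusters) {
				return true
			}
			// Keep the pre-existing trigger on other spec changes.
			return oldObj.Generation != newObj.Generation
		},
	}
}
```
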
@XiShanYongYe-Chang
Member

Hi @Aman-Cool, thank you for your contribution.
I'd like to know how you encountered this issue. Could you give an example to illustrate the race? I'm a bit unclear on what "premature deletion of Work" means.

@Aman-Cool
Author

Hi @XiShanYongYe-Chang , thanks for taking a look.
I encountered this during a cluster failover scenario. When a Work is re-scheduled to a new target cluster, the graceful eviction controller may run before the status controller updates AggregatedStatus for the new binding generation. In that window, the eviction logic sees the old status and concludes the Work is safe to delete, even though the new target cluster hasn’t finished applying it yet.
By “premature deletion,” I mean the Work object is removed from the old cluster before the workload becomes ready on the new cluster, causing a short period of unavailability. The race depends on controller execution order, which makes it intermittent and hard to reproduce consistently.

@Aman-Cool
Author

Test Plan

  • Deploy a Work object bound to a member cluster.
  • Trigger rescheduling by simulating member cluster failure or rebinding the Work to a different cluster.
  • Ensure the graceful eviction controller runs before the status controller updates AggregatedStatus for the new binding generation.

Expected behavior

  • Before fix: the Work may be deleted from the old cluster before the new target cluster reports ready, causing temporary workload unavailability.
  • After fix: eviction is deferred until AggregatedStatus is confirmed fresh for the current binding generation; the Work remains available until the new cluster is ready.

Verification

  • Repeatedly trigger failover and observe Work lifecycle, deletion timing, and status transitions via controller logs and object state.
  • Confirm no premature deletion occurs as the controller execution order varies (a unit-test sketch for the comparison helper follows).
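
For the unit-level piece, here is a table-driven sketch of how the helper's ordering behavior could be pinned down. It uses local stand-in types so the snippet runs alone; it is not the actual test file in this PR:

```go
package status

import "testing"

// Local stand-ins mirroring the earlier helper sketch.
type targetCluster struct {
	Name     string
	Replicas int32
}

func targetClustersEqual(a, b []targetCluster) bool {
	if len(a) != len(b) {
		return false
	}
	counts := make(map[targetCluster]int, len(a))
	for _, c := range a {
		counts[c]++
	}
	for _, c := range b {
		counts[c]--
		if counts[c] < 0 {
			return false
		}
	}
	return true
}

func TestTargetClustersEqual(t *testing.T) {
	tests := []struct {
		name string
		a, b []targetCluster
		want bool
	}{
		{"both empty", nil, nil, true},
		{"same order", []targetCluster{{"a", 1}, {"b", 2}}, []targetCluster{{"a", 1}, {"b", 2}}, true},
		{"reordered", []targetCluster{{"a", 1}, {"b", 2}}, []targetCluster{{"b", 2}, {"a", 1}}, true},
		{"replicas differ", []targetCluster{{"a", 1}}, []targetCluster{{"a", 2}}, false},
		{"length differs", []targetCluster{{"a", 1}}, nil, false},
	}
	for _, tt := range tests {
		if got := targetClustersEqual(tt.a, tt.b); got != tt.want {
			t.Errorf("%s: got %v, want %v", tt.name, got, tt.want)
		}
	}
}
```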

@XiShanYongYe-Chang
Member

Thanks @Aman-Cool, let me take a look at this.

By the way, may I ask which company you're with? Are you currently using Karmada and the ClusterFailover feature?

@Aman-Cool
Author

Thanks @XiShanYongYe-Chang for taking a look!
I’m contributing as an individual. I’m not running Karmada in production, but I’ve been evaluating the ClusterFailover flow and noticed this behavior while reasoning through controller ordering and failover scenarios.

@XiShanYongYe-Chang
Member

Thanks, I get it~

@XiShanYongYe-Chang
Member

/retest
