
Conversation

@Aman-Cool

Fix: Graceful Eviction Race Condition

This change fixes a race condition between the graceful eviction controller and the status controller that could cause Work objects to be deleted too early during failover.

Previously, graceful eviction relied on AggregatedStatus without ensuring it had been updated for the current binding generation. Under certain controller execution orders, this could result in eviction tasks being removed before the new target cluster was ready, causing temporary workload unavailability.

The fix ensures eviction decisions are only made when the status is confirmed fresh for the current scheduling decision, restoring safe and deterministic failover behavior.
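
To make the gate concrete, here is a minimal sketch of the idea using simplified stand-in types, not Karmada's actual API. The names (`isStatusUpToDate`, the freshness gate) follow this PR's description, but the freshness criterion shown here — every scheduled cluster has a reported status entry — is one plausible check; the merged code may compare observed generations instead:

```go
package eviction

// Simplified stand-in types; Karmada's real ResourceBinding is richer.
type targetCluster struct {
	Name string
}

type aggregatedStatusItem struct {
	ClusterName string
}

type binding struct {
	Clusters         []targetCluster        // current scheduling decision
	AggregatedStatus []aggregatedStatusItem // last collected per-cluster status
}

// isStatusUpToDate reports whether the collected status covers every cluster
// in the current scheduling decision. If any target cluster has no entry,
// the status predates the latest reschedule and must not drive eviction.
func isStatusUpToDate(b *binding) bool {
	reported := make(map[string]bool, len(b.AggregatedStatus))
	for _, item := range b.AggregatedStatus {
		reported[item.ClusterName] = true
	}
	for _, c := range b.Clusters {
		if !reported[c.Name] {
			return false
		}
	}
	return true
}

// processEvictionTasks gates task removal on fresh status: when the status
// is stale, the controller requeues instead of deleting Works prematurely.
func processEvictionTasks(b *binding) (requeue bool) {
	if !isStatusUpToDate(b) {
		return true // wait for the status controller to catch up
	}
	// ...safe to evaluate and remove graceful-eviction tasks here...
	return false
}
```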

@gemini-code-assist

Summary of Changes

Hello @Aman-Cool, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical race condition within the graceful eviction controller that could result in Work objects being deleted prematurely during failover scenarios. The fix ensures that the system's aggregated status is always current and synchronized with the latest scheduling decisions before any eviction tasks are processed. This enhancement significantly improves the robustness and reliability of workload failover by preventing actions based on outdated cluster status, thereby maintaining service availability.

Highlights

  • Race Condition Fix: Addressed a race condition in the graceful eviction controller that could lead to premature deletion of Work objects during failover.
  • Status Freshness Check: Introduced a statusUpToDate flag and a new isStatusUpToDate function to ensure eviction decisions are only made when AggregatedStatus accurately reflects the current binding generation.
  • Enhanced Status Aggregation Trigger: Modified the status controller's predicate to trigger status updates not only on resource changes but also when the list of target clusters changes, preventing stale status after rescheduling.
  • New Utility Function: Added targetClustersEqual to reliably compare slices of target clusters (see the sketch after this list).
  • Comprehensive Testing: Included new test cases to validate the race condition fix and the functionality of the new status freshness checks.
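
For reference, the comparison helper might look roughly like the sketch below, written against Karmada's workv1alpha2.TargetCluster type (a Name plus a Replicas count); the actual implementation merged in this PR may differ:

```go
package status

import (
	workv1alpha2 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha2"
)

// targetClustersEqual reports whether two target-cluster lists contain the
// same clusters with the same replica counts, ignoring order. TargetCluster
// has only comparable fields, so it can be used directly as a map key.
func targetClustersEqual(a, b []workv1alpha2.TargetCluster) bool {
	if len(a) != len(b) {
		return false
	}
	counts := make(map[workv1alpha2.TargetCluster]int, len(a))
	for _, c := range a {
		counts[c]++
	}
	for _, c := range b {
		counts[c]--
		if counts[c] < 0 {
			return false
		}
	}
	return true
}
```

Ignoring order matters here: a naive element-wise comparison would treat a mere reordering of Spec.Clusters as a placement change and trigger needless status work.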

@gemini-code-assist (bot) left a comment

Code Review

This pull request effectively addresses a critical race condition in the graceful eviction process. The solution is well-designed, ensuring the status controller updates AggregatedStatus upon scheduling changes and that the eviction controller verifies status freshness before acting. This prevents premature Work deletion during failover. The code changes are clear, well-commented, and include thorough testing for the new logic. I've identified one potential issue in a new helper function and provided a suggestion for a more robust implementation.

@Aman-Cool force-pushed the fix/graceful-eviction-race-condition branch from d43acb0 to dcf4e60 on January 25, 2026 at 09:13
@Aman-Cool
Author

This PR fixes a race where graceful eviction can act on stale status during failover, causing workloads to be deleted before the new cluster is ready. The change makes eviction wait for up-to-date status, preventing accidental downtime.

@codecov-commenter

codecov-commenter commented Jan 25, 2026

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 74.19355% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.54%. Comparing base (437e37e) to head (20c106c).

Files with missing lines             Patch %   Lines
pkg/controllers/status/common.go     55.55%    8 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7129   +/-   ##
=======================================
  Coverage   46.54%   46.54%           
=======================================
  Files         700      700           
  Lines       48134    48163   +29     
=======================================
+ Hits        22403    22419   +16     
- Misses      24045    24056   +11     
- Partials     1686     1688    +2     
Flag        Coverage Δ
unittests   46.54% <74.19%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

The graceful eviction controller could make eviction decisions based on
stale AggregatedStatus, causing premature deletion of Works during cluster
failover. This fix:

- Triggers status controller when Spec.Clusters changes (see the predicate sketch below)
- Adds statusUpToDate check before allowing eviction task removal

Signed-off-by: Aman-Cool <[email protected]>
@Aman-Cool force-pushed the fix/graceful-eviction-race-condition branch from 3496e8b to 20c106c on January 25, 2026 at 11:46
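
Continuing the illustrative package from the helper sketch above (reusing its targetClustersEqual, and assuming imports of sigs.k8s.io/controller-runtime/pkg/event and sigs.k8s.io/controller-runtime/pkg/predicate), the first commit bullet might translate into a controller-runtime predicate along these lines; the exact wiring in the status controller may differ:

```go
// bindingPredicate widens the status controller's trigger: an update event
// now also fires a status refresh when the scheduled target clusters change
// (i.e. after rescheduling), not only when the resource itself changes.
func bindingPredicate() predicate.Funcs {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldObj, okOld := e.ObjectOld.(*workv1alpha2.ResourceBinding)
			newObj, okNew := e.ObjectNew.(*workv1alpha2.ResourceBinding)
			if !okOld || !okNew {
				return false
			}
			// Placement changed: rebuild AggregatedStatus so eviction never
			// reasons about the previous scheduling decision.
			if !targetClustersEqual(oldObj.Spec.Clusters, newObj.Spec.Clusters) {
				return true
			}
			// Keep the pre-existing trigger on other spec changes.
			return oldObj.Generation != newObj.Generation
		},
	}
}
```
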
@XiShanYongYe-Chang
Member

Hi @Aman-Cool, thank you for your contribution.
I'd like to know how you encountered this issue. Could you give an example to illustrate the race? I'm a bit unclear on what "premature deletion of Work" means.

@Aman-Cool
Author

Hi @XiShanYongYe-Chang , thanks for taking a look.
I encountered this during a cluster failover scenario. When a Work is re-scheduled to a new target cluster, the graceful eviction controller may run before the status controller updates AggregatedStatus for the new binding generation. In that window, the eviction logic sees the old status and concludes the Work is safe to delete, even though the new target cluster hasn’t finished applying it yet.
By “premature deletion,” I mean the Work object is removed from the old cluster before the workload becomes ready on the new cluster, causing a short period of unavailability. The race depends on controller execution order, which makes it intermittent and hard to reproduce consistently.

@Aman-Cool
Author

Test Plan

  • Deploy a Work object bound to a member cluster.
  • Trigger rescheduling by simulating member cluster failure or rebinding the Work to a different cluster.
  • Ensure the graceful eviction controller runs before the status controller updates AggregatedStatus for the new binding generation.

Expected behavior

  • Before fix: the Work may be deleted from the old cluster before the new target cluster reports ready, causing temporary workload unavailability.
  • After fix: eviction is deferred until AggregatedStatus is confirmed fresh for the current binding generation; the Work remains available until the new cluster is ready.

Verification

  • Repeatedly trigger failover and observe Work lifecycle, deletion timing, and status transitions via controller logs and object state.
  • Confirm no premature deletion occurs as the controller execution order varies (a unit-test sketch for the comparison helper follows).
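
For the unit-level piece, here is a table-driven sketch of how the helper's ordering behavior could be pinned down. It uses local stand-in types so the snippet runs alone; it is not the actual test file in this PR:

```go
package status

import "testing"

// Local stand-ins mirroring the earlier helper sketch.
type targetCluster struct {
	Name     string
	Replicas int32
}

func targetClustersEqual(a, b []targetCluster) bool {
	if len(a) != len(b) {
		return false
	}
	counts := make(map[targetCluster]int, len(a))
	for _, c := range a {
		counts[c]++
	}
	for _, c := range b {
		counts[c]--
		if counts[c] < 0 {
			return false
		}
	}
	return true
}

func TestTargetClustersEqual(t *testing.T) {
	tests := []struct {
		name string
		a, b []targetCluster
		want bool
	}{
		{"both empty", nil, nil, true},
		{"same order", []targetCluster{{"a", 1}, {"b", 2}}, []targetCluster{{"a", 1}, {"b", 2}}, true},
		{"reordered", []targetCluster{{"a", 1}, {"b", 2}}, []targetCluster{{"b", 2}, {"a", 1}}, true},
		{"replicas differ", []targetCluster{{"a", 1}}, []targetCluster{{"a", 2}}, false},
		{"length differs", []targetCluster{{"a", 1}}, nil, false},
	}
	for _, tt := range tests {
		if got := targetClustersEqual(tt.a, tt.b); got != tt.want {
			t.Errorf("%s: got %v, want %v", tt.name, got, tt.want)
		}
	}
}
```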

@XiShanYongYe-Chang
Member

Thanks @Aman-Cool, let me take a look at this.

By the way, may I ask which company you're with? Are you currently using Karmada and the ClusterFailover feature?

@Aman-Cool
Author

Thanks @XiShanYongYe-Chang for taking a look!
I’m contributing as an individual. I’m not running Karmada in production, but I’ve been evaluating the ClusterFailover flow and noticed this behavior while reasoning through controller ordering and failover scenarios.

@XiShanYongYe-Chang
Member

Thanks, I get it~

@XiShanYongYe-Chang
Member

/retest
