Skip to content

Conversation

@goyalpalak18
Copy link

Problem summary

I found a crash scenario in the Application Failover controllers (RBApplicationFailoverController and CRBApplicationFailoverController) when decisionConditions.tolerationSeconds is not set.

  • TolerationSeconds is an optional *int32
  • The controllers dereference it without a nil check during failure detection
  • When omitted, this leads to a nil pointer panic and a controller crash loop
  • The CRD default of 300s is not applied for resources created programmatically or migrated from older versions

This panic happens exactly during cluster failure scenarios, which makes it particularly risky.


What I changed

1. Controller-side safety (defense in depth)

I added a nil check in both application failover controllers so tolerationSeconds is always initialized before use:

  • pkg/controllers/applicationfailover/rb_application_failover_controller.go
  • pkg/controllers/applicationfailover/crb_application_failover_controller.go

If tolerationSeconds is nil, it now defaults to 300 seconds, which prevents panics even when webhook defaulting is skipped.

2. Webhook defaulting

I added a shared helper function to apply the same default during admission:

  • New helper: SetDefaultTolerationSeconds() in pkg/util/helper/policy.go
  • Called from:
    • propagationpolicy/mutating.go
    • clusterpropagationpolicy/mutating.go

This ensures newly created policies get the expected default value.

3. Unit tests

I added unit tests for the new helper to verify:

  • Existing tolerationSeconds values are not overwritten
  • The default of 300 is applied when the field is missing
  • Other fields remain unchanged

Why this fix is complete

  • The controllers can no longer panic due to a nil tolerationSeconds
  • Defaulting is consistent with the CRD-documented value of 300s
  • The change is small, low risk, and suitable for backporting
  • The defaulting logic is covered by unit tests

This makes application failover more reliable during real cluster failure scenarios.

Add nil check for TolerationSeconds in syncBinding() to prevent
controller panic when the field is not set. Also add webhook
defaulting for TolerationSeconds (300s) to match CRD schema.

Signed-off-by: goyalpalak18 <goyalpalak1806@gmail.com>
@karmada-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign xishanyongye-chang for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot
Copy link
Contributor

Welcome @goyalpalak18! It looks like this is your first PR to karmada-io/karmada 🎉

@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 23, 2026
@gemini-code-assist
Copy link

Summary of Changes

Hello @goyalpalak18, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical nil pointer panic in the application failover controllers that occurred when "decisionConditions.tolerationSeconds" was unset, leading to controller crash loops during cluster failure scenarios. The changes introduce robust nil checks within the controllers and establish consistent defaulting via webhooks, significantly enhancing the reliability of application failover mechanisms.

Highlights

  • Controller Nil Pointer Fix: Implemented nil checks for "tolerationSeconds" in "RBApplicationFailoverController" and "CRBApplicationFailoverController" to prevent panics, defaulting to 300 seconds when the value is not explicitly set.
  • Webhook Defaulting: Introduced a new helper function "SetDefaultTolerationSeconds" in "pkg/util/helper/policy.go" and integrated it into the mutating webhooks for "PropagationPolicy" and "ClusterPropagationPolicy" to ensure consistent defaulting during admission.
  • Unit Test Coverage: Added comprehensive unit tests for the "SetDefaultTolerationSeconds" helper to validate its behavior under various conditions, including when "tolerationSeconds" is already set, not set, or nil with other fields present.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request effectively addresses a critical nil pointer panic in the application failover controllers by introducing robust nil checks for tolerationSeconds and ensuring a default value of 300 seconds is applied when it's not explicitly set. The changes also include a new helper function SetDefaultTolerationSeconds in pkg/util/helper/policy.go to consistently apply this default during admission, which is a good defense-in-depth strategy. Comprehensive unit tests have been added for the new helper, and existing webhook tests were updated to reflect the new defaulting behavior. Overall, this is a well-implemented and crucial fix that significantly improves the reliability of application failover during cluster failure scenarios.

Comment on lines +125 to +128
// If tolerationSeconds is not set, use the default value of 300 seconds
// to avoid nil pointer dereference in detectFailure.
if tolerationSeconds == nil {
tolerationSeconds = ptr.To[int32](300)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This nil check and defaulting logic is a critical fix to prevent the nil pointer dereference identified in the problem summary. It ensures the controller remains stable even when tolerationSeconds is not explicitly set, which is crucial for the reliability of the application failover mechanism.

Comment on lines +125 to +128
// If tolerationSeconds is not set, use the default value of 300 seconds
// to avoid nil pointer dereference in detectFailure.
if tolerationSeconds == nil {
tolerationSeconds = ptr.To[int32](300)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

Similar to the CRBApplicationFailoverController, this nil check and defaulting logic for tolerationSeconds is a critical improvement. It directly addresses the potential for a nil pointer panic, enhancing the robustness of the application failover controller.

@codecov-commenter
Copy link

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 55.55556% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.54%. Comparing base (7fcc638) to head (f94a8a7).

Files with missing lines Patch % Lines
...ionfailover/crb_application_failover_controller.go 0.00% 1 Missing and 1 partial ⚠️
...tionfailover/rb_application_failover_controller.go 0.00% 1 Missing and 1 partial ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7127      +/-   ##
==========================================
- Coverage   46.55%   46.54%   -0.01%     
==========================================
  Files         700      700              
  Lines       48133    48142       +9     
==========================================
+ Hits        22406    22407       +1     
- Misses      24041    24047       +6     
- Partials     1686     1688       +2     
Flag Coverage Δ
unittests 46.54% <55.55%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@RainbowMango
Copy link
Member

@mszacillo @XiShanYongYe-Chang Maybe you can take a look.

@RainbowMango RainbowMango added this to the v1.17 milestone Jan 24, 2026
Copy link
Member

@mszacillo mszacillo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for finding this. Could you advise on how to reproduce this bug? You mention this only happens when generating the propagationpolicy programmatically or when updating from a previous version - which version upgrades cause an issue?

// If tolerationSeconds is not set, use the default value of 300 seconds
// to avoid nil pointer dereference in detectFailure.
if tolerationSeconds == nil {
tolerationSeconds = ptr.To[int32](300)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have an issue including an additional fallback for the tolerationSeconds if this is somehow unset, but your additional to the mutation webhook should be enough to safeguard against this field being empty. If we keep this, can we please add a warning log to notify the user that this field is not set, and that we expect it to be?

@mszacillo
Copy link
Member

@RainbowMango Another comment that you may have more context for. Is there a historical reason as to why we let tolerationSeconds be optional? Was this to make the use of the application failover API a bit easier for the user?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants