Conversation

@vie-serendipity
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

The cluster informer manager should sync its cache first to avoid fetching directly from the apiserver. This means the detectors and other controllers have a dependency ordering, so the detectors should be placed in the controller-runtime cache case.

This avoids starting the detectors and other controllers in parallel, which leaves the cache unavailable and makes the other controllers query the apiserver directly.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 13, 2026
@gemini-code-assist
Copy link

Summary of Changes

Hello @vie-serendipity, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a potential race condition where controllers might attempt to fetch resources from the API server before the local informer cache is fully populated. By ensuring that the ResourceDetector's cache is synchronized upfront, it optimizes resource access patterns, reduces unnecessary load on the API server, and promotes more efficient operation of dependent controllers.

Highlights

  • Cache Synchronization Priority: Ensures that the informer cache is fully synchronized before other controllers, such as detectors, begin their operations. This prevents premature API server queries due to an empty or stale cache.
  • ResourceDetector Cache Integration: The ResourceDetector now implements a hasCache interface and provides a GetCache() method, allowing it to be integrated with controller-runtime's cache mechanisms.
  • Dedicated Cache Initialization: Introduced a new initCache() method within ResourceDetector to encapsulate the logic for setting up informers, starting them, and waiting for their caches to sync, improving modularity and control over the cache lifecycle.
  • Cache Synchronization Signaling: A syncedCh channel has been added to ResourceDetector, and a cacheAdapter struct was created to signal when the cache has completed its initial synchronization, providing a clear readiness indicator (see the sketch below).
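
Taken together, the highlights suggest roughly the following shape. This is only an illustrative sketch assembled from the descriptions above; the names initCache, GetCache, syncedCh, and cacheAdapter come from the PR, while the method bodies are assumptions:

// Sketch only; bodies are assumptions, not the PR's actual code.
func (d *ResourceDetector) initCache(ctx context.Context) error {
	// Register informers for all discoverable resources (elided),
	// start them, and block until their caches have synced.
	d.InformerManager.Start()
	d.InformerManager.WaitForCacheSync()
	close(d.syncedCh) // unblocks cacheAdapter.WaitForCacheSync
	return nil
}

// GetCache lets controller-runtime's manager treat the detector as a
// cache-backed runnable, so dependent controllers are started only after
// the detector's cache has synced.
func (d *ResourceDetector) GetCache() ctrlcache.Cache {
	return cacheAdapter{syncedCh: d.syncedCh}
}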

@karmada-bot karmada-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 13, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request modifies the ResourceDetector to ensure its informer cache is synchronized before other controllers start. This is achieved by introducing a blocking cache synchronization in initCache and providing a cacheAdapter that allows other components to wait for this synchronization.

My review focuses on a critical issue in the cacheAdapter implementation that could lead to a panic, a suggestion to refactor duplicated code for better maintainability, and a minor style guide violation regarding a missing comment on an exported method.

Overall, the approach is sound, but the implementation of cacheAdapter needs to be made safer to prevent potential runtime panics.

Comment on lines 1531 to 1534
type cacheAdapter struct {
	ctrlcache.Cache
	syncedCh chan struct{}
}

critical

The cacheAdapter struct embeds ctrlcache.Cache. When an instance is created with cacheAdapter{syncedCh: d.syncedCh}, the embedded field is a nil interface. If this cacheAdapter is used in a cache.MultiCache (a common pattern for combining caches in controller-runtime), any call to methods like Get or List will be dispatched to the embedded nil interface, causing a panic. This is a critical issue that could crash the controller manager.

To fix this, cacheAdapter should implement the ctrlcache.Cache interface without embedding it. The methods that are not intended to be used can return an error.

Here is a safe implementation:

import (
	"context"
	"errors"

	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrlcache "sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type cacheAdapter struct {
	syncedCh chan struct{}
}

// ensure cacheAdapter implements ctrlcache.Cache
var _ ctrlcache.Cache = cacheAdapter{}

func (c cacheAdapter) Get(context.Context, client.ObjectKey, client.Object, ...client.GetOption) error {
	return errors.New("get is not supported")
}

func (c cacheAdapter) List(context.Context, client.ObjectList, ...client.ListOption) error {
	return errors.New("list is not supported")
}

func (c cacheAdapter) GetInformer(context.Context, client.Object) (ctrlcache.Informer, error) {
	return nil, errors.New("getInformer is not supported")
}

func (c cacheAdapter) GetInformerForKind(context.Context, schema.GroupVersionKind) (ctrlcache.Informer, error) {
	return nil, errors.New("getInformerForKind is not supported")
}

func (c cacheAdapter) Start(context.Context) error {
	return errors.New("start is not supported")
}

func (c cacheAdapter) IndexField(context.Context, client.Object, string, client.IndexerFunc) error {
	return errors.New("indexField is not supported")
}

func (c cacheAdapter) WaitForCacheSync(ctx context.Context) bool {
	select {
	case <-ctx.Done():
		return false
	case <-c.syncedCh:
		return true
	}
}

	return nil
}

func (d *ResourceDetector) GetCache() ctrlcache.Cache {

medium

This exported method GetCache is missing a comment. According to the style guide, all exported functions and methods must be documented. Please add a comment explaining what this method does.

Suggested change
func (d *ResourceDetector) GetCache() ctrlcache.Cache {
// GetCache returns a cache that can be used to wait for the detector's informer cache to sync.
func (d *ResourceDetector) GetCache() ctrlcache.Cache {
References
  1. All exported functions, methods, structs, and interfaces must be documented with clear and concise comments describing their purpose and behavior. (link)

Comment on lines 201 to 208
newResources := lifted.GetDeletableResources(d.DiscoveryClientSet)
for r := range newResources {
	if d.InformerManager.IsHandlerExist(r, d.EventHandler) || d.gvrDisabled(r) {
		continue
	}
	klog.Infof("Setup informer for %s", r.String())
	d.InformerManager.ForResource(r, d.EventHandler)
}

medium

The logic for discovering resources and setting up informers is duplicated in initCache and discoverResources. Extracting it into a private helper method would improve maintainability and reduce redundancy.

For example, you could create a setupInformersForNewResources method to encapsulate this logic.
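
A sketch of what that could look like, reusing only the fields already shown in the snippet above (the method name setupInformersForNewResources is the suggestion here, not existing code); initCache and discoverResources would then both call it:

// setupInformersForNewResources registers the event handler for every newly
// discovered deletable resource that is not disabled and not already handled.
func (d *ResourceDetector) setupInformersForNewResources() {
	newResources := lifted.GetDeletableResources(d.DiscoveryClientSet)
	for r := range newResources {
		if d.InformerManager.IsHandlerExist(r, d.EventHandler) || d.gvrDisabled(r) {
			continue
		}
		klog.Infof("Setup informer for %s", r.String())
		d.InformerManager.ForResource(r, d.EventHandler)
	}
}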

@codecov-commenter

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 0% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.53%. Comparing base (6a17ea8) to head (67853b6).

Files with missing lines    Patch %   Lines
pkg/detector/detector.go    0.00%     20 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7104      +/-   ##
==========================================
- Coverage   46.55%   46.53%   -0.02%     
==========================================
  Files         700      700              
  Lines       48103    48123      +20     
==========================================
+ Hits        22395    22396       +1     
- Misses      24028    24047      +19     
  Partials     1680     1680              
Flag        Coverage Δ
unittests   46.53% <0.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@RainbowMango
Member

@vie-serendipity Can you elaborate on it? Which informer should be synced? And what's the side-effect?

@karmada-bot karmada-bot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 15, 2026
@vie-serendipity
Contributor Author

vie-serendipity commented Jan 15, 2026

When there are a lot of workloads in the cluster and the karmada controller manager restarts, the detector starts in parallel with the other controllers. This means that when a controller begins reconciling and fetches workloads, the detector's InformerManager may not have synced its cache yet and will fall back to the API server. Frequent API server access increases the load on the API server.

Therefore, it's reasonable to sync the cache before the real controllers start working. So I think we can put the detector into the hasCache case. That's what I did at first.

// Runnable dispatch in controller-runtime's manager (runnables.Add):
switch runnable := fn.(type) {
case *Server:
	...
case hasCache:
	...
case webhook.Server:
	...
case warmupRunnable, LeaderElectionRunnable:
	...
default:
	return r.LeaderElection.Add(fn, nil)
}

After consideration and testing, this brings a really long start-up time, forcing us to configure a long initialDelaySeconds for the liveness probe. This is not good.

I came up with another approach: block while getting or listing the objects from the SingleClusterInformerManager until its cache is synced, just like the client in controller-runtime does.
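
A minimal sketch of that idea; syncedLister and its synced channel are hypothetical names for illustration, not the actual Karmada API:

import (
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/cache"
)

// syncedLister wraps a lister handed out by the SingleClusterInformerManager
// so that reads block until the informer cache has synced, similar to what
// controller-runtime's delegating client does during startup.
type syncedLister struct {
	lister cache.GenericLister
	synced <-chan struct{} // closed once WaitForCacheSync succeeds
}

func (l *syncedLister) Get(name string) (runtime.Object, error) {
	<-l.synced // hold reads until the cache is ready
	return l.lister.Get(name)
}

func (l *syncedLister) List(selector labels.Selector) ([]runtime.Object, error) {
	<-l.synced
	return l.lister.List(selector)
}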

@karmada-bot karmada-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 15, 2026
@karmada-bot karmada-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 16, 2026
@vie-serendipity vie-serendipity force-pushed the feat/detector-cache branch 2 times, most recently from 5bf7f21 to 3336800 on January 16, 2026 06:13
@karmada-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chaunceyjiang for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@RainbowMango
Member

When there are a lot of workloads in the cluster and the karmada controller manager restarts, the detector starts in parallel with the other controllers. This means that when a controller begins reconciling and fetches workloads, the detector's InformerManager may not have synced its cache yet and will fall back to the API server. Frequent API server access increases the load on the API server.

Do you mean the controlPlaneInformerManager?

controlPlaneInformerManager := genericmanager.NewSingleClusterInformerManager(ctx, dynamicClientSet, opts.ResyncPeriod.Duration)

I understand that the controllers widely use this informer manager, but the new informers are registered in the detector:
func (d *ResourceDetector) discoverResources(ctx context.Context, period time.Duration) {
	wait.Until(func() {
		newResources := lifted.GetDeletableResources(d.DiscoveryClientSet)
		for r := range newResources {
			if d.InformerManager.IsHandlerExist(r, d.EventHandler) || d.gvrDisabled(r) {
				continue
			}
			klog.Infof("Setup informer for %s", r.String())
			d.InformerManager.ForResource(r, d.EventHandler)
		}
		d.InformerManager.Start()
	}, period, ctx.Done())
}

Frequent API server access increases the load of the API server.

How bad is it?

@vie-serendipity
Contributor Author

Do you mean the controlPlaneInformerManager?

Yes.

How bad is it?

Not too bad. At about the 100k workload level, after a restart, some requests go directly to the API server, like 250 req/s. It seems like the API server should be able to handle this level of load, but I think it could be optimized.

[screenshot: API server request-rate metrics]

@RainbowMango
Member

At about the 100k workload level, after a restart, some requests go directly to the API server, like 250 req/s.

It sounds like you have a higher QPS configuration, right? The default QPS is 40.

      --kube-api-burst int                                                                                                                                                                                       
                Burst to use while talking with karmada-apiserver. (default 60)
      --kube-api-qps float32                                                                                                                                                                                     
                QPS to use while talking with karmada-apiserver. (default 40)

@vie-serendipity
Contributor Author

It sounds like you have a higher qps configuration, right? As the default QPS is 40.

Yes, I set QPS and burst to 600 and 100, respectively.

@RainbowMango
Member

The QPS setting reflects the API server's processing capacity, and what we can do is ensure that the total requests from all controllers stay below this threshold. I don't see a strong need for further optimization here.

@vie-serendipity
Contributor Author

what we can do is ensure that the total requests from all controllers stay below this threshold.

I'm not sure we're aligned.
The controller's requests should be watch requests and write requests (create/delete/update); all read requests (get/list) should come from the cache. But here a large number of read requests are going directly to the API server, which I believe is not expected. I believe the higher QPS setting is meant to allow the controller to make more write requests, not to perform direct reads.

@RainbowMango
Member

Really? My understanding is that both read/write requests are counted for QPS. @zach593 Can you confirm that?

@vie-serendipity
Contributor Author

Really? My understanding is that both read/write requests are counted for QPS.

I think you misunderstand me. I mean the controller's read requests should go to the cache (reads from the cache are not counted against QPS) instead of directly to the API server (reads from the API server are counted against QPS), while the informer's list/watch requests are counted against QPS.
The current problem is that the controller can't get data from the cache and falls back to the API server, directly reading and consuming QPS, which I think is not expected. The ideal behaviour is to wait for the cache to sync and then read from the cache.

@RainbowMango
Member

Yeah, I get it. Thank you.

An idea in my mind is to let controllers that use the informer wait until the cache gets synced, but I doubt it is worth doing, as the QPS is still under control and not going beyond the API server's capacity. Doing so would slow down the controller start time.

		d.InformerManager.ForResource(r, d.EventHandler)
	}
	d.InformerManager.Start()
	d.InformerManager.WaitForCacheSync()
Contributor

Why is this wait needed here?

Contributor Author

The implementation has some problems. I first want to confirm whether this change is necessary.

@zach593
Contributor

zach593 commented Jan 21, 2026

Really? My understanding is that both read/write requests are counted for QPS. @zach593 Can you confirm that?

It seems I don’t need to answer this anymore.


After consideration and testing, this brings a really long start-up time, forcing us to configure a long initialDelaySeconds for the liveness probe. This is not good.

Would you like to explain a bit why the hasCache version caused a longer startup time than this version? @vie-serendipity

Or, can I understand it this way: the cache-sync time is roughly the same, but the hasCache approach affects the liveness probe result?

@zach593
Contributor

zach593 commented Jan 21, 2026

I basically agree with your point that reconciliation should happen after the relevant caches have synced, but I don’t understand in what situation the dynamic client would be used. Do you mean this part?

func (d *ResourceDetector) fetchResourceTemplate(rs policyv1alpha1.ResourceSelector) (*unstructured.Unstructured, error) {
	resourceTemplate, err := helper.FetchResourceTemplate(context.TODO(), d.DynamicClient, d.InformerManager, d.RESTMapper, helper.ConstructObjectReference(rs))
	if err != nil {
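
For context, the cache-first / dynamic-client-backup read pattern under discussion looks roughly like this. Illustrative sketch with hypothetical names; cacheReader is a stand-in for the subset of SingleClusterInformerManager used here, and fetchWithFallback is not Karmada's actual helper:

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/cache"
)

// cacheReader is a hypothetical subset of the informer manager's API.
type cacheReader interface {
	IsInformerSynced(resource schema.GroupVersionResource) bool
	Lister(resource schema.GroupVersionResource) cache.GenericLister
}

// fetchWithFallback prefers the informer cache and only falls back to the
// dynamic client (a direct API server read that consumes QPS) when the
// cache has not synced yet, e.g. right after a restart.
func fetchWithFallback(ctx context.Context, mgr cacheReader, dyn dynamic.Interface, gvr schema.GroupVersionResource, namespace, name string) (runtime.Object, error) {
	if mgr.IsInformerSynced(gvr) {
		return mgr.Lister(gvr).ByNamespace(namespace).Get(name)
	}
	return dyn.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
}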

@zach593
Contributor

zach593 commented Jan 21, 2026

Hi @RainbowMango , do you know the background of using the dynamic client as a backup in the past? I can see this pattern scattered across the repo, and some of the code paths are actually never reached.
Also, in our (ctrip) branch, these dynamic-client behaviors have already been removed as part of performance optimizations.
