Conversation

@vie-serendipity
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

The cluster informer manager should sync its cache first to avoid fetching directly from the apiserver. This means the detectors and other controllers have a dependency ordering, so the detectors should be placed in the controller-runtime cache case.

This avoids starting the detectors and other controllers in parallel, which leaves the cache unavailable and makes the other controllers query the apiserver directly.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 13, 2026
@gemini-code-assist
Copy link

Summary of Changes

Hello @vie-serendipity, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a potential race condition where controllers might attempt to fetch resources from the API server before the local informer cache is fully populated. By ensuring that the ResourceDetector's cache is synchronized upfront, it optimizes resource access patterns, reduces unnecessary load on the API server, and promotes more efficient operation of dependent controllers.

Highlights

  • Cache Synchronization Priority: Ensures that the informer cache is fully synchronized before other controllers, such as detectors, begin their operations. This prevents premature API server queries due to an empty or stale cache.
  • ResourceDetector Cache Integration: The ResourceDetector now implements a hasCache interface and provides a GetCache() method, allowing it to be integrated with controller-runtime's cache mechanisms.
  • Dedicated Cache Initialization: Introduced a new initCache() method within ResourceDetector to encapsulate the logic for setting up informers, starting them, and waiting for their caches to sync, improving modularity and control over the cache lifecycle.
  • Cache Synchronization Signaling: A syncedCh channel has been added to ResourceDetector, and a cacheAdapter struct was created to signal when the cache has completed its initial synchronization, providing a clear readiness indicator (see the sketch below).
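
Taken together, the highlights suggest roughly the following shape. This is only an illustrative sketch assembled from the descriptions above; the names initCache, GetCache, syncedCh, and cacheAdapter come from the PR, while the method bodies are assumptions:

// Sketch only; bodies are assumptions, not the PR's actual code.
func (d *ResourceDetector) initCache(ctx context.Context) error {
	// Register informers for all discoverable resources (elided),
	// start them, and block until their caches have synced.
	d.InformerManager.Start()
	d.InformerManager.WaitForCacheSync()
	close(d.syncedCh) // unblocks cacheAdapter.WaitForCacheSync
	return nil
}

// GetCache lets controller-runtime's manager treat the detector as a
// cache-backed runnable, so dependent controllers are started only after
// the detector's cache has synced.
func (d *ResourceDetector) GetCache() ctrlcache.Cache {
	return cacheAdapter{syncedCh: d.syncedCh}
}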

@karmada-bot karmada-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 13, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request modifies the ResourceDetector to ensure its informer cache is synchronized before other controllers start. This is achieved by introducing a blocking cache synchronization in initCache and providing a cacheAdapter that allows other components to wait for this synchronization.

My review focuses on a critical issue in the cacheAdapter implementation that could lead to a panic, a suggestion to refactor duplicated code for better maintainability, and a minor style guide violation regarding a missing comment on an exported method.

Overall, the approach is sound, but the implementation of cacheAdapter needs to be made safer to prevent potential runtime panics.

Comment on lines 1531 to 1534
type cacheAdapter struct {
	ctrlcache.Cache
	syncedCh chan struct{}
}

critical

The cacheAdapter struct embeds ctrlcache.Cache. When an instance is created with cacheAdapter{syncedCh: d.syncedCh}, the embedded field is a nil interface. If this cacheAdapter is used in a cache.MultiCache (a common pattern for combining caches in controller-runtime), any call to methods like Get or List will be dispatched to the embedded nil interface, causing a panic. This is a critical issue that could crash the controller manager.

To fix this, cacheAdapter should implement the ctrlcache.Cache interface without embedding it. The methods that are not intended to be used can return an error.

Here is a safe implementation:

import (
	"context"
	"errors"

	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrlcache "sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type cacheAdapter struct {
	syncedCh chan struct{}
}

// ensure cacheAdapter implements ctrlcache.Cache
var _ ctrlcache.Cache = cacheAdapter{}

func (c cacheAdapter) Get(context.Context, client.ObjectKey, client.Object, ...client.GetOption) error {
	return errors.New("get is not supported")
}

func (c cacheAdapter) List(context.Context, client.ObjectList, ...client.ListOption) error {
	return errors.New("list is not supported")
}

func (c cacheAdapter) GetInformer(context.Context, client.Object) (ctrlcache.Informer, error) {
	return nil, errors.New("getInformer is not supported")
}

func (c cacheAdapter) GetInformerForKind(context.Context, schema.GroupVersionKind) (ctrlcache.Informer, error) {
	return nil, errors.New("getInformerForKind is not supported")
}

func (c cacheAdapter) Start(context.Context) error {
	return errors.New("start is not supported")
}

func (c cacheAdapter) IndexField(context.Context, client.Object, string, client.IndexerFunc) error {
	return errors.New("indexField is not supported")
}

func (c cacheAdapter) WaitForCacheSync(ctx context.Context) bool {
	select {
	case <-ctx.Done():
		return false
	case <-c.syncedCh:
		return true
	}
}

	return nil
}

func (d *ResourceDetector) GetCache() ctrlcache.Cache {

medium

This exported method GetCache is missing a comment. According to the style guide, all exported functions and methods must be documented. Please add a comment explaining what this method does.

Suggested change
func (d *ResourceDetector) GetCache() ctrlcache.Cache {
// GetCache returns a cache that can be used to wait for the detector's informer cache to sync.
func (d *ResourceDetector) GetCache() ctrlcache.Cache {
References
  1. All exported functions, methods, structs, and interfaces must be documented with clear and concise comments describing their purpose and behavior. (link)

Comment on lines 201 to 208
newResources := lifted.GetDeletableResources(d.DiscoveryClientSet)
for r := range newResources {
	if d.InformerManager.IsHandlerExist(r, d.EventHandler) || d.gvrDisabled(r) {
		continue
	}
	klog.Infof("Setup informer for %s", r.String())
	d.InformerManager.ForResource(r, d.EventHandler)
}

medium

The logic for discovering resources and setting up informers is duplicated in initCache and discoverResources. Extracting it into a private helper method would improve maintainability and reduce redundancy.

For example, you could create a setupInformersForNewResources method to encapsulate this logic.
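
A sketch of what that could look like, reusing only the fields already shown in the snippet above (the method name setupInformersForNewResources is the suggestion here, not existing code); initCache and discoverResources would then both call it:

// setupInformersForNewResources registers the event handler for every newly
// discovered deletable resource that is not disabled and not already handled.
func (d *ResourceDetector) setupInformersForNewResources() {
	newResources := lifted.GetDeletableResources(d.DiscoveryClientSet)
	for r := range newResources {
		if d.InformerManager.IsHandlerExist(r, d.EventHandler) || d.gvrDisabled(r) {
			continue
		}
		klog.Infof("Setup informer for %s", r.String())
		d.InformerManager.ForResource(r, d.EventHandler)
	}
}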

@codecov-commenter

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 0% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.53%. Comparing base (6a17ea8) to head (67853b6).

Files with missing lines    Patch %   Lines
pkg/detector/detector.go    0.00%     20 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7104      +/-   ##
==========================================
- Coverage   46.55%   46.53%   -0.02%     
==========================================
  Files         700      700              
  Lines       48103    48123      +20     
==========================================
+ Hits        22395    22396       +1     
- Misses      24028    24047      +19     
  Partials     1680     1680              
Flag        Coverage Δ
unittests   46.53% <0.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@RainbowMango
Member

@vie-serendipity Can you elaborate on it? Which informer should be synced? And what's the side-effect?

@karmada-bot karmada-bot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 15, 2026
@vie-serendipity
Contributor Author

vie-serendipity commented Jan 15, 2026

When there are a lot of workloads in the cluster and the karmada controller manager restarts, the detector starts in parallel with the other controllers. This means that when a controller begins reconciling and fetches workloads, the detector's InformerManager may not have synced its cache yet and will fall back to the API server. Frequent API server access increases the load on the API server.

Therefore, it's reasonable to sync the cache before the real controllers start working. So I think we can put the detector into the hasCache case. That's what I did at first.

// Runnable dispatch in controller-runtime's manager (runnables.Add):
switch runnable := fn.(type) {
case *Server:
	...
case hasCache:
	...
case webhook.Server:
	...
case warmupRunnable, LeaderElectionRunnable:
	...
default:
	return r.LeaderElection.Add(fn, nil)
}

After consideration and testing, this brings a really long start-up time, forcing us to configure a long initialDelaySeconds for the liveness probe. This is not good.

I came up with another approach: block while getting or listing the objects from the SingleClusterInformerManager until its cache is synced, just like the client in controller-runtime does.
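
A minimal sketch of that idea; syncedLister and its synced channel are hypothetical names for illustration, not the actual Karmada API:

import (
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/cache"
)

// syncedLister wraps a lister handed out by the SingleClusterInformerManager
// so that reads block until the informer cache has synced, similar to what
// controller-runtime's delegating client does during startup.
type syncedLister struct {
	lister cache.GenericLister
	synced <-chan struct{} // closed once WaitForCacheSync succeeds
}

func (l *syncedLister) Get(name string) (runtime.Object, error) {
	<-l.synced // hold reads until the cache is ready
	return l.lister.Get(name)
}

func (l *syncedLister) List(selector labels.Selector) ([]runtime.Object, error) {
	<-l.synced
	return l.lister.List(selector)
}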

@karmada-bot karmada-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jan 15, 2026
@karmada-bot karmada-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 16, 2026
@vie-serendipity vie-serendipity force-pushed the feat/detector-cache branch 2 times, most recently from 5bf7f21 to 3336800 on January 16, 2026 06:13
@karmada-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chaunceyjiang for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@RainbowMango
Member

When there are a lot of workloads in the cluster and the karmada controller manager restarts, the detector starts in parallel with the other controllers. This means that when a controller begins reconciling and fetches workloads, the detector's InformerManager may not have synced its cache yet and will fall back to the API server. Frequent API server access increases the load on the API server.

Do you mean the controlPlaneInformerManager?

controlPlaneInformerManager := genericmanager.NewSingleClusterInformerManager(ctx, dynamicClientSet, opts.ResyncPeriod.Duration)

I understand that the controllers widely use this informer manager, but the new informers are registered in the detector:
func (d *ResourceDetector) discoverResources(ctx context.Context, period time.Duration) {
	wait.Until(func() {
		newResources := lifted.GetDeletableResources(d.DiscoveryClientSet)
		for r := range newResources {
			if d.InformerManager.IsHandlerExist(r, d.EventHandler) || d.gvrDisabled(r) {
				continue
			}
			klog.Infof("Setup informer for %s", r.String())
			d.InformerManager.ForResource(r, d.EventHandler)
		}
		d.InformerManager.Start()
	}, period, ctx.Done())
}

Frequent API server access increases the load of the API server.

How bad is it?

@vie-serendipity
Contributor Author

Do you mean the controlPlaneInformerManager?

Yes.

How bad is it?

Not too bad. At about the 100k workload level, after a restart, some requests go directly to the API server, like 250 req/s. It seems like the API server should be able to handle this level of load, but I think it could be optimized.

[screenshot: API server request-rate metrics]

@RainbowMango
Member

At about the 100k workload level, after a restart, some requests go directly to the API server, like 250 req/s.

It sounds like you have a higher QPS configuration, right? The default QPS is 40.

      --kube-api-burst int                                                                                                                                                                                       
                Burst to use while talking with karmada-apiserver. (default 60)
      --kube-api-qps float32                                                                                                                                                                                     
                QPS to use while talking with karmada-apiserver. (default 40)

@vie-serendipity
Contributor Author

It sounds like you have a higher qps configuration, right? As the default QPS is 40.

Yes, I set QPS and burst to 600 and 100, respectively.

@RainbowMango
Member

The QPS setting reflects the API server's processing capacity, and what we can do is ensure that the total requests from all controllers stay below this threshold. I don't see a strong need for further optimization here.

@vie-serendipity
Contributor Author

what we can do is ensure that the total requests from all controllers stay below this threshold.

I'm not sure we're aligned.
The controller's requests should be watch requests and write requests (create/delete/update); all read requests (get/list) should come from the cache. But here a large number of read requests are going directly to the API server, which I believe is not expected. I believe the higher QPS setting is meant to allow the controller to make more write requests, not to perform direct reads.

@RainbowMango
Member

Really? My understanding is that both read/write requests are counted for QPS. @zach593 Can you confirm that?

@vie-serendipity
Contributor Author

Really? My understanding is that both read/write requests are counted for QPS.

I think you misunderstand me. I mean the controller's read requests should go to the cache (reads from the cache are not counted against QPS) instead of directly to the API server (reads from the API server are counted against QPS), while the informer's list/watch requests are counted against QPS.
The current problem is that the controller can't get data from the cache and falls back to the API server, directly reading and consuming QPS, which I think is not expected. The ideal behaviour is to wait for the cache to sync and then read from the cache.

@RainbowMango
Member

Yeah, I get it. Thank you.

An idea in my mind is to let controllers that use the informer wait until the cache gets synced, but I doubt it is worth doing, as the QPS is still under control and not going beyond the API server's capacity. Doing so would slow down the controller start time.

		d.InformerManager.ForResource(r, d.EventHandler)
	}
	d.InformerManager.Start()
	d.InformerManager.WaitForCacheSync()
Contributor

Why is this wait needed here?

Contributor Author

The implementation has some problems. I first want to confirm whether this change is necessary.

@zach593
Contributor

zach593 commented Jan 21, 2026

Really? My understanding is that both read/write requests are counted for QPS. @zach593 Can you confirm that?

It seems I don’t need to answer this anymore.


After consideration and testing, this brings a really long start-up time, forcing us to configure a long initialDelaySeconds for the liveness probe. This is not good.

Would you like to explain a bit why the hasCache version caused a longer startup time than this version? @vie-serendipity

Or, can I understand it this way: the cache-sync time is roughly the same, but the hasCache approach affects the liveness probe result?

@zach593
Contributor

zach593 commented Jan 21, 2026

I basically agree with your point that reconciliation should happen after the relevant caches have synced, but I don’t understand in what situation the dynamic client would be used. Do you mean this part?

func (d *ResourceDetector) fetchResourceTemplate(rs policyv1alpha1.ResourceSelector) (*unstructured.Unstructured, error) {
	resourceTemplate, err := helper.FetchResourceTemplate(context.TODO(), d.DynamicClient, d.InformerManager, d.RESTMapper, helper.ConstructObjectReference(rs))
	if err != nil {
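
For context, the cache-first / dynamic-client-backup read pattern under discussion looks roughly like this. Illustrative sketch with hypothetical names; cacheReader is a stand-in for the subset of SingleClusterInformerManager used here, and fetchWithFallback is not Karmada's actual helper:

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/cache"
)

// cacheReader is a hypothetical subset of the informer manager's API.
type cacheReader interface {
	IsInformerSynced(resource schema.GroupVersionResource) bool
	Lister(resource schema.GroupVersionResource) cache.GenericLister
}

// fetchWithFallback prefers the informer cache and only falls back to the
// dynamic client (a direct API server read that consumes QPS) when the
// cache has not synced yet, e.g. right after a restart.
func fetchWithFallback(ctx context.Context, mgr cacheReader, dyn dynamic.Interface, gvr schema.GroupVersionResource, namespace, name string) (runtime.Object, error) {
	if mgr.IsInformerSynced(gvr) {
		return mgr.Lister(gvr).ByNamespace(namespace).Get(name)
	}
	return dyn.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
}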

@zach593
Contributor

zach593 commented Jan 21, 2026

Hi @RainbowMango , do you know the background of using the dynamic client as a backup in the past? I can see this pattern scattered across the repo, and some of the code paths are actually never reached.
Also, in our (ctrip) branch, these dynamic-client behaviors have already been removed as part of performance optimizations.
