feat: informer should sync cache before other controllers start #7104
Conversation
Summary of Changes

Hello @vie-serendipity, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a potential race condition where controllers might attempt to fetch resources from the API server before the local informer cache is fully populated. By ensuring that the detector's informer cache is synced before other controllers start, reads are served from the local cache instead of hitting the API server directly.
Code Review
This pull request modifies the ResourceDetector to ensure its informer cache is synchronized before other controllers start. This is achieved by introducing a blocking cache synchronization in initCache and providing a cacheAdapter that allows other components to wait for this synchronization.
My review focuses on a critical issue in the cacheAdapter implementation that could lead to a panic, a suggestion to refactor duplicated code for better maintainability, and a minor style guide violation regarding a missing comment on an exported method.
Overall, the approach is sound, but the implementation of cacheAdapter needs to be made safer to prevent potential runtime panics.
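For orientation, here is a minimal sketch of the flow under review. The names initCache, InformerManager, and syncedCh come from the PR; the simplified types around them are assumptions, not the actual detector code.

```go
// Simplified sketch, not the PR's exact code.
type informerManager interface {
	Start()
	WaitForCacheSync()
}

type resourceDetector struct {
	informerManager informerManager
	syncedCh        chan struct{} // closed once the cache has synced
}

// initCache blocks until all informers have synced, then signals waiters
// (such as cacheAdapter.WaitForCacheSync below) by closing syncedCh.
func (d *resourceDetector) initCache() {
	d.informerManager.Start()
	d.informerManager.WaitForCacheSync()
	close(d.syncedCh)
}
```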
pkg/detector/detector.go (Outdated)

```go
type cacheAdapter struct {
	ctrlcache.Cache
	syncedCh chan struct{}
}
```
The cacheAdapter struct embeds ctrlcache.Cache. When an instance is created with cacheAdapter{syncedCh: d.syncedCh}, the embedded field is a nil interface. If this cacheAdapter is used in a cache.MultiCache (a common pattern for combining caches in controller-runtime), any call to methods like Get or List will be dispatched to the embedded nil interface, causing a panic. This is a critical issue that could crash the controller manager.
To fix this, cacheAdapter should implement the ctrlcache.Cache interface without embedding it. The methods that are not intended to be used can return an error.
Here is a safe implementation:
```go
import (
	"context"
	"errors"

	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrlcache "sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type cacheAdapter struct {
	syncedCh chan struct{}
}

// ensure cacheAdapter implements ctrlcache.Cache
var _ ctrlcache.Cache = cacheAdapter{}

func (c cacheAdapter) Get(context.Context, client.ObjectKey, client.Object, ...client.GetOption) error {
	return errors.New("get is not supported")
}

func (c cacheAdapter) List(context.Context, client.ObjectList, ...client.ListOption) error {
	return errors.New("list is not supported")
}

func (c cacheAdapter) GetInformer(context.Context, client.Object) (ctrlcache.Informer, error) {
	return nil, errors.New("getInformer is not supported")
}

func (c cacheAdapter) GetInformerForKind(context.Context, schema.GroupVersionKind) (ctrlcache.Informer, error) {
	return nil, errors.New("getInformerForKind is not supported")
}

func (c cacheAdapter) Start(context.Context) error {
	return errors.New("start is not supported")
}

func (c cacheAdapter) IndexField(context.Context, client.Object, string, client.IndexerFunc) error {
	return errors.New("indexField is not supported")
}

func (c cacheAdapter) WaitForCacheSync(ctx context.Context) bool {
	select {
	case <-ctx.Done():
		return false
	case <-c.syncedCh:
		return true
	}
}
```
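A hypothetical usage sketch to show the intent, assuming GetCache returns the adapter above (the function name and error message here are illustrative, not the PR's actual wiring):

```go
// waitForDetectorCache blocks a dependent component until the detector's
// informer cache has synced, or the context is cancelled.
func waitForDetectorCache(ctx context.Context, d *ResourceDetector) error {
	if !d.GetCache().WaitForCacheSync(ctx) {
		return errors.New("detector informer cache did not sync before the context was cancelled")
	}
	return nil
}
```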
pkg/detector/detector.go (Outdated)

```go
	return nil
}

func (d *ResourceDetector) GetCache() ctrlcache.Cache {
```
This exported function GetCache is missing a comment. According to the style guide, all exported functions must be documented. Please add a comment explaining what this function does.
Suggested change:

```diff
+// GetCache returns a cache that can be used to wait for the detector's informer cache to sync.
 func (d *ResourceDetector) GetCache() ctrlcache.Cache {
```
References
- All exported functions, methods, structs, and interfaces must be documented with clear and concise comments describing their purpose and behavior. (link)
pkg/detector/detector.go (Outdated)

```go
newResources := lifted.GetDeletableResources(d.DiscoveryClientSet)
for r := range newResources {
	if d.InformerManager.IsHandlerExist(r, d.EventHandler) || d.gvrDisabled(r) {
		continue
	}
	klog.Infof("Setup informer for %s", r.String())
	d.InformerManager.ForResource(r, d.EventHandler)
}
```
The logic for discovering resources and setting up informers is duplicated in initCache and discoverResources. This duplication can be extracted into a private helper method to improve code maintainability and reduce redundancy.
For example, you could create a setupInformersForNewResources method to encapsulate this logic.
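One possible shape for that helper, sketched from the duplicated loop shown above. Only the method name comes from the suggestion; the body assumes the detector's existing fields and the lifted/klog imports already present in detector.go.

```go
// setupInformersForNewResources registers an informer (with the detector's
// event handler) for every newly discovered resource that is not disabled
// and does not already have a handler. Both initCache and discoverResources
// could then call this single helper.
func (d *ResourceDetector) setupInformersForNewResources() {
	newResources := lifted.GetDeletableResources(d.DiscoveryClientSet)
	for r := range newResources {
		if d.InformerManager.IsHandlerExist(r, d.EventHandler) || d.gvrDisabled(r) {
			continue
		}
		klog.Infof("Setup informer for %s", r.String())
		d.InformerManager.ForResource(r, d.EventHandler)
	}
}
```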
Codecov Report

❌ Patch coverage is 5.00000% with 19 lines in your changes missing coverage.
Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #7104      +/-   ##
==========================================
- Coverage   46.55%   46.53%   -0.02%
==========================================
  Files         700      700
  Lines       48103    48123      +20
==========================================
+ Hits        22395    22396       +1
- Misses      24028    24047      +19
  Partials     1680     1680
```

☔ View full report in Codecov by Sentry.
@vie-serendipity Can you elaborate on it? Which informer should be synced? And what's the side-effect?
When there are a lot of workloads in the cluster and the karmada-controller-manager restarts, the detector starts in parallel with the other controllers. This means that when a controller begins to reconcile and gets workloads, the detector's InformerManager may not have synced its cache yet, so reads fall back to the API server. Frequent API server access increases the API server's load. Therefore, it's reasonable to sync the cache before the real controllers start to work. So I think we can put the detector into the hasCache case; that's what I did before:

```go
switch runnable := fn.(type) {
case *Server:
case hasCache:
case webhook.Server:
case warmupRunnable, LeaderElectionRunnable:
default:
	return r.LeaderElection.Add(fn, nil)
}
```

After considering and testing, this brings a really long start-up time, meaning we would have to configure a long initialDelaySeconds for the liveness probe. That is not good. I came up with another approach: block while getting or listing objects from the SingleClusterInformerManager until its cache is synced, just like the client in controller-runtime does.
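A rough sketch of that blocking-read idea, assuming the manager wraps a dynamic shared informer factory; the names syncedManager, syncedCh, and stopCh here are hypothetical, not karmada's actual code:

```go
import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/tools/cache"
)

// syncedManager wraps a dynamic informer factory and delays reads until
// the cache has synced, similar to controller-runtime's cache-backed client.
type syncedManager struct {
	factory  dynamicinformer.DynamicSharedInformerFactory
	syncedCh chan struct{} // closed once WaitForCacheSync has returned
	stopCh   <-chan struct{}
}

// Lister blocks until the cache is synced (or the manager stops), so the
// first Get/List never falls back to the API server.
func (m *syncedManager) Lister(gvr schema.GroupVersionResource) cache.GenericLister {
	select {
	case <-m.syncedCh:
	case <-m.stopCh:
	}
	return m.factory.ForResource(gvr).Lister()
}
```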
Do you mean the InformerManager?
I understand that the controllers widely use this informer manager, but the new informers are registered in the detector (karmada/pkg/detector/detector.go, lines 186 to 198 in fe621a2).
How bad is it?
It sounds like you have a higher QPS configuration, right? The default QPS is 40.
Yes, I set QPS and burst to 600 and 100, respectively.
The QPS setting reflects the API server's processing capacity, and what we can do is ensure that the total requests from all controllers stay below this threshold. I don't see a strong need for further optimization here.
I'm not sure we're aligned. |
Really? My understanding is that both read/write requests are counted for QPS. @zach593 Can you confirm that?
I think you misunderstand me. I mean the controller's read requests should go to the cache (reads from the cache are not counted against QPS) instead of directly to the API server (reads from the API server are counted against QPS), while the informer's list/watch requests are also counted against QPS.
Yeah, I get it. Thank you. One idea in my mind is to let controllers that use the informer wait until the cache gets synced, but I doubt it is worth doing, as the QPS is still under control and does not go beyond the API server's capacity. Doing so would slow down controller start time.
```go
	d.InformerManager.ForResource(r, d.EventHandler)
}
d.InformerManager.Start()
d.InformerManager.WaitForCacheSync()
```
Why is this wait needed here?
The implementation has some problems; I want to confirm whether this change is necessary.
It seems I don’t need to answer this anymore.
Would you like to explain a bit why the hasCache version caused a longer startup time than this version? @vie-serendipity Or can I understand it this way: the cache-sync time is roughly the same, but the hasCache approach affects the liveness probe result?
I basically agree with your point that reconciliation should happen after the relevant caches have synced, but I don't understand in what situation the dynamic client would be used. Do you mean this part? (karmada/pkg/detector/preemption.go, lines 228 to 230 in d58cbe1)
Hi @RainbowMango, do you know the background of using the dynamic client as a backup in the past? I can see this pattern scattered across the repo, and some of the code paths are actually never reached.

What type of PR is this?
/kind feature
What this PR does / why we need it:
The cluster informer manager should sync its cache first to avoid fetching directly from the API server. This means the detector and the other controllers have a dependency between them, so the detector should be placed in controller-runtime's hasCache case.
This avoids starting the detector and the other controllers in parallel, which leaves the cache unavailable and makes the other controllers query the API server directly.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: