🐛fix ClusterCache doesn't pick latest kubeconfig secret proactively #12400
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Welcome @mogliang!
Hi @mogliang. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 4142631 to 49bbe5d (Compare)
added unit test
/ok-to-test
@@ -511,6 +512,16 @@ func (cc *clusterCache) Reconcile(ctx context.Context, req reconcile.Request) (r
requeueAfterDurations = append(requeueAfterDurations, accessor.config.HealthProbe.Interval)
You are hitting tooManyConsecutiveFailures in your scenario, right?
Would it also be enough for your use case to make HealthProbe.Timeout/Interval/FailureThreshold configurable?
Not really.
We have a proxy between the management cluster and the target clusters.
We see that after the kubeconfig is updated (the proxy address changed), the existing connection (the ClusterCache probe) still works, so it doesn't refetch the kubeconfig and keeps caching the old one, but new connections fail (e.g. the etcd client).
So if the health check opened a new connection, it would detect this, right?
That's correct. Another idea is to disconnect for a given period (e.g. 5m) to force the connection to be refreshed. Would this be better?
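For illustration, one way to sketch that alternative, reading "disconnect" as dropping the pooled connections so the next probe has to re-dial; the function and parameter names below are illustrative, not part of ClusterCache:

package sketch

import (
	"context"
	"net/http"
	"time"
)

// forceReconnectLoop periodically closes idle connections on the transport the
// cached client uses, so the next health probe has to dial again. If the proxy
// address from the old kubeconfig is no longer reachable, that dial fails, the
// probe starts failing, and the accessor eventually disconnects and re-reads the
// kubeconfig secret. CloseIdleConnections does not affect in-flight requests.
func forceReconnectLoop(ctx context.Context, tr *http.Transport, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			tr.CloseIdleConnections()
		}
	}
}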
I would just try to extend the health check so that it does two requests: one over the existing connection and one over a new connection.
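A minimal sketch of that shape, assuming an *http.Client built the usual way (pooled connections) and one whose transport was built fresh are both at hand; host would be rest.Config.Host, and /version is only used here as a cheap endpoint to hit:

package sketch

import (
	"context"
	"fmt"
	"net/http"
)

// probeTwice performs the health probe twice: once over the long-lived client,
// which reuses pooled connections, and once over a client whose transport was
// created fresh for this probe, so an already-open connection cannot hide a
// kubeconfig that no longer works for new dials.
func probeTwice(ctx context.Context, cachedClient, freshClient *http.Client, host string) error {
	for _, c := range []*http.Client{cachedClient, freshClient} {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, host+"/version", nil)
		if err != nil {
			return err
		}
		resp, err := c.Do(req)
		if err != nil {
			return err
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("health probe against %s returned HTTP %d", host, resp.StatusCode)
		}
	}
	return nil
}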
client-go will reuse the underlying transport (except when creating an SPDY transport).
Do you know where that happens? Somewhere in rest.HTTPClientFor or rest.TransportFor or transport.New?
In transport.New: if the rest config has .Transport set, it will just use it; if not, it will get one from the cache.
// New returns an http.RoundTripper that will provide the authentication
// or transport level security defined by the provided Config.
func New(config *Config) (http.RoundTripper, error) {
	// Set transport level security
	if config.Transport != nil && (config.HasCA() || config.HasCertAuth() || config.HasCertCallback() || config.TLS.Insecure) {
		return nil, fmt.Errorf("using a custom transport with TLS certificate options or the insecure flag is not allowed")
	}

	if !isValidHolders(config) {
		return nil, fmt.Errorf("misconfigured holder for dialer or cert callback")
	}

	var (
		rt  http.RoundTripper
		err error
	)

	if config.Transport != nil {
		rt = config.Transport
	} else {
		rt, err = tlsCache.get(config)
		if err != nil {
			return nil, err
		}
	}
Wondering if there is a good way to pass in a transport, without duplicating too much code
Let me check it.
@sbueringer
https://github.com/kubernetes-sigs/cluster-api/compare/release-1.9...mogliang:cluster-api:dev/qliang/v1.9.5-fixetcd2?expand=1
I copied the implementation from tlsCache.get.
Need to mention that establishing a TCP connection takes ~1s in our case, while a normal ping may take ~100ms, so this does add some reconcile time.
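For readers following the link, this is roughly the shape of building a RoundTripper from a rest.Config outside client-go's transport cache, a sketch of the approach rather than the exact code in the branch above; utilnet is k8s.io/apimachinery/pkg/util/net, and a custom dialer, if one were configured, would also need to be carried over:

package sketch

import (
	"net/http"

	utilnet "k8s.io/apimachinery/pkg/util/net"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/transport"
)

// newUncachedTransport builds an http.RoundTripper directly from a rest.Config,
// bypassing the tlsCache shown above, so every probe using it gets its own
// TCP/TLS connection instead of a pooled one.
func newUncachedTransport(cfg *rest.Config) (http.RoundTripper, error) {
	// Turn the rest.Config into the transport-level config (TLS, auth, proxy, ...).
	tCfg, err := cfg.TransportConfig()
	if err != nil {
		return nil, err
	}
	// Build the TLS client config from the CA/cert data in the kubeconfig.
	tlsCfg, err := transport.TLSConfigFor(tCfg)
	if err != nil {
		return nil, err
	}
	// Honor a proxy configured in the kubeconfig, falling back to the environment.
	proxy := http.ProxyFromEnvironment
	if tCfg.Proxy != nil {
		proxy = tCfg.Proxy
	}
	// Fresh http.Transport (not the cached one) with Kubernetes' transport defaults.
	rt := utilnet.SetTransportDefaults(&http.Transport{
		TLSClientConfig: tlsCfg,
		Proxy:           proxy,
	})
	// Re-apply the auth wrappers (bearer token, impersonation, user agent, ...).
	return transport.HTTPWrappersFor(tCfg, rt)
}

Since each request over such a transport pays the full TCP and TLS handshake, the extra ~1s per probe mentioned above is expected.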
Force-pushed from 2a67216 to 36e1ea5 (Compare)
Force-pushed from 730e1a5 to 754d34a (Compare)
/hold
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #12399
/area clustercache