Exit when stopped leading #8326
Conversation
In case of flapping failures that affected the leading pod, such as a flapping network, we noticed that the same pod tried to reacquire the lock after losing it instead of allowing another healthy pod, unaffected by the failure, to acquire the lock and become leader. This pull request changes the behaviour to work in accordance with other k8s components' leader election configuration, where if the leader stops leading it exits the process and lets another pod take leadership while Kubernetes creates a new pod to replace the old leader.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull request has been approved by: mu-soliman. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing

Hi @mu-soliman. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.

I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
```diff
 OnStoppedLeading: func() {
-	klog.Fatalf("lost master")
+	klog.Fatalf("lost master. Shutting down.")
+	klog.FlushAndExit(klog.ExitFlushTimeout, 1)
```
It doesn't look to me like this is meant to be used in addition to `klog.Fatalf`:

```go
// FlushAndExit flushes log data for a certain amount of time and then calls
// os.Exit. Combined with some logging call it provides a replacement for
// traditional calls like Fatal or Exit.
```

In fact this is unreachable code, as `klog.Fatalf` has the outcome of calling `OsExit(255)`:

```go
// Fatalf logs to the FATAL, ERROR, WARNING, and INFO logs,
// including a stack trace of all running goroutines, then calls OsExit(255).
// Arguments are handled in the manner of fmt.Printf; a newline is appended if missing.
func Fatalf(format string, args ...interface{}) {
	logging.printf(severity.FatalLog, logging.logger, logging.filter, format, args...)
}
```
This is not the current behaviour of `klog.Fatalf()`, even if the docs say so. If you follow the code, `logging.printf()` has the following implementation:

```go
func (l *loggingT) printf(s severity.Severity, logger *logWriter, filter LogFilter, format string, args ...interface{}) {
	l.printfDepth(s, logger, filter, 1, format, args...)
}
```

and `printfDepth` has the following implementation:

```go
func (l *loggingT) printfDepth(s severity.Severity, logger *logWriter, filter LogFilter, depth int, format string, args ...interface{}) {
	if false {
		_ = fmt.Sprintf(format, args...) // cause vet to treat this function like fmt.Printf
	}
	buf, file, line := l.header(s, depth)
	// If a logger is set and doesn't support writing a formatted buffer,
	// we clear the generated header as we rely on the backing
	// logger implementation to print headers.
	if logger != nil && logger.writeKlogBuffer == nil {
		buffer.PutBuffer(buf)
		buf = buffer.GetBuffer()
	}
	if filter != nil {
		format, args = filter.FilterF(format, args)
	}
	fmt.Fprintf(buf, format, args...)
	if buf.Bytes()[buf.Len()-1] != '\n' {
		buf.WriteByte('\n')
	}
	l.output(s, logger, buf, depth, file, line, false)
}
```
This behavior implements the sig-instrumentation guidelines of not using fatal.
Changed it to use `klog.Error()` for better clarity and conformity with the guidelines.
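For context, the guideline-compliant pattern discussed here is to log at error severity and then exit explicitly. A minimal sketch of that pattern (illustrative only, not the exact diff in this PR; it assumes `k8s.io/klog/v2` is vendored):

```go
package main

import (
	"k8s.io/klog/v2"
)

// onStoppedLeading shows the "avoid Fatal" pattern: log the failure at
// error severity, then flush buffered log data and exit with a non-zero
// code so another replica can acquire the leader lock.
func onStoppedLeading() {
	klog.ErrorS(nil, "lost leader lease, shutting down")
	klog.FlushAndExit(klog.ExitFlushTimeout, 1)
}

func main() {
	onStoppedLeading()
}
```

Unlike `klog.Fatalf`, this keeps the exit explicit and skips the goroutine stack dump, while `FlushAndExit` still guarantees buffered logs are written before the process terminates.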
I don't know if it matters, but klog.Fatal does exit, see: https://github.com/kubernetes/klog/blob/e7125f792ea66a85818cfb45261c9e1acc585344/klog.go#L936-L964
Here's a test:

```go
package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(nil)
	flag.Parse()
	defer klog.Flush()

	klog.Info("About to encounter a fatal error...")
	klog.Fatalf("This is a fatal error - the application will exit here!")
	// This line will never be reached
	klog.Info("This message will never be printed")
}
```

```
$ go run main.go
I0728 16:20:33.182323   99101 main.go:15] About to encounter a fatal error...
F0728 16:20:33.182456   99101 main.go:16] This is a fatal error - the application will exit here!
exit status 255
```
@mu-soliman As explained above, klog.Fatalf does exit: https://github.com/kubernetes/klog/blob/main/klog.go#L936
/cherry-pick cluster-autoscaler-release-1.31
@jackfrancis: once the present PR merges, I will cherry-pick it on top of cluster-autoscaler-release-1.31. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
As discussed in other comments, `klog.Fatalf()` works as documented (at least with regard to exiting), so this PR doesn't change behaviour and doesn't fix anything. From personal experience I can confirm that I've observed CA crashing on losing leader election in prod many times.

It sounds like you've encountered some bug that led to creating this PR, but as it's unlikely that this specific line of the code is to blame for it, could you provide more details? Ideally, open a new issue and document the behaviour you're seeing.

If you still want to refactor the code to comply with the referenced guidelines of avoiding using Fatalf at all, please do the following:
/hold
Closing it to submit a full PR to refactor all usages of `klog.Fatal()`.
What type of PR is this?
/kind bug
What this PR does / why we need it:
In case of flapping failures that affect the leading pod, such as a flapping network, we noticed that the same pod tried to reacquire the lock after losing it instead of allowing another healthy pod, unaffected by the failure, to acquire the lock and become leader.
This pull request changes the behaviour to work according to the sig-instrumentation guidelines of avoiding Fatal, and in accordance with other k8s components such as kube-scheduler in leader election configuration: if the leader pod stops leading, it exits the process and lets another pod take leadership while Kubernetes creates a new pod to replace the old leader.
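The exit-on-lost-lease behaviour described above corresponds to the standard client-go leader election pattern. A hedged sketch follows (the lock name, namespace, and timings are illustrative assumptions, not the autoscaler's actual wiring):

```go
package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.ErrorS(err, "failed to build in-cluster config")
		os.Exit(1)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Illustrative lock identity; a real component derives these from flags.
	lock, err := resourcelock.New(
		resourcelock.LeasesResourceLock,
		"kube-system", "example-leader-lock",
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	)
	if err != nil {
		klog.ErrorS(err, "failed to create resource lock")
		os.Exit(1)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Run the component's main loop here.
			},
			OnStoppedLeading: func() {
				// Exit instead of retrying to re-acquire the lock, so a
				// healthy replica can take over while Kubernetes restarts
				// this pod.
				klog.ErrorS(nil, "lost leader lease, exiting")
				klog.FlushAndExit(klog.ExitFlushTimeout, 1)
			},
		},
	})
}
```

This mirrors how kube-scheduler and kube-controller-manager treat a lost lease: the process terminates rather than contending for the lock again from a possibly unhealthy pod.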
Does this PR introduce a user-facing change?
NONE