Exit when stopped leading #8326

mu-soliman · 2025-07-16T09:17:38Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

In case of flapping failures that affect the leading pod, such as a flapping network, we noticed that the same pod tried to reacquire the lock after losing it instead of allowing another healthy pod, unaffected by the failure, to acquire the lock and become leader.

This pull request changes the behaviour to follow the sig-instrumentation guideline of avoiding fatal, and to match other k8s components, such as kube-scheduler, in their leader election configuration: if the leader pod stops leading, it exits the process and lets another pod take leadership while k8s creates a new pod to replace the old leader.
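A minimal sketch of the "exit when leadership is lost" policy described above, using only the standard library. The `leaderCallbacks` type and `newCallbacks` helper are hypothetical stand-ins for illustration, not the client-go types or the actual cluster-autoscaler code. The exit function is injected so the policy can be exercised without terminating the process.

```go
package main

import (
	"fmt"
	"os"
)

// leaderCallbacks is a hypothetical stand-in for the callback struct of a
// leader election library; it is not the client-go type.
type leaderCallbacks struct {
	OnStartedLeading func()
	OnStoppedLeading func()
}

// newCallbacks wires the "exit when leadership is lost" policy. The exit
// function is injected; a real binary would pass os.Exit.
func newCallbacks(run func(), exit func(code int)) leaderCallbacks {
	return leaderCallbacks{
		OnStartedLeading: run,
		OnStoppedLeading: func() {
			fmt.Fprintln(os.Stderr, "lost leadership, shutting down")
			// Exit instead of trying to reacquire the lock, so that a
			// healthy replica can take over while this pod is replaced.
			exit(1)
		},
	}
}

func main() {
	exitCode := -1
	cb := newCallbacks(
		func() { fmt.Println("leading") },
		func(code int) { exitCode = code }, // recorder used for the demo only
	)
	cb.OnStartedLeading()
	cb.OnStoppedLeading()
	fmt.Println("exit requested with code", exitCode)
}
```

Injecting the exit function is a design choice for testability only; the behaviour the PR describes is simply that the process ends once leadership is lost.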

Does this PR introduce a user-facing change?

NONE

In case of flapping failures that affected the leading pod, such as a flapping network, we noticed that the same pod tried to reacquire the lock after losing it instead of allowing another healthy pod, unaffected by the failure, to acquire the lock and become leader. This pull request changes the behaviour to match other k8s components in their leader election configuration: if the leader stops leading, it exits the process and lets another pod take leadership while k8s creates a new pod to replace the old leader.

k8s-ci-robot · 2025-07-16T09:17:47Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mu-soliman
Once this PR has been reviewed and has the lgtm label, please assign feiskyer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

cluster-autoscaler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2025-07-16T09:17:48Z

Hi @mu-soliman. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

jackfrancis · 2025-07-26T02:44:44Z

cluster-autoscaler/main.go

 				OnStoppedLeading: func() {
-					klog.Fatalf("lost master")
+					klog.Fatalf("lost master. Shutting down.")
+					klog.FlushAndExit(klog.ExitFlushTimeout, 1)


It doesn't look to me like this is meant to be used in addition to klog.Fatalf:

    // FlushAndExit flushes log data for a certain amount of time and then calls
    // os.Exit. Combined with some logging call it provides a replacement for
    // traditional calls like Fatal or Exit.

In fact this is unreachable code as klog.Fatalf has the outcome of calling OsExit(255).

    // Fatalf logs to the FATAL, ERROR, WARNING, and INFO logs,
    // including a stack trace of all running goroutines, then calls OsExit(255).
    // Arguments are handled in the manner of fmt.Printf; a newline is appended if missing.
    func Fatalf(format string, args ...interface{}) {
    	logging.printf(severity.FatalLog, logging.logger, logging.filter, format, args...)
    }
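To illustrate the reviewer's point that a statement placed after an exiting log call never runs, here is a small standard-library sketch. `fatalf`, `osExit`, and `runUntilExit` are hypothetical names that only mimic the documented "log, then call OsExit(255)" contract; this is not klog's implementation. The exit hook is swapped for one that panics so the control flow can be observed without ending the process.

```go
package main

import (
	"fmt"
	"os"
)

// osExit is a hypothetical package-level exit hook that tests can replace.
var osExit = os.Exit

// fatalf is a stand-in for a Fatalf-style logger: it writes the message and
// then terminates via osExit(255).
func fatalf(format string, args ...interface{}) {
	fmt.Fprintln(os.Stderr, "F "+fmt.Sprintf(format, args...))
	osExit(255)
}

// exitSignal carries the requested exit code out of a simulated exit.
type exitSignal struct{ code int }

// runUntilExit replaces osExit with a hook that panics, runs fn, and returns
// the exit code that was requested, or -1 if fn returned normally.
func runUntilExit(fn func()) (code int) {
	prev := osExit
	defer func() { osExit = prev }()
	osExit = func(c int) { panic(exitSignal{code: c}) }

	code = -1
	defer func() {
		if r := recover(); r != nil {
			sig, ok := r.(exitSignal)
			if !ok {
				panic(r) // not ours, re-raise
			}
			code = sig.code
		}
	}()
	fn()
	return code
}

func main() {
	reached := false
	code := runUntilExit(func() {
		fatalf("lost master")
		reached = true // never runs: fatalf does not return
	})
	fmt.Println("exit code:", code, "statement after fatalf reached:", reached)
}
```

Under these assumptions the second statement is dead code, which is why pairing an exiting Fatalf with a following exit call adds nothing.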

mu-soliman · 2025-07-28

This is not the current behaviour of klog.Fatalf(), even if the docs say so. If you follow the code, logging.printf() has the following implementation:

    func (l *loggingT) printf(s severity.Severity, logger *logWriter, filter LogFilter, format string, args ...interface{}) {
    	l.printfDepth(s, logger, filter, 1, format, args...)
    }

and printfDepth has the following implementation:

    func (l *loggingT) printfDepth(s severity.Severity, logger *logWriter, filter LogFilter, depth int, format string, args ...interface{}) {
    	if false {
    		_ = fmt.Sprintf(format, args...) // cause vet to treat this function like fmt.Printf
    	}

    	buf, file, line := l.header(s, depth)
    	// If a logger is set and doesn't support writing a formatted buffer,
    	// we clear the generated header as we rely on the backing
    	// logger implementation to print headers.
    	if logger != nil && logger.writeKlogBuffer == nil {
    		buffer.PutBuffer(buf)
    		buf = buffer.GetBuffer()
    	}
    	if filter != nil {
    		format, args = filter.FilterF(format, args)
    	}
    	fmt.Fprintf(buf, format, args...)
    	if buf.Bytes()[buf.Len()-1] != '\n' {
    		buf.WriteByte('\n')
    	}
    	l.output(s, logger, buf, depth, file, line, false)
    }

This behavior implements the sig-instrumentation guidelines of not using fatal.

mu-soliman · 2025-07-28

Changed it to use klog.Error() for better clarity and conformity with the guidelines.
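The pattern named in the FlushAndExit doc comment quoted earlier (log at error level, flush for a bounded time, then exit) can be sketched with the standard library as follows. `flushAndExit` is a hypothetical helper written for illustration, not klog's function, and the exit function is injected so the example does not terminate the process.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"time"
)

// flushAndExit is a hypothetical helper: it gives flush at most timeout to
// finish, then calls exit(code) whether or not the flush completed.
func flushAndExit(flush func(), timeout time.Duration, code int, exit func(int)) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		flush()
	}()
	select {
	case <-done: // flush finished in time
	case <-time.After(timeout): // give up waiting on a stuck flush
	}
	exit(code)
}

func main() {
	out := bufio.NewWriter(os.Stderr)
	// Error-level message instead of a fatal one.
	fmt.Fprintln(out, "E lost master, shutting down")

	exited := -1
	flushAndExit(
		func() { out.Flush() },
		time.Second,
		1,
		func(c int) { exited = c }, // a real binary would pass os.Exit
	)
	fmt.Println("exit requested with code", exited)
}
```

Bounding the flush keeps a stuck log sink from preventing the exit, so the process still ends and another replica can take over.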

adrianmoisey · 2025-07-28

I don't know if it matters, but klog.Fatal does exit, see: https://github.com/kubernetes/klog/blob/e7125f792ea66a85818cfb45261c9e1acc585344/klog.go#L936-L964

Here's a test:

    package main

    import (
    	"flag"

    	"k8s.io/klog/v2"
    )

    func main() {
    	klog.InitFlags(nil)
    	flag.Parse()
    	defer klog.Flush()

    	klog.Info("About to encounter a fatal error...")
    	klog.Fatalf("This is a fatal error - the application will exit here!")
    	// This line will never be reached
    	klog.Info("This message will never be printed")
    }

    $ go run main.go
    I0728 16:20:33.182323 99101 main.go:15] About to encounter a fatal error...
    F0728 16:20:33.182456 99101 main.go:16] This is a fatal error - the application will exit here!
    exit status 255

aleksandra-malinowska · 2025-07-28

@mu-soliman As explained above, klog.Fatalf does exit: https://github.com/kubernetes/klog/blob/main/klog.go#L936

jackfrancis · 2025-07-28T14:16:07Z

/cherry-pick cluster-autoscaler-release-1.31
/cherry-pick cluster-autoscaler-release-1.32
/cherry-pick cluster-autoscaler-release-1.33

k8s-infra-cherrypick-robot · 2025-07-28T14:16:09Z

@jackfrancis: once the present PR merges, I will cherry-pick it on top of cluster-autoscaler-release-1.31, cluster-autoscaler-release-1.32, cluster-autoscaler-release-1.33 in new PRs and assign them to you.

In response to this:

/cherry-pick cluster-autoscaler-release-1.31
/cherry-pick cluster-autoscaler-release-1.32
/cherry-pick cluster-autoscaler-release-1.33

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

jackfrancis · 2025-07-28T14:19:01Z

/ok-to-test

aleksandra-malinowska · 2025-07-28T16:12:54Z

As discussed in other comments, klog.Fatalf() works as documented (at least with regard to exiting), so this PR doesn't change behaviour and doesn't fix anything. From personal experience I can confirm that I've observed CA crashing on losing leader election in prod many times.

It sounds like you've encountered some bug that led to creating this PR, but as it's unlikely that this specific line of the code is to blame for it, could you provide more details? Ideally, open a new issue and document the behaviour you're seeing.

If you still want to refactor the code to comply with the referenced guidelines of avoiding using Fatalf at all, please do the following:

Remove cherry-pick labels - cleanup is not something we want to cherry-pick.
Change PR kind from 'bug' to 'cleanup'.
Make it a mass-refactor - quick search shows 100+ usages of Fatalf in cluster-autoscaler/. Let's not spread it across 100+ PRs, since it's a purely mechanical refactor. I'm open to excluding cloud provider packages and leaving that to their maintainers, but even then, there are 10 places in main.go alone where Fatalf is used, and some more in cluster-autoscaler/core/ and other packages.

aleksandra-malinowska · 2025-07-28T16:15:42Z

/hold

mu-soliman · 2025-07-30T09:00:11Z

Closing this to submit a full PR that refactors all usages of klog.Fatal().

k8s-ci-robot added the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) Jul 16, 2025

k8s-ci-robot added the area/cluster-autoscaler label Jul 16, 2025

k8s-ci-robot requested review from aleksandra-malinowska and feiskyer July 16, 2025 09:17

jackfrancis requested changes Jul 26, 2025

Use exit instead.

e965990

k8s-ci-robot added the ok-to-test label (indicates a non-member PR verified by an org member that is safe to test) and removed the needs-ok-to-test label (indicates a PR that requires an org member to verify it is safe to test) Jul 28, 2025

k8s-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) Jul 28, 2025

mu-soliman closed this Jul 30, 2025
