Skip to content

Conversation

mu-soliman
Copy link
Contributor

@mu-soliman mu-soliman commented Jul 16, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

In case of flapping failures that affect the leading pod like flapping network, we noticed that the same pod tried to reaquire the lock after losing it instead of allowing another healthy pod unaffected by the failure to acquire the lock and become leader.

This pull request changes the behaviour to work accroding to sig-instrumentation guidelines of avoiding using fatal and in accordance with other k8s components, like kube scheduler in leader election configuration, where if the leader pod stopped leading it exits the process and let another pod take leadership while k8s creates a new pod to replace the old leader.

Does this PR introduce a user-facing change?

NONE

if the leader pod stopped leading it exits the process and let another pod take leadership while k8s creates a new pod to replace the old leader.

In case of flapping failures that affected the leading pod like flapping network, we noticed that the same pod tried to reaquire the lock after losing it instead of allowing another helthy pod unaffected by the failure to aquire the lock and become leader.

This pull request changes the behaviour to work in accordance with other k8s components in leader election configuration, where if the leader stopped leading it exits the process and let another pod take leadership while k8s creates a new pod to replace the old leader.
@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area labels Jul 16, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mu-soliman
Once this PR has been reviewed and has the lgtm label, please assign feiskyer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 16, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @mu-soliman. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/needs-area do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jul 16, 2025
OnStoppedLeading: func() {
klog.Fatalf("lost master")
klog.Fatalf("lost master. Shutting down.")
klog.FlushAndExit(klog.ExitFlushTimeout, 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look to me like this is meant to be used in addition to klog.Fatalf:

// FlushAndExit flushes log data for a certain amount of time and then calls
// os.Exit. Combined with some logging call it provides a replacement for
// traditional calls like Fatal or Exit.

In fact this is unreachable code as klog.Fatalf has the outcome of calling OsExit(255).

// Fatalf logs to the FATAL, ERROR, WARNING, and INFO logs,
// including a stack trace of all running goroutines, then calls OsExit(255).
// Arguments are handled in the manner of fmt.Printf; a newline is appended if missing.
func Fatalf(format string, args ...interface{}) {
	logging.printf(severity.FatalLog, logging.logger, logging.filter, format, args...)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the current behaviour of klog.fatalf(), even if the docs say so. If you follow the code, logging.printf() has the following implementation:
func (l *loggingT) printf(s severity.Severity, logger *logWriter, filter LogFilter, format string, args ...interface{}) { l.printfDepth(s, logger, filter, 1, format, args...) }

and printfDepth has the following implementation:

`func (l *loggingT) printfDepth(s severity.Severity, logger *logWriter, filter LogFilter, depth int, format string, args ...interface{}) {
if false {
_ = fmt.Sprintf(format, args...) // cause vet to treat this function like fmt.Printf
}

buf, file, line := l.header(s, depth)
// If a logger is set and doesn't support writing a formatted buffer,
// we clear the generated header as we rely on the backing
// logger implementation to print headers.
if logger != nil && logger.writeKlogBuffer == nil {
	buffer.PutBuffer(buf)
	buf = buffer.GetBuffer()
}
if filter != nil {
	format, args = filter.FilterF(format, args)
}
fmt.Fprintf(buf, format, args...)
if buf.Bytes()[buf.Len()-1] != '\n' {
	buf.WriteByte('\n')
}
l.output(s, logger, buf, depth, file, line, false)

}`

This behavior implements the sig-instrumentation guidelines of not using fatal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to using klog.Error() for better clarity and conformity with the guidelines.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if it matters, but klog.Fatal does exit, see: https://github.com/kubernetes/klog/blob/e7125f792ea66a85818cfb45261c9e1acc585344/klog.go#L936-L964

Here's a test:

package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(nil)
	flag.Parse()

	defer klog.Flush()

	klog.Info("About to encounter a fatal error...")
	klog.Fatalf("This is a fatal error - the application will exit here!")

	// This line will never be reached
	klog.Info("This message will never be printed")
}
$ go run main.go
I0728 16:20:33.182323   99101 main.go:15] About to encounter a fatal error...
F0728 16:20:33.182456   99101 main.go:16] This is a fatal error - the application will exit here!
exit status 255

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mu-soliman As explained above, klog.Fatalf does exit: https://github.com/kubernetes/klog/blob/main/klog.go#L936

@jackfrancis
Copy link
Contributor

/cherry-pick cluster-autoscaler-release-1.31
/cherry-pick cluster-autoscaler-release-1.32
/cherry-pick cluster-autoscaler-release-1.33

@k8s-infra-cherrypick-robot

@jackfrancis: once the present PR merges, I will cherry-pick it on top of cluster-autoscaler-release-1.31, cluster-autoscaler-release-1.32, cluster-autoscaler-release-1.33 in new PRs and assign them to you.

In response to this:

/cherry-pick cluster-autoscaler-release-1.31
/cherry-pick cluster-autoscaler-release-1.32
/cherry-pick cluster-autoscaler-release-1.33

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jackfrancis
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 28, 2025
@aleksandra-malinowska
Copy link
Contributor

As discussed in other comments, klog.Fatalf() works as documented (at least with regard to exiting), so this PR doesn't change behaviour and doesn't fix anything. From personal experience I can confirm that I've observed CA crashing on losing leader election in prod many times.

It sounds like you've encountered some bug that led to creating this PR, but as it's unlikely that this specific line of the code is to blame for it, could you provide more details? Ideally, open a new issue and document the behaviour you're seeing.

If you still want to refactor the code to comply with the referenced guidelines of avoiding using Fatalf at all, please do the following:

  1. Remove cherry-pick labels - cleanup is not something we want to cherry-pick.
  2. Change PR kind from 'bug' to 'cleanup'.
  3. Make it a mass-refactor - quick search shows 100+ usages of Fatalf in cluster-autoscaler/. Let's not spread it across 100+ PRs, since it's a purely mechanical refactor. I'm open to excluding cloud provider packages and leaving that to their maintainers, but even then, there are 10 places in main.go alone where Fatalf used, and some more in cluster-autoscaler/core/ and other packages.

@aleksandra-malinowska
Copy link
Contributor

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 28, 2025
@mu-soliman
Copy link
Contributor Author

closing it to submit a full PR to refactor all usages of klog.Fatal()

@mu-soliman mu-soliman closed this Jul 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants