Runtime log verbosity change #9351

@x13n

Which component are you using?:

/area cluster-autoscaler

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

Operating and debugging Cluster Autoscaler typically involves examining logs and metrics in retrospect, after a problem has occurred. Some information, however, is not accessible that way, since logging it all the time would generate a huge amount of logs. Examples include the contents of node templates or the specific pods being processed at any given moment. To address this, we have a separate mechanism called debugging snapshot (#4346), which allows accessing the information from a single main loop iteration at runtime. However, there are a few problems with this approach:

  • It has to be timed exactly to the problematic loop iteration. If a problem happens in every other loop iteration, you have only a 50% chance of capturing the data from a problematic one.
  • It uses a completely different format (a JSON file) than logs, so it cannot be easily analyzed along with logs from the same time period.
  • It is very painful to extend - for any data structure to be added to the snapshot, significant code changes are required (plumbing required to pass the object to the right place).
  • It is all or nothing - it doesn't allow collecting a specific piece of data - all the parts are collected, which can be problematic at large scale.

Describe the solution you'd like.:

All the problems above could be addressed with dynamic log verbosity settings. Instead of requesting a snapshot from a single iteration, one could temporarily request a log verbosity change over HTTP, similarly to how a CPU profile is collected:

  • An endpoint would be exposed, allowing klog verbosity settings to be temporarily modified.
  • The endpoint would allow tweaking both -v and -vmodule flags. The latter would give the ability to enable specific log lines with surgical precision if needed.
  • There would be a parameter specifying how long the modified logging settings would be in effect. We'd need some sane default if not specified.
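
To make the shape of such a request concrete, here is a minimal client-side sketch in Go. The endpoint path `/debug/log-verbosity`, the address, and the parameter names `v`, `vmodule` and `duration` are assumptions for illustration only; nothing like this exists in Cluster Autoscaler today. The one detail worth noticing is that a -vmodule value such as `pvc*=4` contains characters (`=`, `*`, and `,` for multiple patterns) that are safest to URL-encode.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// buildVerbosityURL builds the URL for a temporary verbosity change.
// The path in base and the parameter names are hypothetical.
func buildVerbosityURL(base string, v int, vmodule string, d time.Duration) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	q := url.Values{}
	q.Set("v", strconv.Itoa(v))
	if vmodule != "" {
		q.Set("vmodule", vmodule) // Encode escapes '=', '*' and ','
	}
	if d > 0 {
		q.Set("duration", d.String()) // omitted: the server would use its default
	}
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	u, err := buildVerbosityURL("http://localhost:8085/debug/log-verbosity", 5, "pvc*=4", time.Minute)
	if err != nil {
		panic(err)
	}
	fmt.Println(u)
}
```

Leaving out the duration relies on the server-side default mentioned in the last bullet.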

Once the proposed mechanism is available, we could sunset debugging snapshot, cleaning up the codebase a bit.

Describe any alternative solutions you've considered.:

Keep on building on top of debugging snapshot by adding new features there. This would be much harder to use and maintain than the proposed solution.

Additional context.

Naive, LLM-generated implementation:

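import (
	"flag"
	"fmt"
	"net/http"
	"time"

	"k8s.io/klog/v2"
)

// defaultBoostDuration applies when the request has no duration parameter.
// The value is an arbitrary placeholder, not a vetted default.
const defaultBoostDuration = 5 * time.Minute

func handleLogUpdate(w http.ResponseWriter, r *http.Request) {
	// 1. Capture current values to revert later. The lookups return nil
	// unless the klog flags were registered on flag.CommandLine.
	vFlag, vModuleFlag := flag.Lookup("v"), flag.Lookup("vmodule")
	if vFlag == nil || vModuleFlag == nil {
		http.Error(w, "klog flags are not registered", http.StatusInternalServerError)
		return
	}
	oldV := vFlag.Value.String()
	oldVModule := vModuleFlag.Value.String()

	// 2. Parse inputs (e.g., ?v=5&vmodule=pvc*=4&duration=1m)
	query := r.URL.Query()
	newV := query.Get("v")
	newVModule := query.Get("vmodule")
	duration := defaultBoostDuration
	if durationStr := query.Get("duration"); durationStr != "" {
		var err error
		duration, err = time.ParseDuration(durationStr)
		if err != nil || duration <= 0 {
			http.Error(w, "Invalid duration (e.g., 5m, 30s)", http.StatusBadRequest)
			return
		}
	}

	// 3. Apply new settings, rejecting values klog cannot parse
	if newV != "" {
		if err := vFlag.Value.Set(newV); err != nil {
			http.Error(w, fmt.Sprintf("Invalid v: %v", err), http.StatusBadRequest)
			return
		}
	}
	if newVModule != "" {
		if err := vModuleFlag.Value.Set(newVModule); err != nil {
			vFlag.Value.Set(oldV) // undo the -v change made above
			http.Error(w, fmt.Sprintf("Invalid vmodule: %v", err), http.StatusBadRequest)
			return
		}
	}

	// 4. Schedule the revert
	time.AfterFunc(duration, func() {
		klog.Infof("Reverting logs to v=%s vmodule=%s", oldV, oldVModule)
		vFlag.Value.Set(oldV)
		vModuleFlag.Value.Set(oldVModule)
	})

	fmt.Fprintf(w, "Logs boosted for %s. Will revert to v=%s, vmodule=%s", duration, oldV, oldVModule)
}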

A real implementation would likely be somewhat more complicated, since it has to correctly handle multiple overlapping calls to the endpoint.
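
One possible way to handle overlapping calls, sketched under assumptions: remember the baseline only when no override is active, keep a single revert timer, and tag each call with a generation number so that a timer that already fired cannot undo a newer request. The type `verbosityBooster` and its methods are hypothetical names. To stay free of external dependencies, the demo runs against a stand-in `flag.FlagSet` with plain string flags; with klog, the same code would operate on `flag.CommandLine` after `klog.InitFlags(nil)`.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"sync"
	"time"
)

// verbosityBooster applies temporary overrides to the "v" and "vmodule"
// flags and restores the pre-boost values once the latest override expires.
type verbosityBooster struct {
	mu                 sync.Mutex
	fs                 *flag.FlagSet
	timer              *time.Timer
	gen                int  // incremented on every successful Boost
	active             bool // true while an override is in effect
	baseV, baseVModule string
}

// Boost applies the non-empty values for duration d. Overlapping calls keep
// the original baseline and restart the revert timer.
func (b *verbosityBooster) Boost(v, vmodule string, d time.Duration) error {
	if d <= 0 {
		return errors.New("duration must be positive")
	}
	b.mu.Lock()
	defer b.mu.Unlock()
	vf, mf := b.fs.Lookup("v"), b.fs.Lookup("vmodule")
	if vf == nil || mf == nil {
		return errors.New("v/vmodule flags are not registered")
	}
	prevV := vf.Value.String()
	if !b.active {
		// Capture the baseline only when no override is in effect.
		b.baseV, b.baseVModule = prevV, mf.Value.String()
	}
	if v != "" {
		if err := vf.Value.Set(v); err != nil {
			return err
		}
	}
	if vmodule != "" {
		if err := mf.Value.Set(vmodule); err != nil {
			_ = vf.Value.Set(prevV) // leave settings as they were
			return err
		}
	}
	b.active = true
	if b.timer != nil {
		b.timer.Stop()
	}
	b.gen++
	gen := b.gen
	b.timer = time.AfterFunc(d, func() { b.revert(gen) })
	return nil
}

// revert restores the baseline unless a newer Boost superseded this timer.
func (b *verbosityBooster) revert(gen int) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !b.active || gen != b.gen {
		return
	}
	_ = b.fs.Lookup("v").Value.Set(b.baseV)
	_ = b.fs.Lookup("vmodule").Value.Set(b.baseVModule)
	b.active = false
}

// current returns the present values of both flags.
func (b *verbosityBooster) current() (string, string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.fs.Lookup("v").Value.String(), b.fs.Lookup("vmodule").Value.String()
}

// newDemoBooster uses string flags as a stand-in for the klog flags.
func newDemoBooster() *verbosityBooster {
	fs := flag.NewFlagSet("demo", flag.ContinueOnError)
	fs.String("v", "1", "stand-in for klog -v")
	fs.String("vmodule", "", "stand-in for klog -vmodule")
	return &verbosityBooster{fs: fs}
}

func main() {
	b := newDemoBooster()
	_ = b.Boost("5", "pvc*=4", 100*time.Millisecond)
	_ = b.Boost("7", "", 200*time.Millisecond) // restarts the timer, keeps the baseline
	fmt.Println(b.current())
	time.Sleep(300 * time.Millisecond)
	fmt.Println(b.current())
}
```

In this sketch the most recent call decides when the revert happens, so a second request can shorten the window as well as extend it. Whether that is the right policy, as opposed to taking the later of the two deadlines, is an open design question.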
