feat: Make metrics stale time configurable #1046

Open

nayihz wants to merge 2 commits into main from feat_metric_stale_time

Conversation

@nayihz (Contributor) commented Jun 23, 2025

fix: #336
changes ref: #336 (comment)

@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 23, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nayihz
Once this PR has been reviewed and has the lgtm label, please assign arangogutierrez for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from ahg-g and robscott June 23, 2025 12:33
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 23, 2025

netlify bot commented Jun 23, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 8ca3b92
🔍 Latest deploy log: https://app.netlify.com/projects/gateway-api-inference-extension/deploys/6888342c5f7e3500082c060e
😎 Deploy Preview: https://deploy-preview-1046--gateway-api-inference-extension.netlify.app

@nayihz nayihz force-pushed the feat_metric_stale_time branch from 12f8bfe to 2d42a53 June 23, 2025 12:35
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 25, 2025
@nayihz nayihz force-pushed the feat_metric_stale_time branch from a339897 to 1005486 June 25, 2025 05:27
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 25, 2025
@nayihz nayihz force-pushed the feat_metric_stale_time branch from 1005486 to 9b1e7e2 June 25, 2025 05:28
@nayihz nayihz marked this pull request as ready for review June 25, 2025 09:22
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 25, 2025
@k8s-ci-robot k8s-ci-robot requested a review from danehans June 25, 2025 09:22
@nayihz nayihz force-pushed the feat_metric_stale_time branch from 9b1e7e2 to 12943a7 June 25, 2025 09:29
@nayihz (Contributor, Author) commented Jun 25, 2025

/cc @liu-cong

@k8s-ci-robot k8s-ci-robot requested a review from liu-cong June 25, 2025 09:48
@nayihz nayihz force-pushed the feat_metric_stale_time branch from 12943a7 to 518655c June 29, 2025 07:11
@nayihz nayihz force-pushed the feat_metric_stale_time branch from 518655c to e54be57 June 29, 2025 13:31
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 30, 2025
@nayihz (Contributor, Author) commented Jul 1, 2025

I found that it becomes very inconvenient to write unit tests after updating PodGetAll to PodGetAllWithFreshMetrics. But after reading the code in depth, I still couldn't come up with a good solution. Any ideas on this? @nirrozenbaum @liu-cong
https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/1046/files#diff-1b7741fc131b712835ea0040fe1dc86b62403c0b124f0d672ef8bfadb84d32d3R325-R328

https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/1046/files#diff-1b7741fc131b712835ea0040fe1dc86b62403c0b124f0d672ef8bfadb84d32d3R353

@nirrozenbaum (Contributor)

> I found that it becomes very inconvenient to write unit tests after updating PodGetAll to PodGetAllWithFreshMetrics. But after reading the code in depth, I still couldn't come up with a good solution. Any ideas on this? @nirrozenbaum @liu-cong
> https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/1046/files#diff-1b7741fc131b712835ea0040fe1dc86b62403c0b124f0d672ef8bfadb84d32d3R325-R328
> https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/1046/files#diff-1b7741fc131b712835ea0040fe1dc86b62403c0b124f0d672ef8bfadb84d32d3R353

@nayihz I don't want to nitpick too much, but to be honest I'm not sure why the interface change was required.
Today, before this PR, the datastore already has PodGetAll and PodList(predicate).
Couldn't we implement "get pods with fresh metrics" with PodList(predicate), where the predicate is a function that returns only pods with fresh metrics?

@nirrozenbaum (Contributor) commented Jul 1, 2025

I mean: leave the PodGetAll function as is, and use PodList with that predicate only in the specific places it's needed. Would that help?
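For illustration, a minimal sketch of this predicate-based approach, assuming the Datastore and backendmetrics types quoted elsewhere in this PR; the listFreshPods helper and the staleThreshold parameter are hypothetical, and imports of the project's datastore and backendmetrics packages are assumed:

	// Sketch only: keep PodGetAll unchanged and filter by freshness where a caller needs it.
	func listFreshPods(ds datastore.Datastore, staleThreshold time.Duration) []backendmetrics.PodMetrics {
		return ds.PodList(func(pm backendmetrics.PodMetrics) bool {
			// A pod's metrics count as fresh if they were refreshed within the threshold.
			return time.Since(pm.GetMetrics().UpdateTime) <= staleThreshold
		})
	}

	// Callers that don't care about staleness keep using ds.PodGetAll() as before.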

@nayihz nayihz force-pushed the feat_metric_stale_time branch 2 times, most recently from 6bde389 to bff4272 July 2, 2025 02:38
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 2, 2025
@nayihz (Contributor, Author) commented Jul 2, 2025

> Leave the PodGetAll function as is, and use PodList with that predicate only in the specific places it's needed.

Makes sense to me.

@nayihz nayihz force-pushed the feat_metric_stale_time branch 2 times, most recently from 1fbad64 to 75047b3 July 2, 2025 03:14
@nayihz (Contributor, Author) commented Jul 2, 2025

	if !found {
		return schedulingtypes.ToSchedulerPodMetrics(d.datastore.PodGetAll())
	}
	// Check if endpoint key is present in the subset map and ensure there is at least one value
	endpointSubsetList, found := subsetMap[subsetHintKey].([]any)
	if !found {
		return schedulingtypes.ToSchedulerPodMetrics(d.datastore.PodGetAll())

Will change d.datastore.PodGetAll to d.datastore.PodList(backendmetrics.FreshMetricsFn) in a follow-up, because that change requires refactoring the unit tests. @nirrozenbaum

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 10, 2025
@nayihz nayihz requested review from nirrozenbaum and liu-cong July 10, 2025 01:16
@kfswain (Collaborator) commented Jul 24, 2025

Heya @nayihz, do you mind rebasing? I can take a look tomorrow once this is up to date.

@nayihz nayihz force-pushed the feat_metric_stale_time branch from 75047b3 to dfbfce2 July 24, 2025 11:14
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 24, 2025
@nayihz nayihz force-pushed the feat_metric_stale_time branch from dfbfce2 to 32ddd96 July 24, 2025 11:20
@nayihz (Contributor, Author) commented Jul 24, 2025

Sorry for the delay here; I finally got some time to rebase this. PTAL.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 25, 2025
@liu-cong (Contributor)

@nayihz Thanks again for working on this!

My suggestion is to do this in 2 steps (which seems like you are already doing).

  1. Step 1 does the refactor and plumbing for the new config, but doesn't change the consumer logic yet (so consumers still get all pods regardless of the staleness of their metrics).
  2. Step 2 migrates consumers to the new logic (should be just a few lines of change). This change carries non-zero risk IMO; if things go wrong, rolling back is much easier this way.

Happy to chat on Slack or jump on a call :)

@nayihz nayihz force-pushed the feat_metric_stale_time branch from 32ddd96 to 8ca3b92 July 29, 2025 02:38
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 29, 2025
@nayihz (Contributor, Author) commented Jul 29, 2025

> @nayihz Thanks again for working on this!
>
> My suggestion is to do this in 2 steps (which seems like you are already doing).
>
>   1. Step 1 does the refactor and plumbing for the new config, but doesn't change the consumer logic yet (so consumers still get all pods regardless of the staleness of their metrics).
>   2. Step 2 migrates consumers to the new logic (should be just a few lines of change). This change carries non-zero risk IMO; if things go wrong, rolling back is much easier this way.
>
> Happy to chat on Slack or jump on a call :)

Do you mean implementing step 1 and step 2 in two separate commits?


// metrics related flags
refreshMetricsInterval = flag.Duration(
"refreshMetricsInterval",
Contributor

Please use refresh-metrics-interval to be consistent; the same applies to the other two metrics-related flags.
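For illustration only, a sketch of what the kebab-case registrations might look like; the second flag's exact name, the default values, and the usage strings here are assumptions, not what this PR currently contains:

	refreshMetricsInterval = flag.Duration(
		"refresh-metrics-interval",
		50*time.Millisecond, // illustrative default, not the project's actual value
		"interval to refresh metrics")
	metricsStalenessThreshold = flag.Duration(
		"metrics-staleness-threshold",
		config.DefaultMetricsStalenessThreshold,
		"Duration after which metrics are considered stale.")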

metricsStalenessThreshold = flag.Duration("metricsStalenessThreshold",
config.DefaultMetricsStalenessThreshold,
"Duration after which metrics are considered stale. This is used to determine if a pod's metrics "+
"are fresh enough to be used for scheduling decisions.")
Contributor

Suggested change:
- "are fresh enough to be used for scheduling decisions.")
+ "are fresh enough.")

Contributor

Metrics are not just limited to scheduling. I think just keeping it a bit broad is OK.

// are considered stale.
// The staleness is determined by the refresh interval plus the latency of the metrics API.
// To be on the safer side, we start with a larger threshold.
DefaultMetricsStalenessThreshold = 2 * time.Second // default for --metricsStalenessThreshold
Contributor

Suggested change:
- DefaultMetricsStalenessThreshold = 2 * time.Second // default for --metricsStalenessThreshold
+ DefaultMetricsStalenessThreshold = 2 * time.Second

Contributor

I suggest removing these comments; they can get outdated easily (e.g., when flag names change). There's no need for the comment since one can easily find the reference.

@@ -172,7 +173,9 @@ func diffStore(datastore datastore.Datastore, params diffStoreParams) string {
 		params.wantPods = []string{}
 	}
 	gotPods := []string{}
-	for _, pm := range datastore.PodGetAll() {
+	for _, pm := range datastore.PodList(func(backendmetrics.PodMetrics) bool {
Contributor

nit: Can you create two "constant" predicates in the datastore like so:

var AllPodPredicate = func(backendmetrics.PodMetrics) bool { return true }

var PodWithFreshMetrics = func(pm backendmetrics.PodMetrics) bool {
	return time.Since(pm.GetMetrics().UpdateTime) <= pm.GetMetricsStalenessThreshold()
}
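As a usage illustration only (these predicates are the reviewer's suggestion and do not exist in the PR yet; the datastore package qualifier and the call sites are assumptions):

	gotPods := ds.PodList(datastore.AllPodPredicate)        // e.g. test helpers like diffStore list every pod
	podMetrics := ds.PodList(datastore.PodWithFreshMetrics) // e.g. the metrics collector reports only fresh pods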

@@ -62,7 +63,7 @@ func (c *inferencePoolMetricsCollector) Collect(ch chan<- prometheus.Metric) {
 		return
 	}
 
-	podMetrics := c.ds.PodGetAll()
+	podMetrics := c.ds.PodList(backendmetrics.FreshMetricsFn)
Contributor

This is what I meant by the "2-step" approach. Here we would still get all pods, and in a separate PR (step 2) we would update the callers to only get fresh metrics. Given the non-zero risk of this change and the large number of files this PR touches, rolling back would be very challenging if we ever need to do that.

@@ -108,7 +102,7 @@ func NewDetector(config *Config, datastore Datastore, logger logr.Logger) *Detector {
 // (no capacity).
 func (d *Detector) IsSaturated(ctx context.Context) bool {
 	logger := log.FromContext(ctx).WithName(loggerName)
-	allPodsMetrics := d.datastore.PodGetAll()
+	allPodsMetrics := d.datastore.PodList(backendmetrics.FreshMetricsFn)
Contributor

I suggest not changing this for now and adding a TODO to change it in a separate PR.

@@ -80,19 +83,21 @@ const (
DefaultCertPath = "" // default for --cert-path
DefaultConfigFile = "" // default for --config-file
DefaultConfigText = "" // default for --config-text
DefaultMetricsStalenessThreshold = 200 * time.Millisecond // default for --metricsStalenessThreshold
Contributor

Shall we remove all these constants as they are now in the common/config package?

pmc PodMetricsClient
ds Datastore
interval time.Duration
stalenessThreshold time.Duration
Contributor

I don't think the stalenessThreshold should be a property of the podMetrics. Here's my thought:

  • podMetrics is a low-level implementation that refreshes the metrics. It shouldn't decide whether a metric is fresh or not, and in the current implementation it doesn't use the threshold at all except in the Get method.
  • Ultimately it's up to the caller to decide whether to use a metric that is stale. For example, the GetRandomPod method may not care about stale metrics.
  • We provide helper methods in the datastore that force the caller to consider metrics staleness (as it's not always obvious), so perhaps having this property in datastore.go makes more sense; see the sketch after this list.
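A rough sketch of that idea, purely for illustration; the metricsStalenessThreshold field and the PodListFresh helper below are hypothetical names, not part of this PR:

	type datastore struct {
		// ...existing fields...
		metricsStalenessThreshold time.Duration // configured via the new flag
	}

	// PodListFresh forces callers to think about staleness: it returns only pods
	// whose metrics were refreshed within the configured threshold.
	func (d *datastore) PodListFresh() []backendmetrics.PodMetrics {
		return d.PodList(func(pm backendmetrics.PodMetrics) bool {
			return time.Since(pm.GetMetrics().UpdateTime) <= d.metricsStalenessThreshold
		})
	}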

@@ -63,7 +63,7 @@ type Datastore interface {
 	ModelGetAll() []*v1alpha2.InferenceModel
 
 	// PodMetrics operations
-	// PodGetAll returns all pods and metrics, including fresh and stale.
+	// PodGetAll returns all pods with stale and fresh metrics, only for testing.
Contributor

We shouldn't keep a method just for testing (if nothing but tests calls it, there's nothing worth testing). I think we can either:

  1. Keep PodGetAll in this PR, keep the existing calls to this method, and clean up in a separate PR; OR
  2. Replace PodGetAll with PodList(AllPodPredicate)

},
},
storePods: []*corev1.Pod{pod1, pod2, pod3},
want: []*backendmetrics.MetricsState{pod1Metrics, pod2Metrics}, // pod3 metrics were stale and should not be included.
Contributor

I don't think the test setup is correct. This test case simply creates a pod metrics client with only pod1 and pod3, and as a result pod3 is not included, but not because its metrics are stale. To correctly test the behavior, I would expect backendmetrics.MetricsState objects to be created with different update timestamps.
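For instance, a hedged sketch of such a setup; the exact MetricsState construction and the timestamp offsets are assumptions based on the types quoted in this PR, not the final test code:

	now := time.Now()
	pod1Metrics := &backendmetrics.MetricsState{UpdateTime: now}                             // fresh
	pod2Metrics := &backendmetrics.MetricsState{UpdateTime: now.Add(-50 * time.Millisecond)} // fresh, still within the threshold
	pod3Metrics := &backendmetrics.MetricsState{UpdateTime: now.Add(-5 * time.Second)}       // older than the staleness threshold
	// With all three pods registered, listing pods with fresh metrics should return
	// only pod1 and pod2, because pod3's UpdateTime is beyond the threshold.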

Labels
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Make metrics stale time configurable
5 participants