feat: Make metrics stale time configurable #1046

Open

nayihz wants to merge 2 commits into main from feat_metric_stale_time

Conversation

@nayihz (Contributor) commented Jun 23, 2025

fix: #336
changes ref: #336 (comment)

@k8s-ci-robot (Contributor)

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 23, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: nayihz
Once this PR has been reviewed and has the lgtm label, please assign arangogutierrez for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from ahg-g and robscott June 23, 2025 12:33
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 23, 2025

netlify bot commented Jun 23, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 8ca3b92
🔍 Latest deploy log: https://app.netlify.com/projects/gateway-api-inference-extension/deploys/6888342c5f7e3500082c060e
😎 Deploy Preview: https://deploy-preview-1046--gateway-api-inference-extension.netlify.app

@nayihz nayihz force-pushed the feat_metric_stale_time branch from 12f8bfe to 2d42a53 June 23, 2025 12:35
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 25, 2025
@nayihz nayihz force-pushed the feat_metric_stale_time branch from a339897 to 1005486 June 25, 2025 05:27
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 25, 2025
@nayihz nayihz force-pushed the feat_metric_stale_time branch from 1005486 to 9b1e7e2 June 25, 2025 05:28
@nayihz nayihz marked this pull request as ready for review June 25, 2025 09:22
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 25, 2025
@k8s-ci-robot k8s-ci-robot requested a review from danehans June 25, 2025 09:22
@nayihz nayihz force-pushed the feat_metric_stale_time branch from 9b1e7e2 to 12943a7 June 25, 2025 09:29
@nayihz (Contributor, Author) commented Jun 25, 2025

/cc @liu-cong

@k8s-ci-robot k8s-ci-robot requested a review from liu-cong June 25, 2025 09:48
@nayihz nayihz force-pushed the feat_metric_stale_time branch from 12943a7 to 518655c June 29, 2025 07:11
@nayihz nayihz force-pushed the feat_metric_stale_time branch from 518655c to e54be57 June 29, 2025 13:31
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 30, 2025
@nayihz (Contributor, Author) commented Jul 1, 2025

I found that it becomes very inconvenient to write unit tests after updating PodGetAll to PodGetAllWithFreshMetrics. But after reading the code in depth, I still couldn't come up with a good solution. Any ideas on this? @nirrozenbaum @liu-cong
https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/1046/files#diff-1b7741fc131b712835ea0040fe1dc86b62403c0b124f0d672ef8bfadb84d32d3R325-R328

https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/1046/files#diff-1b7741fc131b712835ea0040fe1dc86b62403c0b124f0d672ef8bfadb84d32d3R353

@nirrozenbaum (Contributor)

> I found that it becomes very inconvenient to write unit tests after updating PodGetAll to PodGetAllWithFreshMetrics. But after reading the code in depth, I still couldn't come up with a good solution. Any ideas on this? @nirrozenbaum @liu-cong
> https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/1046/files#diff-1b7741fc131b712835ea0040fe1dc86b62403c0b124f0d672ef8bfadb84d32d3R325-R328
> https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/1046/files#diff-1b7741fc131b712835ea0040fe1dc86b62403c0b124f0d672ef8bfadb84d32d3R353

@nayihz I don't want to nitpick too much, but to be honest I'm not sure why the interface change was required.
Today, before this PR, the datastore already has PodGetAll and PodList(predicate).
Couldn't we implement "get pods with fresh metrics" with PodList(predicate), where the predicate is a function that returns only pods with fresh metrics?

@nirrozenbaum (Contributor) commented Jul 1, 2025

I mean: leave the PodGetAll function as is, and use PodList with that predicate only in the specific places it's needed. Would that help?
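For illustration, a minimal sketch of this predicate-based approach, assuming the Datastore and backendmetrics types quoted elsewhere in this PR; the listFreshPods helper and the staleThreshold parameter are hypothetical, and imports of the project's datastore and backendmetrics packages are assumed:

	// Sketch only: keep PodGetAll unchanged and filter by freshness where a caller needs it.
	func listFreshPods(ds datastore.Datastore, staleThreshold time.Duration) []backendmetrics.PodMetrics {
		return ds.PodList(func(pm backendmetrics.PodMetrics) bool {
			// A pod's metrics count as fresh if they were refreshed within the threshold.
			return time.Since(pm.GetMetrics().UpdateTime) <= staleThreshold
		})
	}

	// Callers that don't care about staleness keep using ds.PodGetAll() as before.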

@nayihz nayihz force-pushed the feat_metric_stale_time branch 2 times, most recently from 6bde389 to bff4272 July 2, 2025 02:38
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 2, 2025
@nayihz (Contributor, Author) commented Jul 2, 2025

> Leave the PodGetAll function as is, and use PodList with that predicate only in the specific places it's needed.

Makes sense to me.

@nayihz nayihz force-pushed the feat_metric_stale_time branch 2 times, most recently from 1fbad64 to 75047b3 July 2, 2025 03:14
@nayihz (Contributor, Author) commented Jul 2, 2025

	if !found {
		return schedulingtypes.ToSchedulerPodMetrics(d.datastore.PodGetAll())
	}
	// Check if endpoint key is present in the subset map and ensure there is at least one value
	endpointSubsetList, found := subsetMap[subsetHintKey].([]any)
	if !found {
		return schedulingtypes.ToSchedulerPodMetrics(d.datastore.PodGetAll())

Will change d.datastore.PodGetAll to d.datastore.PodList(backendmetrics.FreshMetricsFn) in a follow-up, because that change requires refactoring the unit tests. @nirrozenbaum

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 10, 2025
@nayihz nayihz requested review from nirrozenbaum and liu-cong July 10, 2025 01:16
@kfswain (Collaborator) commented Jul 24, 2025

Heya @nayihz, do you mind rebasing? I can take a look tomorrow once this is up to date.

@nayihz nayihz force-pushed the feat_metric_stale_time branch from 75047b3 to dfbfce2 July 24, 2025 11:14
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 24, 2025
@nayihz nayihz force-pushed the feat_metric_stale_time branch from dfbfce2 to 32ddd96 July 24, 2025 11:20
@nayihz (Contributor, Author) commented Jul 24, 2025

Sorry for the delay here; I finally got some time to rebase this. PTAL.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 25, 2025
@liu-cong (Contributor)

@nayihz Thanks again for working on this!

My suggestion is to do this in 2 steps (which seems like you are already doing).

  1. Step 1 does the refactor and plumbing for the new config, but doesn't change the consumer logic yet (so consumers still get all pods regardless of the staleness of their metrics).
  2. Step 2 migrates consumers to the new logic (should be just a few lines of change). This change carries non-zero risk IMO; if things go wrong, rolling back is much easier this way.

Happy to chat on Slack or jump on a call :)

@nayihz nayihz force-pushed the feat_metric_stale_time branch from 32ddd96 to 8ca3b92 July 29, 2025 02:38
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 29, 2025
@nayihz (Contributor, Author) commented Jul 29, 2025

> @nayihz Thanks again for working on this!
>
> My suggestion is to do this in 2 steps (which seems like you are already doing).
>
>   1. Step 1 does the refactor and plumbing for the new config, but doesn't change the consumer logic yet (so consumers still get all pods regardless of the staleness of their metrics).
>   2. Step 2 migrates consumers to the new logic (should be just a few lines of change). This change carries non-zero risk IMO; if things go wrong, rolling back is much easier this way.
>
> Happy to chat on Slack or jump on a call :)

Do you mean implementing step 1 and step 2 in two separate commits?


// metrics related flags
refreshMetricsInterval = flag.Duration(
"refreshMetricsInterval",
Contributor

Please use refresh-metrics-interval to be consistent; the same applies to the other two metrics-related flags.
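For illustration only, a sketch of what the kebab-case registrations might look like; the second flag's exact name, the default values, and the usage strings here are assumptions, not what this PR currently contains:

	refreshMetricsInterval = flag.Duration(
		"refresh-metrics-interval",
		50*time.Millisecond, // illustrative default, not the project's actual value
		"interval to refresh metrics")
	metricsStalenessThreshold = flag.Duration(
		"metrics-staleness-threshold",
		config.DefaultMetricsStalenessThreshold,
		"Duration after which metrics are considered stale.")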

metricsStalenessThreshold = flag.Duration("metricsStalenessThreshold",
config.DefaultMetricsStalenessThreshold,
"Duration after which metrics are considered stale. This is used to determine if a pod's metrics "+
"are fresh enough to be used for scheduling decisions.")
Contributor

Suggested change:
- "are fresh enough to be used for scheduling decisions.")
+ "are fresh enough.")

Contributor

Metrics are not just limited to scheduling. I think just keeping it a bit broad is OK.

// are considered stale.
// The staleness is determined by the refresh interval plus the latency of the metrics API.
// To be on the safer side, we start with a larger threshold.
DefaultMetricsStalenessThreshold = 2 * time.Second // default for --metricsStalenessThreshold
Contributor

Suggested change:
- DefaultMetricsStalenessThreshold = 2 * time.Second // default for --metricsStalenessThreshold
+ DefaultMetricsStalenessThreshold = 2 * time.Second

Contributor

I suggest removing these comments; they can get outdated easily (e.g., when flag names change). There's no need for the comment since one can easily find the reference.

@@ -172,7 +173,9 @@ func diffStore(datastore datastore.Datastore, params diffStoreParams) string {
 		params.wantPods = []string{}
 	}
 	gotPods := []string{}
-	for _, pm := range datastore.PodGetAll() {
+	for _, pm := range datastore.PodList(func(backendmetrics.PodMetrics) bool {
Contributor

nit: Can you create two "constant" predicates in the datastore like so:

var AllPodPredicate = func(backendmetrics.PodMetrics) bool { return true }

var PodWithFreshMetrics = func(pm backendmetrics.PodMetrics) bool {
	return time.Since(pm.GetMetrics().UpdateTime) <= pm.GetMetricsStalenessThreshold()
}
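As a usage illustration only (these predicates are the reviewer's suggestion and do not exist in the PR yet; the datastore package qualifier and the call sites are assumptions):

	gotPods := ds.PodList(datastore.AllPodPredicate)        // e.g. test helpers like diffStore list every pod
	podMetrics := ds.PodList(datastore.PodWithFreshMetrics) // e.g. the metrics collector reports only fresh pods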

@@ -62,7 +63,7 @@ func (c *inferencePoolMetricsCollector) Collect(ch chan<- prometheus.Metric) {
 		return
 	}
 
-	podMetrics := c.ds.PodGetAll()
+	podMetrics := c.ds.PodList(backendmetrics.FreshMetricsFn)
Contributor

This is what I meant by the "2-step" approach. Here we would still get all pods, and in a separate PR (step 2) we would update the callers to only get fresh metrics. Given the non-zero risk of this change and the large number of files this PR touches, rolling back would be very challenging if we ever need to do that.

@@ -108,7 +102,7 @@ func NewDetector(config *Config, datastore Datastore, logger logr.Logger) *Detector {
 // (no capacity).
 func (d *Detector) IsSaturated(ctx context.Context) bool {
 	logger := log.FromContext(ctx).WithName(loggerName)
-	allPodsMetrics := d.datastore.PodGetAll()
+	allPodsMetrics := d.datastore.PodList(backendmetrics.FreshMetricsFn)
Contributor

I suggest not changing this for now and adding a TODO to change it in a separate PR.

@@ -80,19 +83,21 @@ const (
DefaultCertPath = "" // default for --cert-path
DefaultConfigFile = "" // default for --config-file
DefaultConfigText = "" // default for --config-text
DefaultMetricsStalenessThreshold = 200 * time.Millisecond // default for --metricsStalenessThreshold
Contributor

Shall we remove all these constants as they are now in the common/config package?

pmc PodMetricsClient
ds Datastore
interval time.Duration
stalenessThreshold time.Duration
Contributor

I don't think the stalenessThreshold should be a property of the podMetrics. Here's my thought:

  • podMetrics is a low-level implementation that refreshes the metrics. It shouldn't decide whether a metric is fresh or not, and in the current implementation it doesn't use the threshold at all except in the Get method.
  • Ultimately it's up to the caller to decide whether to use a metric that is stale. For example, the GetRandomPod method may not care about stale metrics.
  • We provide helper methods in the datastore that force the caller to consider metrics staleness (as it's not always obvious), so perhaps having this property in datastore.go makes more sense; see the sketch after this list.
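A rough sketch of that idea, purely for illustration; the metricsStalenessThreshold field and the PodListFresh helper below are hypothetical names, not part of this PR:

	type datastore struct {
		// ...existing fields...
		metricsStalenessThreshold time.Duration // configured via the new flag
	}

	// PodListFresh forces callers to think about staleness: it returns only pods
	// whose metrics were refreshed within the configured threshold.
	func (d *datastore) PodListFresh() []backendmetrics.PodMetrics {
		return d.PodList(func(pm backendmetrics.PodMetrics) bool {
			return time.Since(pm.GetMetrics().UpdateTime) <= d.metricsStalenessThreshold
		})
	}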

@@ -63,7 +63,7 @@ type Datastore interface {
 	ModelGetAll() []*v1alpha2.InferenceModel
 
 	// PodMetrics operations
-	// PodGetAll returns all pods and metrics, including fresh and stale.
+	// PodGetAll returns all pods with stale and fresh metrics, only for testing.
Contributor

We shouldn't keep a method just for testing (if nothing but tests calls it, there's nothing worth testing). I think we can either:

  1. Keep PodGetAll in this PR, keep the existing calls to this method, and clean up in a separate PR; OR
  2. Replace PodGetAll with PodList(AllPodPredicate)

},
},
storePods: []*corev1.Pod{pod1, pod2, pod3},
want: []*backendmetrics.MetricsState{pod1Metrics, pod2Metrics}, // pod3 metrics were stale and should not be included.
Contributor

I don't think the test setup is correct. This test case simply creates a pod metrics client with only pod1 and pod3, and as a result pod3 is not included, but not because its metrics are stale. To correctly test the behavior, I would expect backendmetrics.MetricsState objects to be created with different update timestamps.
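For instance, a hedged sketch of such a setup; the exact MetricsState construction and the timestamp offsets are assumptions based on the types quoted in this PR, not the final test code:

	now := time.Now()
	pod1Metrics := &backendmetrics.MetricsState{UpdateTime: now}                             // fresh
	pod2Metrics := &backendmetrics.MetricsState{UpdateTime: now.Add(-50 * time.Millisecond)} // fresh, still within the threshold
	pod3Metrics := &backendmetrics.MetricsState{UpdateTime: now.Add(-5 * time.Second)}       // older than the staleness threshold
	// With all three pods registered, listing pods with fresh metrics should return
	// only pod1 and pod2, because pod3's UpdateTime is beyond the threshold.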

Labels
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Make metrics stale time configurable
5 participants