-
Notifications
You must be signed in to change notification settings - Fork 129
feat: Make metrics stale time configurable #1046
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Skipping CI for Draft Pull Request. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: nayihz The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
✅ Deploy Preview for gateway-api-inference-extension ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
12f8bfe
to
2d42a53
Compare
a339897
to
1005486
Compare
1005486
to
9b1e7e2
Compare
9b1e7e2
to
12943a7
Compare
/cc @liu-cong |
12943a7
to
518655c
Compare
518655c
to
e54be57
Compare
I found that it becomes very inconvenient to write unit tests after updating |
@nayihz I don’t want to nitpick too much, but to be honest I’m not sure why the interface change was required. |
I mean - to leave PodGetAll function as is.. and use the ListPod with that predicate only in the specific places it’s needed. would that help? |
6bde389
to
bff4272
Compare
make sense to me. |
1fbad64
to
75047b3
Compare
gateway-api-inference-extension/pkg/epp/requestcontrol/director.go Lines 197 to 204 in 2ae867d
Will change d.datastore.PodGetAll to d.datastore.PodList(backendmetrics.FreshMetricsFn) in a follow up because we should refractor the unit test. @nirrozenbaum
|
Heya @nayihz do you mind rebasing? I can take a look tomorrow once this is up to date |
75047b3
to
dfbfce2
Compare
dfbfce2
to
32ddd96
Compare
Sorry for the delay here, finally got some time to rebase this, PTAL. |
@nayihz Thanks again for working on this! My suggestion is to do this in 2 steps (which seems like you are already doing).
Happy to chat on Slack or jump on a call :) |
32ddd96
to
8ca3b92
Compare
Do you mean that implement the step1 and step2 in two separate commits? |
|
||
// metrics related flags | ||
refreshMetricsInterval = flag.Duration( | ||
"refreshMetricsInterval", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls use refresh-metrics-interval
to be consistent, same for the other two metrics related flags.
metricsStalenessThreshold = flag.Duration("metricsStalenessThreshold", | ||
config.DefaultMetricsStalenessThreshold, | ||
"Duration after which metrics are considered stale. This is used to determine if a pod's metrics "+ | ||
"are fresh enough to be used for scheduling decisions.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"are fresh enough to be used for scheduling decisions.") | |
"are fresh enough.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metrics are not just limited to scheduling. I think just keeping it a bit broad is OK.
// are considered stale. | ||
// The staleness is determined by the refresh internal plus the latency of the metrics API. | ||
// To be on the safer side, we start with a larger threshold. | ||
DefaultMetricsStalenessThreshold = 2 * time.Second // default for --metricsStalenessThreshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DefaultMetricsStalenessThreshold = 2 * time.Second // default for --metricsStalenessThreshold | |
DefaultMetricsStalenessThreshold = 2 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest removing these comments, they can get out dated easily (e.g., change of flag names). No need to comment as one can easily find the reference
@@ -172,7 +173,9 @@ func diffStore(datastore datastore.Datastore, params diffStoreParams) string { | |||
params.wantPods = []string{} | |||
} | |||
gotPods := []string{} | |||
for _, pm := range datastore.PodGetAll() { | |||
for _, pm := range datastore.PodList(func(backendmetrics.PodMetrics) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Can you create two "constant" predicates in the datastore like so:
var AllPodPredicate = func(backendmetrics.PodMetrics) bool { return true}
var PodWithFreshMetrics = func(pm backendmetrics.PodMetrics) bool { return time.Since(pm.GetMetrics().UpdateTime) <= pm.GetMetricsStalenessThreshold()}
@@ -62,7 +63,7 @@ func (c *inferencePoolMetricsCollector) Collect(ch chan<- prometheus.Metric) { | |||
return | |||
} | |||
|
|||
podMetrics := c.ds.PodGetAll() | |||
podMetrics := c.ds.PodList(backendmetrics.FreshMetricsFn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This what I meant by the "2 step" approach. Here we will still get all pods, and in a separate PR (step2) we update the callers to only get fresh metrics. Given the non-zero risk of this change, and the large number of files this PR touched, rolling back will be very challenging if we ever need to do that.
@@ -108,7 +102,7 @@ func NewDetector(config *Config, datastore Datastore, logger logr.Logger) *Detec | |||
// (no capacity). | |||
func (d *Detector) IsSaturated(ctx context.Context) bool { | |||
logger := log.FromContext(ctx).WithName(loggerName) | |||
allPodsMetrics := d.datastore.PodGetAll() | |||
allPodsMetrics := d.datastore.PodList(backendmetrics.FreshMetricsFn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest not changing this for now and add a TODO to change this in a separate PR
@@ -80,19 +83,21 @@ const ( | |||
DefaultCertPath = "" // default for --cert-path | |||
DefaultConfigFile = "" // default for --config-file | |||
DefaultConfigText = "" // default for --config-text | |||
DefaultMetricsStalenessThreshold = 200 * time.Millisecond // default for --metricsStalenessThreshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we remove all these constants as they are now in the common/config package?
pmc PodMetricsClient | ||
ds Datastore | ||
interval time.Duration | ||
stalenessThreshold time.Duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the stalenessThreshold should be a property of the podMetrics
. Here's my thought:
podMetrics
is a low level implementation that refreshes the metrics. It shouldn't make the decision whether a metrics is fresh or not. And in the current implementation, it's not using the threshold at all except theGet
method.- Ultimately it's the caller to decide whether to use a metrics if it's stale. For example, in the
GetRandomPod
method it may not care about the stale metrics. - We provide helper methods in the
datastore
that forces the caller to consider the metrics staleness (as it's not always obvious), so perhaps having this property in thedatastore.go
makes more sense.
@@ -63,7 +63,7 @@ type Datastore interface { | |||
ModelGetAll() []*v1alpha2.InferenceModel | |||
|
|||
// PodMetrics operations | |||
// PodGetAll returns all pods and metrics, including fresh and stale. | |||
// PodGetAll returns all pods with stale and fresh metrics, only for testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't keep a method just for testing (that means we don't need to test it if no one is calling it). I think we can either:
- Keep PodGetAll in this PR, and keep the existing calls to this method, and clean up in a separate PR; OR
- Replace PodGetAll with
PodList(AllPodPredicate)
}, | ||
}, | ||
storePods: []*corev1.Pod{pod1, pod2, pod3}, | ||
want: []*backendmetrics.MetricsState{pod1Metrics, pod2Metrics}, // pod3 metrics were stale and should not be included. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the test setup is correct. This test case simply creates a pod metric client with only pod1 and pod3, and as a result pod3 is not included, but not because its metrics is stale. To correctly test the behavior, I would expect a backendmetrics.MetricsState
object is created with different update timestamps.
fix: #336
changes ref: #336 (comment)