ruler: add Optional Failure Classifier for Rule Evaluation #988
Conversation
rules/group.go (outdated):

```go
}

// DefaultEvaluationFailureClassifierFunc is the default implementation of
// EvaluationFailureClassifierFunc that classifies no errors as operator-controllable.
```
It's a little confusing that this comment frames tracking operator-controllable errors as the intended use case, but then that intention isn't really propagated in naming or any other way (in the end, an operator needs to read this specific code comment to understand the purpose of `EvaluationFailureClassifierFunc`).

I think it might be better to either:

- remove the mention here and discuss the concept only in the PR description
- discuss "keeping track of operator-controllable errors" in code as an example of what an operator might use this function for, rather than framing it as `EvaluationFailureClassifierFunc`'s intended purpose
- make the naming of the metric/classifier func more explicit

I have a medium-strong preference for the second point over the others.
+1 to what @chencs said. My 2c is to keep this simple and go with the third option. We're the only clients of this and have a clear use case; no need to make this overly flexible. In reality there will be exactly one implementation of `EvaluationFailureClassifierFunc`.

If you decide to go with option 2, then maybe the classifier can return a label value for the metric, e.g. `client` or `server`.
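For illustration, option 2 combined with the label-value idea above might look like this minimal sketch (the string-returning signature is the suggestion from this comment, not the PR's actual code):

```go
// EvaluationFailureClassifierFunc maps a rule evaluation error to a
// metric label value such as "client" or "server".
// Sketch of option 2 only; the PR's real signature may differ.
type EvaluationFailureClassifierFunc func(err error) string
```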
I went with the 3rd option by making the name more explicit. Tried to keep it simple. Let me know what you think.
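Judging by the PR summary below, the more explicit naming looks roughly like the following; a minimal sketch, assuming the method name (only the interface name appears in the summary):

```go
// OperatorControllableErrorClassifier reports whether a rule evaluation
// failure is operator-controllable (e.g. a timeout or server error)
// rather than a user error in the rule itself.
// The interface name comes from the PR summary; the method is assumed.
type OperatorControllableErrorClassifier interface {
	IsOperatorControllable(err error) bool
}
```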
rules/group.go (outdated):

```go
EvalFilteredFailures: prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: namespace,
		Name:      "rule_evaluation_filtered_failures_total",
		Help:      "The total number of rule evaluation failures classified by the failure classifier function.",
	},
	[]string{"rule_group"},
),
```
When you have two metrics, it's never obvious how they relate to each other when you write the queries. Does `rule_evaluation_failures_total` include `rule_evaluation_filtered_failures_total` or not? Is `rule_evaluation_filtered_failures_total` a subset of `rule_evaluation_failures_total` or not?

An easier way would be to add a label to `rule_evaluation_failures_total`:

```
rule_evaluation_failures_total{class="client", rule_group="..."}
rule_evaluation_failures_total{class="server", rule_group="..."}
```

(`class` or `type` or something else)
Hmm. I struggled to find a good label that would make me change the implementation from a separate metric to a single metric with two label values. I didn't want to go with the client/server separation because we might include client errors (4xx) as operator-controllable. I thought about `operatorControllable` with values true or false 🤔, but that's still not good enough in my opinion. I am open to discussing this.
You can go with `class="internal|validation"`, `internal="true|false"`, `reason="operator|user"`, or `cause="operator|user"`.
If you don't like any of those, another metric is also OK. But then can you clarify in the help text of the existing metric and the new one how they relate to each other?
Thanks for the suggestions @dimitarvdimitrov. I am probably going to go with one of these options. I like not having an extra metric when we can use the same one. Will work on this today.
I added the label `cause`; by default it is `user`, and we can configure with the function which failures are operator-controllable. What do you think?
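A minimal sketch of how the label value could be chosen where a failure is counted, assuming a hypothetical `classifyCause` helper (the PR's real wiring may differ):

```go
// classifyCause returns the value for the "cause" label. Failures
// default to "user"; the optional classifier can promote them to
// "operator". Hypothetical helper for illustration only.
func classifyCause(err error, isOperatorControllable func(error) bool) string {
	if isOperatorControllable != nil && isOperatorControllable(err) {
		return "operator"
	}
	return "user"
}

// At the evaluation site, roughly:
//   evalFailures.WithLabelValues(groupKey, classifyCause(err, classifier)).Inc()
```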
(btw, chatgpt or other chatbots can be very helpful when thinking about naming problems)
Yes, I agree. I didn't like their suggestions 😢
sweet! let's wait for @chencs's approval before merging
I've just randomly passed by and don't know much context about the work here, sorry. Just wanted to mention that in grafana/mimir#10536 we worked on what looks like a similar problem. There we added the … label. Maybe we can be consistent and use the same naming for the label names here.
Thanks @narqo for the comment. I think it makes total sense. Changed it to …
Summary
This PR adds an optional failure classification to rule evaluation that allows operators to distinguish between operator-controllable failures (e.g. timeouts, server errors) and client errors.

- The `EvalOperatorControllableFailures` metric tracks operator-controllable failures separately.
- The `OperatorControllableErrorClassifier` interface allows custom failure classification logic.
- A label on `prometheus_rule_evaluation_failures_total` to separate between user and operator-controllable failures.
- Check the tests for example usage; an illustrative classifier is sketched below.
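The tests aren't reproduced here, but an operator-supplied classifier would presumably look something like this runnable sketch (the method name mirrors the interface sketched earlier in the thread and is an assumption):

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// timeoutClassifier marks evaluation timeouts as operator-controllable
// and everything else (e.g. a bad query) as a user failure.
type timeoutClassifier struct{}

func (timeoutClassifier) IsOperatorControllable(err error) bool {
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	c := timeoutClassifier{}
	fmt.Println(c.IsOperatorControllable(context.DeadlineExceeded)) // true
	fmt.Println(c.IsOperatorControllable(errors.New("bad query")))  // false
}
```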
Which issue(s) does the PR fix:
Part of https://github.com/grafana/mimir-squad/issues/3255
Does this PR introduce a user-facing change?