
Conversation

NickAnge
Contributor

@NickAnge NickAnge commented Sep 24, 2025

Summary

This PR adds an optional failure classification to rule evaluation that allows operators to distinguish between operator-controllable failures (e.g. timeouts, server errors) and client errors.

  • New EvalOperatorControllableFailures metric tracks operator-controllable failures separately.
  • OperatorControllableErrorClassifier interface allows custom failure classification logic.
  • Add a cause label to prometheus_rule_evaluation_failures_total to separate operator-controllable from user-controllable failures.

Check the tests for example usage; a rough sketch of the hook follows below.
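
A minimal sketch of how the hook could look, based on the names in this description and the review thread below; timeoutsAreOperatorControllable is a hypothetical example classifier, not part of the PR:

    package rules

    import (
        "context"
        "errors"
    )

    // EvaluationFailureClassifierFunc reports whether a rule evaluation
    // error is operator-controllable.
    type EvaluationFailureClassifierFunc func(err error) bool

    // DefaultEvaluationFailureClassifierFunc classifies no errors as
    // operator-controllable, so every failure defaults to the user cause.
    func DefaultEvaluationFailureClassifierFunc(_ error) bool {
        return false
    }

    // A hypothetical operator-supplied classifier that treats timeouts
    // as operator-controllable:
    func timeoutsAreOperatorControllable(err error) bool {
        return errors.Is(err, context.DeadlineExceeded)
    }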

Which issue(s) does the PR fix:

Part of https://github.com/grafana/mimir-squad/issues/3255

Does this PR introduce a user-facing change?

[FEATURE] Rules: Add optional operator-controllable error classifier for rule evaluation metrics.
[ENHANCEMENT] Rules: Add cause label to prometheus_rule_evaluation_failures_total metric - distinguishes between operator-controllable (cause="operator") and user-controllable (cause="user") rule evaluation failures

@NickAnge NickAnge force-pushed the nickange/ruler/add-optional-failure-claissifier-hook branch 3 times, most recently from 7b05e1d to 3868080 Compare September 24, 2025 16:31
@NickAnge NickAnge force-pushed the nickange/ruler/add-optional-failure-claissifier-hook branch from 3868080 to a02ff2a Compare September 24, 2025 16:47
@NickAnge NickAnge marked this pull request as ready for review September 25, 2025 09:59
rules/group.go Outdated
}

// DefaultEvaluationFailureClassifierFunc is the default implementation of
// EvaluationFailureClassifierFunc that classifies no errors as operator-controllable.
Contributor

@chencs chencs Oct 3, 2025

It's a little confusing that this comment frames tracking operator-controllable errors as the intended use case, but that intention isn't really propagated in the naming or any other way (in the end, an operator needs to read this specific code comment to understand the purpose of EvaluationFailureClassifierFunc).

I think it might be better to do one of the following:

  • remove the mention here and discuss the concept only in the PR description
  • discuss "keeping track of operator-controllable errors" in code as an example of what an operator might use this function for, rather than framing it as EvaluationFailureClassifierFunc's intended purpose
  • make the naming of the metric/classifier func more explicit

I have a medium-strong preference for the second point over the others.

Contributor

+1 to what @chencs said. My 2c: keep this simple and go with the third option. We're the only clients of this and have a clear use case, so there's no need to make it overly flexible; in reality there will be exactly one implementation of EvaluationFailureClassifierFunc.

If you decide to go with option 2, then maybe the classifier can return a label value for the metric, e.g. client or server.

Contributor Author

I went with the 3rd option by making the name more explicit. Tried to keep it simple. Let me know what you think.

rules/group.go Outdated
Comment on lines 1042 to 1049
EvalFilteredFailures: prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Namespace: namespace,
        Name:      "rule_evaluation_filtered_failures_total",
        Help:      "The total number of rule evaluation failures classified by the failure classifier function.",
    },
    []string{"rule_group"},
),
Contributor

When you have two metrics, it's never obvious how they relate to each other when you write the queries. Does rule_evaluation_failures_total include rule_evaluation_filtered_failures_total or not? Is rule_evaluation_filtered_failures_total a subset of rule_evaluation_failures_total?

An easier way would be to add a label to rule_evaluation_failures_total:

rule_evaluation_failures_total{class="client", rule_group="..."}
rule_evaluation_failures_total{class="server", rule_group="..."}

(class or type or something else)
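
For illustration, a sketch of that single-metric shape in Go; the function name is hypothetical and the label name is just one of the candidates from this thread:

    package rules

    import "github.com/prometheus/client_golang/prometheus"

    // One metric with an extra label: summing over the label recovers
    // the old total, and filtering on it isolates one class of failure.
    func newEvalFailuresVec(namespace string) *prometheus.CounterVec {
        return prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Namespace: namespace,
                Name:      "rule_evaluation_failures_total",
                Help:      "The total number of rule evaluation failures.",
            },
            []string{"rule_group", "cause"},
        )
    }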

Contributor Author

Hmm. I struggled to find a good label that would make me change the implementation from a separate metric to a single metric with two labels.

I didn't want to go with the client/server separation because we might include client errors (4xx) as operator-controllable. I thought about operatorControllable with values true or false. 🤔 But that's still not good enough in my opinion. I am open to discussing this.

Contributor

You can go with class="internal|validation", internal="true|false", reason="operator|user", or cause="operator|user".

Contributor

If you don't like any of those, another metric is also OK. But then could you clarify in the help text of the existing metric and the new one how they relate to each other?

Contributor Author

Thanks for the suggestions @dimitarvdimitrov. I'll probably go with one of these options. I like not having an extra metric when we can use the same one. Will work on this today.

Contributor Author

I added the cause label. By default it is user, and we can configure which failures are operator-controllable with the function. What do you think?
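
Roughly, the recording path described here might look like this (function and parameter names are hypothetical; the counter is assumed to carry the rule_group and cause labels in that order):

    package rules

    import "github.com/prometheus/client_golang/prometheus"

    // recordEvalFailure attributes a failure to cause="user" by default;
    // an optional classifier can promote specific errors to "operator".
    func recordEvalFailure(failures *prometheus.CounterVec, classify func(error) bool, groupKey string, err error) {
        cause := "user"
        if classify != nil && classify(err) {
            cause = "operator"
        }
        failures.WithLabelValues(groupKey, cause).Inc()
    }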

Contributor

(BTW, ChatGPT or other chatbots can be very helpful when thinking about naming problems.)

Contributor Author

Yes, I agree. I didn't like their suggestion 😢

…plicit to the use case

Signed-off-by: Nikos Angelopoulos <[email protected]>
@NickAnge NickAnge force-pushed the nickange/ruler/add-optional-failure-claissifier-hook branch from 18b6fdc to 814e8bf Compare October 7, 2025 12:08
@NickAnge NickAnge force-pushed the nickange/ruler/add-optional-failure-claissifier-hook branch from b060690 to a7f2156 Compare October 8, 2025 10:05
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Sweet! Let's wait for @chencs's approval before merging.

@NickAnge NickAnge requested a review from chencs October 9, 2025 08:11
@narqo
Contributor

narqo commented Oct 9, 2025

I've just randomly passed by and don't know much context about the work here, sorry.

Just wanted to mention that in grafana/mimir#10536 we worked on what looks like a similar problem. There we added the reason=<cause> label to cortex_ruler_queries_failed_total and cortex_ruler_write_requests_failed_total, which Grafana was supposed to show to users in the insights.

Maybe we can be consistent, and use the same naming for the label names here.

@NickAnge
Contributor Author

NickAnge commented Oct 9, 2025

Thanks @narqo for the comment. I think it makes total sense. Changed it to reason.

@NickAnge NickAnge merged commit 0ed5c29 into main Oct 10, 2025
28 checks passed
@NickAnge NickAnge deleted the nickange/ruler/add-optional-failure-claissifier-hook branch October 10, 2025 07:39