
Conversation

NickAnge
Contributor

@NickAnge NickAnge commented Sep 24, 2025

Summary

This PR adds an optional failure classification to rule evaluation that allows operators to distinguish between operator-controllable failures (e.g. timeouts, server errors) and client errors.

  • New EvalOperatorControllableFailures metric tracks operator-controllable failures separately.
  • OperatorControllableErrorClassifier interface allows custom failure classification logic.
  • Add a cause label to prometheus_rule_evaluation_failures_total to separate operator-controllable from user-controllable failures.

Check the tests for example usage; a rough sketch of the hook follows below.
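
A minimal sketch of how the hook could look, based on the names in this description and the review thread below; timeoutsAreOperatorControllable is a hypothetical example classifier, not part of the PR:

    package rules

    import (
        "context"
        "errors"
    )

    // EvaluationFailureClassifierFunc reports whether a rule evaluation
    // error is operator-controllable.
    type EvaluationFailureClassifierFunc func(err error) bool

    // DefaultEvaluationFailureClassifierFunc classifies no errors as
    // operator-controllable, so every failure defaults to the user cause.
    func DefaultEvaluationFailureClassifierFunc(_ error) bool {
        return false
    }

    // A hypothetical operator-supplied classifier that treats timeouts
    // as operator-controllable:
    func timeoutsAreOperatorControllable(err error) bool {
        return errors.Is(err, context.DeadlineExceeded)
    }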

Which issue(s) does the PR fix:

Part of https://github.com/grafana/mimir-squad/issues/3255

Does this PR introduce a user-facing change?

[FEATURE] Rules: Add optional operator-controllable error classifier for rule evaluation metrics.
[ENHANCEMENT] Rules: Add cause label to prometheus_rule_evaluation_failures_total metric - distinguishes between operator-controllable (cause="operator") and user-controllable (cause="user") rule evaluation failures

@NickAnge NickAnge force-pushed the nickange/ruler/add-optional-failure-claissifier-hook branch 3 times, most recently from 7b05e1d to 3868080 Compare September 24, 2025 16:31
@NickAnge NickAnge force-pushed the nickange/ruler/add-optional-failure-claissifier-hook branch from 3868080 to a02ff2a Compare September 24, 2025 16:47
@NickAnge NickAnge marked this pull request as ready for review September 25, 2025 09:59
rules/group.go Outdated
}

// DefaultEvaluationFailureClassifierFunc is the default implementation of
// EvaluationFailureClassifierFunc that classifies no errors as operator-controllable.
Contributor

@chencs chencs Oct 3, 2025

It's a little confusing that this comment frames tracking operator-controllable errors as the intended use case, but that intention isn't really propagated in the naming or any other way (in the end, an operator needs to read this specific code comment to understand the purpose of EvaluationFailureClassifierFunc).

I think it might be better to do one of the following:

  • remove the mention here and discuss the concept only in the PR description
  • discuss "keeping track of operator-controllable errors" in code as an example of what an operator might use this function for, rather than framing it as EvaluationFailureClassifierFunc's intended purpose
  • make the naming of the metric/classifier func more explicit

I have a medium-strong preference for the second point over the others.

Contributor

+1 to what @chencs said. My 2c: keep this simple and go with the third option. We're the only clients of this and have a clear use case, so there's no need to make it overly flexible; in reality there will be exactly one implementation of EvaluationFailureClassifierFunc.

If you decide to go with option 2, then maybe the classifier can return a label value for the metric, e.g. client or server.

Contributor Author

I went with the 3rd option by making the name more explicit. Tried to keep it simple. Let me know what you think.

rules/group.go Outdated
Comment on lines 1042 to 1049
EvalFilteredFailures: prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Namespace: namespace,
        Name:      "rule_evaluation_filtered_failures_total",
        Help:      "The total number of rule evaluation failures classified by the failure classifier function.",
    },
    []string{"rule_group"},
),
Contributor

When you have two metrics, it's never obvious how they relate to each other when you write the queries. Does rule_evaluation_failures_total include rule_evaluation_filtered_failures_total or not? Is rule_evaluation_filtered_failures_total a subset of rule_evaluation_failures_total?

An easier way would be to add a label to rule_evaluation_failures_total:

rule_evaluation_failures_total{class="client", rule_group="..."}
rule_evaluation_failures_total{class="server", rule_group="..."}

(class or type or something else)
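
For illustration, a sketch of that single-metric shape in Go; the function name is hypothetical and the label name is just one of the candidates from this thread:

    package rules

    import "github.com/prometheus/client_golang/prometheus"

    // One metric with an extra label: summing over the label recovers
    // the old total, and filtering on it isolates one class of failure.
    func newEvalFailuresVec(namespace string) *prometheus.CounterVec {
        return prometheus.NewCounterVec(
            prometheus.CounterOpts{
                Namespace: namespace,
                Name:      "rule_evaluation_failures_total",
                Help:      "The total number of rule evaluation failures.",
            },
            []string{"rule_group", "cause"},
        )
    }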

Contributor Author

Hmm. I struggled to find a good label that would make me change the implementation from a separate metric to a single metric with two labels.

I didn't want to go with the client/server separation because we might include client errors (4xx) as operator-controllable. I thought about operatorControllable with values true or false. 🤔 But that's still not good enough in my opinion. I am open to discussing this.

Contributor

You can go with class="internal|validation", internal="true|false", reason="operator|user", or cause="operator|user".

Contributor

If you don't like any of those, another metric is also OK. But then could you clarify in the help text of the existing metric and the new one how they relate to each other?

Contributor Author

Thanks for the suggestions @dimitarvdimitrov. I'll probably go with one of these options. I like not having an extra metric when we can use the same one. Will work on this today.

Contributor Author

I added the cause label. By default it is user, and we can configure which failures are operator-controllable with the function. What do you think?
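
Roughly, the recording path described here might look like this (function and parameter names are hypothetical; the counter is assumed to carry the rule_group and cause labels in that order):

    package rules

    import "github.com/prometheus/client_golang/prometheus"

    // recordEvalFailure attributes a failure to cause="user" by default;
    // an optional classifier can promote specific errors to "operator".
    func recordEvalFailure(failures *prometheus.CounterVec, classify func(error) bool, groupKey string, err error) {
        cause := "user"
        if classify != nil && classify(err) {
            cause = "operator"
        }
        failures.WithLabelValues(groupKey, cause).Inc()
    }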

Contributor

(BTW, ChatGPT or other chatbots can be very helpful when thinking about naming problems.)

Contributor Author

Yes, I agree. I didn't like their suggestion 😢

…plicit to the use case

Signed-off-by: Nikos Angelopoulos <[email protected]>
@NickAnge NickAnge force-pushed the nickange/ruler/add-optional-failure-claissifier-hook branch from 18b6fdc to 814e8bf Compare October 7, 2025 12:08
@NickAnge NickAnge force-pushed the nickange/ruler/add-optional-failure-claissifier-hook branch from b060690 to a7f2156 Compare October 8, 2025 10:05
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Sweet! Let's wait for @chencs's approval before merging.

@NickAnge NickAnge requested a review from chencs October 9, 2025 08:11
@narqo
Contributor

narqo commented Oct 9, 2025

I've just randomly passed by and don't know much context about the work here, sorry.

Just wanted to mention that in grafana/mimir#10536 we worked on what looks like a similar problem. There we added the reason=<cause> label to cortex_ruler_queries_failed_total and cortex_ruler_write_requests_failed_total, which Grafana was supposed to show to users in the insights.

Maybe we can be consistent, and use the same naming for the label names here.

@NickAnge
Contributor Author

NickAnge commented Oct 9, 2025

Thanks @narqo for the comment. I think it makes total sense. Changed it to reason.

@NickAnge NickAnge merged commit 0ed5c29 into main Oct 10, 2025
28 checks passed
@NickAnge NickAnge deleted the nickange/ruler/add-optional-failure-claissifier-hook branch October 10, 2025 07:39