fix: fix response cache fetch error metric #8644

carodewig · 2025-11-20T18:07:30Z

We noticed that the apollo.router.operations.response_cache.fetch.error metric was out of sync with the apollo.router.cache.redis.errors metric, because errors were not being returned from the Redis client wrapper.

This PR changes the response caching plugin to increment the error metric as expected.


apollo-librarian · 2025-11-20T18:07:37Z

✅ Docs preview has no changes

The preview was not built because there were no changes.

Build ID: 108ab82db7c67010ce1b9bb1


bnjjj · 2025-11-21T06:40:15Z

apollo-router/src/cache/redis.rs

+
+        values.iter().for_each(|value| {
+            if let Err(err) = value {
+                self.record_error(err)


I think we should return a Result<Vec<Option<RedisValue>>> because it better reflects the behavior we have. Since we're doing an MGET, we get one error from this command, not several. In our metrics we would see spikes of errors when in the end it would just be one error, which could be confusing, and I think it would not be in sync with the Redis errors metric. What do you think?

As we're doing a mget we have 1 error coming from this command not several ones

@bnjjj This statement is accurate when running against Redis, but not Redis Cluster. Because of the Redis sharding, executing an MGET against a cluster effectively results in running a separate MGET per key (keys are spread across 16384 hash slots).
Also with Redis Cluster, it's possible to have a partial success, i.e. if one node is temporarily down, the MGETs can still succeed against the other nodes in the cluster.
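
To make the hash slot point concrete, here is a minimal sketch of how Redis Cluster assigns a key to one of its 16384 slots (CRC16 of the key, or of its `{...}` hash tag, modulo 16384), and of how a batch of keys fans out into per-slot groups. The function names are hypothetical and this is not the router's or the Redis client's code; how the client actually splits an MGET is not shown in this thread.

```rust
use std::collections::BTreeMap;

/// Number of hash slots in a Redis Cluster.
const HASH_SLOTS: u16 = 16384;

/// CRC16, XMODEM variant (polynomial 0x1021, initial value 0), which is the
/// checksum the Redis Cluster specification uses for key hashing.
fn crc16_xmodem(data: &[u8]) -> u16 {
    let mut crc: u16 = 0;
    for &byte in data {
        crc ^= (byte as u16) << 8;
        for _ in 0..8 {
            crc = if crc & 0x8000 != 0 {
                (crc << 1) ^ 0x1021
            } else {
                crc << 1
            };
        }
    }
    crc
}

/// Returns the part of the key that gets hashed: the contents of the first
/// non-empty `{...}` hash tag if there is one, otherwise the whole key.
fn hashed_part(key: &str) -> &str {
    if let Some(open) = key.find('{') {
        let rest = &key[open + 1..];
        if let Some(len) = rest.find('}') {
            if len > 0 {
                return &rest[..len];
            }
        }
    }
    key
}

/// Hash slot for a key.
fn key_slot(key: &str) -> u16 {
    crc16_xmodem(hashed_part(key).as_bytes()) % HASH_SLOTS
}

/// Hypothetical helper: groups keys by slot. Each group could be served by a
/// different node, so each group can succeed or fail independently.
fn group_by_slot<'a>(keys: &[&'a str]) -> BTreeMap<u16, Vec<&'a str>> {
    let mut groups: BTreeMap<u16, Vec<&'a str>> = BTreeMap::new();
    for &key in keys {
        groups.entry(key_slot(key)).or_default().push(key);
    }
    groups
}
```

Keys that share a hash tag, such as `{user1000}.following` and `{user1000}.followers`, land in the same slot; unrelated keys generally do not, which is why one logical MGET can produce several independent outcomes.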

I'm happy to change the apollo.router.cache.redis.errors metric to avoid double-counting, but we'll still have the issue of the response caching error metric being out-of-sync with this metric. And if we switch to returning Result<Vec<Option>> we either (a) don't support returning partial success responses or (b) lose any errors because we'll just return Ok([None, Some(), None, ...]).

I'm not thrilled with any of the solutions, but figured that returning Vec<Result> was the least bad option - it allows us to consistently count errors between metrics (even if the numbers are slightly inflated) and represent partial successes.
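
A rough sketch of the trade-off, for readers following along. `FetchError`, `PerKeyResult`, and the helper functions are hypothetical stand-ins, not the router's actual types; the point is only what each return shape can and cannot express for the same batch of per-key outcomes.

```rust
/// Hypothetical stand-in for the Redis client's error type.
#[derive(Debug, Clone, PartialEq)]
struct FetchError(String);

/// One outcome per requested key: a hit, a miss, or an error.
type PerKeyResult = Result<Option<String>, FetchError>;

/// A batch with a partial success: one hit, one failed node, one miss.
fn sample_results() -> Vec<PerKeyResult> {
    vec![
        Ok(Some("cached-a".to_string())),
        Err(FetchError("node unavailable".to_string())),
        Ok(None),
    ]
}

/// Option (a): Result<Vec<Option<_>>>. The first error wins, so the hit that
/// did come back is discarded along with everything else.
fn all_or_nothing(results: Vec<PerKeyResult>) -> Result<Vec<Option<String>>, FetchError> {
    results.into_iter().collect()
}

/// Option (b): always Ok, with errors turned into misses. The hit survives,
/// but the caller can no longer tell an error from a miss.
fn errors_as_misses(results: Vec<PerKeyResult>) -> Vec<Option<String>> {
    results.into_iter().map(|r| r.ok().flatten()).collect()
}

/// With Vec<Result<_>> the caller keeps both the hits and the errors.
fn count_errors(results: &[PerKeyResult]) -> usize {
    results.iter().filter(|r| r.is_err()).count()
}
```

With the sample batch, `all_or_nothing` returns only the error, `errors_as_misses` returns the hit followed by two indistinguishable `None`s, and the untouched `Vec<Result<...>>` still exposes one hit, one error, and one miss.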

Maybe the best solution is to keep a Vec<Result<...>> but increment the metric only once if it's not Redis Cluster? What do you think?

I've been thinking about this some more in the context of the Redis gateway. I'd really like to keep the Redis internals separate, i.e. the response cache plugin shouldn't know whether this is Redis Cluster, Sentinel, etc.

I wonder if the better outcome here would be to make the response_cache.fetch.error metric intentionally different from the redis.errors metric. response_cache.fetch.error would increment when there is an error in any of the fetches, but wouldn't be related to the number of errors within the Redis connection.

I think this would still involve returning Vec<Result<...>>, but there would be two changes to the metrics in this PR:

I'd go back to not incrementing redis.errors on each individual item, and just do it on the actual errors

response_cache.fetch.error would increment by one if results.iter().any(|r| r.is_err())
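
The proposal above could look roughly like this. It is a sketch only: `Counters` and both functions are hypothetical, and plain integer fields stand in for the real apollo.router.cache.redis.errors and apollo.router.operations.response_cache.fetch.error instruments.

```rust
/// Hypothetical in-memory stand-ins for the two metrics.
#[derive(Debug, Default, PartialEq)]
struct Counters {
    /// Stand-in for the Redis wrapper's error metric.
    redis_errors: u64,
    /// Stand-in for the response cache plugin's fetch error metric.
    fetch_errors: u64,
}

/// Redis wrapper side: called once per command that actually failed, not once
/// per key affected by that failure.
fn record_redis_command_error(counters: &mut Counters) {
    counters.redis_errors += 1;
}

/// Plugin side: one increment per fetch if any per-key result is an error,
/// regardless of how many keys failed. The plugin needs no knowledge of
/// whether the backend is a single node, a cluster, or something else.
fn record_fetch_outcome<T, E>(counters: &mut Counters, results: &[Result<T, E>]) {
    if results.iter().any(|r| r.is_err()) {
        counters.fetch_errors += 1;
    }
}
```

Under this scheme the two counters are allowed to diverge: a single failed command that affects several keys adds one to each, while a fetch that touches several failing nodes adds several to the Redis counter but still only one to the fetch counter.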

fix: return Results from MGETs to be able to report errors in callers

b11ca90


carodewig added 2 commits November 20, 2025 16:59

test: check that apollo.router.operations.response_cache.fetch.error …

f1f4440

…observes timeouts

doc: changeset

ee3ead3

carodewig marked this pull request as ready for review November 20, 2025 22:43

carodewig requested review from a team as code owners November 20, 2025 22:43

Merge branch 'dev' into caroline/propagate-mget-errors

ee0203d

bnjjj requested changes Nov 21, 2025


abernix assigned carodewig Nov 24, 2025

carodewig added 4 commits November 25, 2025 14:05

Merge branch 'dev' into caroline/propagate-mget-errors

fab98a0

fmt: reformat after merge

114f98c

test: fix results after merge

83b43cf

test: improve test reliability

76c12f8
