Conversation

@carodewig (Contributor) commented Nov 20, 2025

We noticed that the apollo.router.operations.response_cache.fetch.error metric was out of sync with the apollo.router.cache.redis.errors metric, because errors were not being returned from the Redis client wrapper.

This PR changes the response caching plugin to increment the error metric as expected.


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • PR description explains the motivation for the change and relevant context for reviewing
  • PR description links appropriate GitHub/Jira tickets (creating when necessary)
  • Changeset is included for user-facing changes
  • Changes are compatible¹
  • Documentation² completed
  • Performance impact assessed and acceptable
  • Metrics and logs are added³ and documented
  • Tests added and passing⁴
    • Unit tests
    • Integration tests
    • Manual tests, as necessary

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Many (if not most) features benefit from built-in observability and debug-level logs. Please read this guidance on metrics best practices.

  4. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

@apollo-librarian bot commented Nov 20, 2025

✅ Docs preview has no changes

The preview was not built because there were no changes.

Build ID: 108ab82db7c67010ce1b9bb1
Build Logs: View logs


@carodewig carodewig marked this pull request as ready for review November 20, 2025 22:43
@carodewig carodewig requested review from a team as code owners November 20, 2025 22:43

```rust
values.iter().for_each(|value| {
    if let Err(err) = value {
        self.record_error(err)
    }
});
```
Contributor
I think we should return a Result<Vec<Option<RedisValue>>> because it better reflects the behavior we have. Since we're doing an mget, we get one error from the command, not several. In our metrics we would see spikes of errors, but in the end it would just be one error, which could be confusing; it would not be in sync with the Redis errors metric, I think. What do you think?

@carodewig (Contributor, author) commented Nov 24, 2025

> As we're doing a mget we have 1 error coming from this command not several ones

@bnjjj This statement is accurate when running against Redis, but not against Redis Cluster. Because of Redis Cluster's sharding, executing an MGET against a cluster effectively runs a separate MGET per node, since keys are distributed across 16384 hash slots.
Also because of Redis Cluster, it's possible to have a partial success: i.e., if one node is temporarily down, the MGETs can still succeed against the other nodes in the cluster.
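For context, Redis Cluster maps each key to a slot via CRC16(key) mod 16384, so a multi-key MGET whose keys land in different slots cannot be served by a single node. A minimal sketch of the slot calculation (CRC16-CCITT/XMODEM as described in the Redis Cluster spec; the real mapping also honors `{hash tag}` substrings, which this sketch omits, and all function names here are illustrative):

```rust
/// CRC16-CCITT (XMODEM variant: polynomial 0x1021, initial value 0x0000),
/// the checksum Redis Cluster uses for key-to-slot mapping.
fn crc16(data: &[u8]) -> u16 {
    let mut crc: u16 = 0;
    for &byte in data {
        crc ^= (byte as u16) << 8;
        for _ in 0..8 {
            crc = if crc & 0x8000 != 0 {
                (crc << 1) ^ 0x1021
            } else {
                crc << 1
            };
        }
    }
    crc
}

/// Map a key to one of Redis Cluster's 16384 hash slots.
/// (The real algorithm first looks for a {hash tag}; omitted here.)
fn hash_slot(key: &[u8]) -> u16 {
    crc16(key) % 16384
}

fn main() {
    // Reference vector from the Redis Cluster spec: CRC16("123456789") == 0x31C3.
    assert_eq!(crc16(b"123456789"), 0x31C3);
    // Two keys will often hash to different slots, hence potentially
    // different nodes, which is why one logical MGET can fan out.
    println!("slot(user:1) = {}", hash_slot(b"user:1"));
    println!("slot(user:2) = {}", hash_slot(b"user:2"));
}
```

This is why a single MGET against a cluster can produce several independent node-level outcomes rather than one.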

I'm happy to change the apollo.router.cache.redis.errors metric to avoid double-counting, but we'll still have the issue of the response caching error metric being out of sync with this metric. And if we switch to returning Result<Vec<Option>>, we either (a) don't support returning partial-success responses or (b) lose any errors, because we'd just return Ok([None, Some(), None, ...]).

I'm not thrilled with any of the solutions, but figured that returning Vec<Result> was the least bad option - it allows us to consistently count errors between metrics (even if the numbers are slightly inflated) and represent partial successes.
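To make the trade-off concrete, here is a hedged sketch of the two return shapes discussed above (the type aliases and function names are illustrative stand-ins, not the router's actual API):

```rust
// Illustrative stand-ins for the real types in the Redis client wrapper.
type RedisValue = String;
type RedisError = String;

// One Result for the whole mget: a single node failure forces a choice
// between dropping the whole batch (Err) or silently dropping the error.
fn mget_all_or_nothing(
    results: Vec<Result<Option<RedisValue>, RedisError>>,
) -> Result<Vec<Option<RedisValue>>, RedisError> {
    // Collecting a Vec<Result<T, E>> into Result<Vec<T>, E> stops at
    // the first Err, discarding the successful fetches.
    results.into_iter().collect()
}

// One Result per key: partial successes against a cluster are
// representable, and every error remains observable.
fn mget_per_key(
    results: Vec<Result<Option<RedisValue>, RedisError>>,
) -> Vec<Result<Option<RedisValue>, RedisError>> {
    results
}

fn main() {
    // One node down: the second key fails, the others still succeed.
    let partial = vec![
        Ok(Some("a-value".to_string())),
        Err("node 2 unreachable".to_string()),
        Ok(None), // key present on a healthy node but unset
    ];

    // Result<Vec<Option<_>>> collapses the partial success into one Err.
    assert!(mget_all_or_nothing(partial.clone()).is_err());

    // Vec<Result<_>> keeps the two successful fetches and the one error.
    let per_key = mget_per_key(partial);
    assert_eq!(per_key.iter().filter(|r| r.is_ok()).count(), 2);
    assert_eq!(per_key.iter().filter(|r| r.is_err()).count(), 1);
}
```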

Contributor

Maybe the best solution is to keep a Vec<Result<...>> but only increment the metric once if it's not Redis Cluster? What do you think?

@carodewig (Contributor, author) commented Nov 26, 2025

I've been thinking about this some more in the context of the Redis gateway. I'd really like to keep the Redis internals separate; i.e., the response cache plugin shouldn't know whether it's talking to Redis Cluster, Sentinel, etc.

I wonder if the better outcome here would be to make the response_cache.fetch.errors metric intentionally different from the redis.errors metric. response_cache.fetch.errors would increment when there is an error in any of the fetches, but wouldn't be tied to the number of errors within the Redis connection.

I think this would still involve returning Vec<Result<...>>, but there would be two changes to the metrics in this PR:

  1. I'd go back to not incrementing redis.errors on each individual item, and only increment it on the actual Redis errors
  2. response_cache.fetch.errors would increment by one if results.iter().any(|r| r.is_err())
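The two changes above could look roughly like this (the Metrics struct and method names are made up for illustration; the router's real metrics macros and metric plumbing differ):

```rust
// Illustrative stand-ins for the real types in the Redis client wrapper.
type RedisValue = String;
type RedisError = String;

/// Hypothetical stand-in for the router's metric counters.
#[derive(Default)]
struct Metrics {
    /// apollo.router.cache.redis.errors
    redis_errors: u64,
    /// apollo.router.operations.response_cache.fetch.error
    response_cache_fetch_errors: u64,
}

impl Metrics {
    /// (1) redis.errors counts actual Redis-level errors: one per failed
    /// command/node, not one per affected key.
    fn record_redis_error(&mut self, _err: &RedisError) {
        self.redis_errors += 1;
    }

    /// (2) fetch.errors increments by at most one per fetch batch:
    /// "this fetch had an error", decoupled from how many keys or
    /// cluster nodes were involved.
    fn record_fetch(&mut self, results: &[Result<Option<RedisValue>, RedisError>]) {
        if results.iter().any(|r| r.is_err()) {
            self.response_cache_fetch_errors += 1;
        }
    }
}

fn main() {
    let mut metrics = Metrics::default();
    // Two key-level errors that both came from one underlying node failure.
    let results: Vec<Result<Option<RedisValue>, RedisError>> = vec![
        Ok(Some("hit".into())),
        Err("node down".into()),
        Err("node down".into()),
    ];
    metrics.record_redis_error(&"node down".to_string());
    metrics.record_fetch(&results);
    assert_eq!(metrics.redis_errors, 1);
    assert_eq!(metrics.response_cache_fetch_errors, 1);
}
```

Under this scheme the two metrics are intentionally decoupled rather than forced to agree count-for-count.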

3 participants