fix: fix response cache fetch error metric #8644
base: dev
values.iter().for_each(|value| {
    if let Err(err) = value {
        self.record_error(err)
I think we should return a Result<Vec<Option<RedisValue>>> because it better reflects the behavior we actually have. Since we're doing an MGET, we get one error from this command, not several. In our metrics we would see spikes of errors when in the end there was just one error, which could be confusing, and it would not be in sync with the Redis errors metric, I think. What do you think?
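To make the concern concrete, here is a minimal sketch of how per-item counting inflates the number when one command-level failure is reported once per requested key. Both helpers are hypothetical and not part of the router codebase, and the fan-out behavior is an assumption about how a wrapper might surface a failed MGET.

```rust
/// Counts errors the way a per-item loop would: once for every failed entry.
pub fn per_item_error_count<T, E>(results: &[Result<T, E>]) -> usize {
    results.iter().filter(|r| r.is_err()).count()
}

/// Hypothetical helper: models a wrapper that reports one command-level
/// failure as a separate Err for each requested key (an assumption).
pub fn fan_out_failure(keys: usize, message: &str) -> Vec<Result<Option<String>, String>> {
    (0..keys).map(|_| Err(message.to_string())).collect()
}
```

Under that assumption, one failed MGET over five keys would show up as five errors in a per-item counter.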
> As we're doing a mget we have 1 error coming from this command not several ones
@bnjjj This statement is accurate when running against a single Redis instance, but not against Redis Cluster. Because Redis Cluster shards keys across 16384 hash slots, executing an MGET against a cluster effectively results in running a separate MGET per key.
Redis Cluster also makes partial success possible: if one node is temporarily down, the MGETs can still succeed against the other nodes in the cluster.
I'm happy to change the apollo.router.cache.redis.errors metric to avoid double-counting, but we'll still have the issue of the response caching error metric being out of sync with this metric. And if we switch to returning Result<Vec<Option>>, we either (a) can't return partial success responses or (b) lose the errors entirely, because we'll just return Ok([None, Some(), None, ...]).
I'm not thrilled with any of the solutions, but figured that returning Vec<Result> was the least bad option: it lets us count errors consistently between metrics (even if the numbers are slightly inflated) and represent partial successes.
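The trade-off between the two return shapes can be sketched as follows. This is only an illustration: `FetchError`, `PerKey`, and both conversion functions are hypothetical names, and `String` stands in for the real cached value type.

```rust
/// Hypothetical error type standing in for whatever the Redis client returns.
#[derive(Debug, Clone, PartialEq)]
pub struct FetchError(pub String);

/// Per-key outcome: Ok(Some(v)) is a hit, Ok(None) a miss, Err(_) a failed lookup.
pub type PerKey = Result<Option<String>, FetchError>;

/// Option (a): all-or-nothing. The first error discards every successful value.
pub fn all_or_nothing(results: Vec<PerKey>) -> Result<Vec<Option<String>>, FetchError> {
    results.into_iter().collect()
}

/// Option (b): keep partial successes but turn errors into misses, so the
/// caller can no longer tell a failed lookup from an absent key.
pub fn errors_as_misses(results: Vec<PerKey>) -> Vec<Option<String>> {
    results.into_iter().map(|r| r.unwrap_or(None)).collect()
}
```

Keeping the `Vec<PerKey>` shape avoids both losses, because each key carries its own hit, miss, or error.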
Maybe the best solution is to keep a Vec<Result<...>> but increment the metric only once if it's not Redis Cluster? What do you think?
I've been thinking about this some more in the context of the Redis gateway. I'd really like to keep the Redis internals separate, i.e. the response cache plugin shouldn't know whether this is Redis Cluster, Sentinel, etc.
I wonder if the better outcome here would be to make the response_cache.fetch.errors metric intentionally different from the redis.errors metric. response_cache.fetch.errors would increment when there is an error in any of the fetches, but wouldn't be tied to the number of errors within the Redis connection.
I think this would still involve returning Vec<Result<...>>, but there would be two changes to the metrics in this PR:
- I'd go back to not incrementing redis.errors on each individual item, and just do it on the actual errors
- response_cache.fetch.errors would increment by one if results.any(|r| r.is_err())
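A minimal sketch of the second change, assuming an atomic counter as a stand-in for the real metric. `FetchErrorCounter` and `record_batch` are hypothetical names, not the router's metrics API. Note that on a `Vec` or slice the check needs `.iter()` before `.any(...)`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the response cache fetch error counter.
#[derive(Default)]
pub struct FetchErrorCounter {
    count: AtomicU64,
}

impl FetchErrorCounter {
    /// Adds one to the counter if at least one per-key result is an error,
    /// regardless of how many entries failed. Returns whether it incremented.
    pub fn record_batch<T, E>(&self, results: &[Result<T, E>]) -> bool {
        let any_error = results.iter().any(|r| r.is_err());
        if any_error {
            self.count.fetch_add(1, Ordering::Relaxed);
        }
        any_error
    }

    /// Current value of the counter.
    pub fn value(&self) -> u64 {
        self.count.load(Ordering::Relaxed)
    }
}
```

With this shape the caller never needs to know whether the backend is a cluster: a batch with two failed keys and a batch with one failed key each add exactly one to the counter.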
We noticed that the apollo.router.operations.response_cache.fetch.error metric was out of sync with the apollo.router.cache.redis.errors metric, because errors were not being returned from the Redis client wrapper.
This PR changes the response caching plugin to increment the error metric as expected.
Checklist
Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.
Exceptions
Note any exceptions here
Notes
Footnotes
It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.
Configuration is an important part of many changes. Where applicable please try to document configuration examples.
A lot of (if not most) features benefit from built-in observability and debug-level logs. Please read this guidance on metrics best-practices.
Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.