fix: Use `topic` instead of `session_id` as the Prometheus label (#1093)
base: main
Changes from all commits: 3e5eddd, b558758, 554e3d8, 4628f3b, 1a7d987, 491cc1a, 44ecc1e, 5e938b5, 94d3239, 6fa5c6e, c4d8b3f
```diff
@@ -34,6 +34,7 @@ Release Notes.

 ### Bug Fixes

+- Use `topic` instead of `session_id` as the Prometheus label on liaison `queue_sub` chunk-ordering counters to avoid unbounded metric cardinality.
```
**Contributor:** Issue 4 — no regression test added. The PR template's … Without it, a future revert to …

**Contributor (Author):** done
```diff
 - Fix flaky trace query filtering caused by non-deterministic sidx tag ordering and add consistency checks for integration query cases.
 - Fix index-mode measure queries returning documents outside the requested time range when a widened segment overlaps the query window.
 - MCP: Add validation for properties and harden the mcp server.
```
```diff
@@ -435,32 +435,32 @@ func newMetrics(factory observability.Factory) *metrics {
 		totalMsgSentErr: factory.NewCounter("total_msg_sent_err", "topic"),

 		// Chunk ordering metrics
-		outOfOrderChunksReceived: factory.NewCounter("out_of_order_chunks_received", "session_id"),
-		chunksBuffered:           factory.NewCounter("chunks_buffered", "session_id"),
-		bufferTimeouts:           factory.NewCounter("buffer_timeouts", "session_id"),
-		largeGapsRejected:        factory.NewCounter("large_gaps_rejected", "session_id"),
-		bufferCapacityExceeded:   factory.NewCounter("buffer_capacity_exceeded", "session_id"),
-		finishSyncErr:            factory.NewCounter("finish_sync_err", "session_id"),
+		outOfOrderChunksReceived: factory.NewCounter("out_of_order_chunks_received", "topic"),
+		chunksBuffered:           factory.NewCounter("chunks_buffered", "topic"),
+		bufferTimeouts:           factory.NewCounter("buffer_timeouts", "topic"),
+		largeGapsRejected:        factory.NewCounter("large_gaps_rejected", "topic"),
+		bufferCapacityExceeded:   factory.NewCounter("buffer_capacity_exceeded", "topic"),
+		finishSyncErr:            factory.NewCounter("finish_sync_err", "topic"),
```
```diff
 	}
 }

 // updateChunkOrderMetrics updates chunk ordering metrics.
-func (s *server) updateChunkOrderMetrics(event, sessionID string) {
+func (s *server) updateChunkOrderMetrics(event, topic string) {
 	if s.metrics == nil {
 		return // Skip metrics if not initialized (e.g., during tests)
 	}
 	switch event {
 	case "out_of_order_received":
-		s.metrics.outOfOrderChunksReceived.Inc(1, sessionID)
+		s.metrics.outOfOrderChunksReceived.Inc(1, topic)
 	case "chunk_buffered":
-		s.metrics.chunksBuffered.Inc(1, sessionID)
+		s.metrics.chunksBuffered.Inc(1, topic)
 	case "buffer_timeout":
```
**Contributor:** Issue 2 — … Since this PR is exactly about cleaning up … Leaving it half-wired ships a …

**Contributor (Author):** done
```diff
-		s.metrics.bufferTimeouts.Inc(1, sessionID)
+		s.metrics.bufferTimeouts.Inc(1, topic)
 	case "gap_too_large":
-		s.metrics.largeGapsRejected.Inc(1, sessionID)
+		s.metrics.largeGapsRejected.Inc(1, topic)
 	case "buffer_full":
-		s.metrics.bufferCapacityExceeded.Inc(1, sessionID)
+		s.metrics.bufferCapacityExceeded.Inc(1, topic)
 	case "finish_sync_err":
-		s.metrics.finishSyncErr.Inc(1, sessionID)
+		s.metrics.finishSyncErr.Inc(1, topic)
 	}
 }
```
**Contributor:** Issue 3 — docs and Grafana panel were ripped out by the docs-revert commit (1a7d987), but the PR description still claims they ship. After upgrade, operators with dashboards or alerts grouped by `session_id` will silently see empty results, and the orphaned high-cardinality `*_session_id` series will linger in Prometheus storage until retention expires. Either restore the `observability.md` and `grafana-cluster.json` updates, or extend this changelog entry with an explicit upgrade note (e.g. "operators must update dashboards keyed on `session_id`; existing per-session series will age out with retention").
**Contributor (Author):** I checked `docs/operation/grafana-cluster.json` and searched for `session_id` and those metric names (like `out_of_order_chunks_received`, `chunks_buffered`, `buffer_timeouts`, etc.), but couldn't find anything.

In `docs/operation/observability.md`, there's only one mention of `banyandb_queue_sub_total_msg_sent_err` (line 69). There's no `session_id`, nor any of the chunk-ordering metrics introduced in this change.

That's why I reverted the docs for now and plan to include the original metrics, as well as the additional metrics introduced later, in the final PR.