fix: Use topic instead of session_id as the Prometheus label #1093

OmCheeLin wants to merge 11 commits into apache:main

Conversation
session_id is a per-sync UUID and causes unbounded Prometheus cardinality. Use the sync metadata topic at each updateChunkOrderMetrics call site. Made-with: Cursor
- observability.md: document counters and PromQL grouped by topic
- grafana-cluster.json: row + timeseries for ordering event rates by topic

Made-with: Cursor
Pull request overview
This PR addresses unbounded Prometheus metric cardinality for liaison queue_sub chunk-ordering counters by switching the label from per-sync session_id to bounded topic, and updates operator-facing docs/dashboards accordingly.
Changes:
- Update `queue_sub` chunk-ordering counters to use the `topic` label instead of `session_id`.
- Document the chunk-ordering metrics and provide example PromQL queries.
- Add a Grafana dashboard row/panel to visualize chunk-ordering counters by `topic`, and record the change in `CHANGES.md`.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| banyand/queue/sub/server.go | Switch chunk-ordering counters to label by topic; update metric increment helper accordingly. |
| banyand/queue/sub/chunked_sync.go | Emit chunk-ordering metric increments using the session metadata Topic. |
| docs/operation/observability.md | Add a new section describing queue_sub chunk-ordering counters and example PromQL. |
| docs/operation/grafana-cluster.json | Add a Grafana row/panel for chunk-ordering counters grouped by topic. |
| CHANGES.md | Add a bug-fix entry describing the label cardinality change. |
```
|--------|---------|
| `banyandb_queue_sub_out_of_order_chunks_received` | Out-of-order chunk observed (before buffer/gap handling). |
| `banyandb_queue_sub_chunks_buffered` | Chunk stored in reorder buffer. |
| `banyandb_queue_sub_buffer_timeouts` | Reorder buffer timed out waiting for missing chunks. |
```
This doc entry describes the banyandb_queue_sub_buffer_timeouts counter, but the current implementation never increments/emits that metric (no production call sites). Either implement the buffer-timeout metric emission or remove this line to avoid documenting a non-functional signal.
```json
"expr": "sum(rate(banyandb_queue_sub_buffer_timeouts{job=~\"$job\",pod=~\"$pod\"}[$__rate_interval])) by (topic)",
"hide": false,
"instant": false,
"legendFormat": "buffer_timeout {{topic}}",
"range": true,
"refId": "C"
},
{
  "datasource": {
    "type": "prometheus",
    "uid": "${DS_PROMETHEUS}"
  },
  "editorMode": "code",
```
This panel queries banyandb_queue_sub_buffer_timeouts, but the current code path does not increment/emit that counter (no production call sites for the buffer_timeout event). Consider removing this target or implementing the missing metric emission so the panel is not permanently empty.
I’ve reverted the doc-related changes. I’ll handle the documentation in the final PR.
```diff
 // Chunk ordering metrics
-outOfOrderChunksReceived: factory.NewCounter("out_of_order_chunks_received", "session_id"),
-chunksBuffered:           factory.NewCounter("chunks_buffered", "session_id"),
-bufferTimeouts:           factory.NewCounter("buffer_timeouts", "session_id"),
-largeGapsRejected:        factory.NewCounter("large_gaps_rejected", "session_id"),
-bufferCapacityExceeded:   factory.NewCounter("buffer_capacity_exceeded", "session_id"),
-finishSyncErr:            factory.NewCounter("finish_sync_err", "session_id"),
+outOfOrderChunksReceived: factory.NewCounter("out_of_order_chunks_received", "topic"),
+chunksBuffered:           factory.NewCounter("chunks_buffered", "topic"),
+bufferTimeouts:           factory.NewCounter("buffer_timeouts", "topic"),
+largeGapsRejected:        factory.NewCounter("large_gaps_rejected", "topic"),
+bufferCapacityExceeded:   factory.NewCounter("buffer_capacity_exceeded", "topic"),
+finishSyncErr:            factory.NewCounter("finish_sync_err", "topic"),
```
No unit test was added to verify that chunk-ordering counters are labeled by topic (and that session_id is not used), which is the core regression this PR is addressing. Please add a unit test that asserts the emitted metric series contain the topic label only, ideally by recording a chunk-ordering event and inspecting the metrics registry output.
```diff
 if req.ChunkIndex > buffer.expectedIndex {
 	gap := req.ChunkIndex - buffer.expectedIndex
-	s.updateChunkOrderMetrics("out_of_order_received", req.SessionId)
+	if session.metadata != nil {
+		s.updateChunkOrderMetrics("out_of_order_received", session.metadata.Topic)
+	}
```
topic is taken directly from client-provided sync metadata and is used as a Prometheus label before the value is validated/mapped (topic validation currently happens later in processExpectedChunk). To keep metric cardinality bounded even under malformed/hostile requests, consider normalizing the label to a small known set (e.g., map via data.TopicMap and use the canonical bus.Topic string; otherwise use a constant like unknown).
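A minimal, self-contained sketch of the normalization this comment suggests. The `knownTopics` allow-list and topic names below are illustrative stand-ins for the real `data.TopicMap`, which is not reproduced here:

```go
package main

import "fmt"

// knownTopics stands in for the bounded topic set (e.g. data.TopicMap);
// the entries here are hypothetical, not BanyanDB's actual topic names.
var knownTopics = map[string]bool{
	"stream-write":  true,
	"measure-write": true,
}

// normalizeTopicLabel maps a client-supplied topic onto the bounded label
// set, falling back to a constant sentinel so metric cardinality stays
// bounded even under malformed or hostile requests.
func normalizeTopicLabel(topic string) string {
	if knownTopics[topic] {
		return topic
	}
	return "unknown"
}

func main() {
	fmt.Println(normalizeTopicLabel("measure-write")) // known topic passes through
	fmt.Println(normalizeTopicLabel("x-arbitrary"))   // anything else collapses into one series
}
```

The key property is that the label set size is fixed at `len(knownTopics) + 1` regardless of what clients send.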
I think this might not need to be changed, since the client is usually internal gRPC within the cluster and is considered trusted.
```diff
 case "chunk_buffered":
-	s.metrics.chunksBuffered.Inc(1, sessionID)
+	s.metrics.chunksBuffered.Inc(1, topic)
 case "buffer_timeout":
-	s.metrics.bufferTimeouts.Inc(1, sessionID)
+	s.metrics.bufferTimeouts.Inc(1, topic)
 case "gap_too_large":
```
The buffer_timeouts counter can be incremented via the buffer_timeout event, but there are no call sites that emit this event (and checkBufferTimeout is never invoked in production code). As a result, banyandb_queue_sub_buffer_timeouts will never be emitted/incremented; either wire buffer-timeout detection to call updateChunkOrderMetrics("buffer_timeout", topic) or remove the unused metric/event to avoid misleading operators.
This issue already existed before and was not introduced by this PR. I will address and improve it in a follow-up PR.
Codecov Report

❌ Patch coverage is …

```
@@            Coverage Diff             @@
##             main    #1093      +/-   ##
==========================================
+ Coverage   45.97%   51.11%   +5.13%
==========================================
  Files         328      417      +89
  Lines       55505    68040   +12535
==========================================
+ Hits        25520    34779    +9259
- Misses      27909    30294    +2385
- Partials     2076     2967     +891
```

Flags with carried forward coverage won't be shown.
…-banyandb into queue-obs-pr1
hanahmily left a comment
Pending review — 4 critical issues left as inline comments. Please address before merge.
```diff
 if currentSession.partCtx.Handler != nil {
 	if finishErr := currentSession.partCtx.Handler.FinishSync(); finishErr != nil {
-		s.updateChunkOrderMetrics("finish_sync_err", currentSession.sessionID)
+		if currentSession.metadata != nil {
```
Issue 1 — defensive metadata != nil guard is dead code that silently drops metrics if the invariant ever breaks.
This guard (and the four equivalents added at lines 269, 281, 292, 304 in processChunkWithReordering) is unreachable today: a syncSession is only constructed at line 173 inside the if req.GetMetadata() != nil branch (line 157), so currentSession.metadata is never nil after creation.
The risk is in the future: if a refactor ever lets metadata become nil, this guard silently swallows exactly the chunk-ordering metrics we use to detect bad sync state — no log line, no series, just a missing signal for finish_sync_err / gap_too_large / buffer_full.
Suggest one of:

- Drop the guard and rely on the documented invariant.
- Or, fall back to a sentinel topic (`"unknown"`) and emit `s.log.Warn(...)` so any breach surfaces.
```diff
 case "chunk_buffered":
-	s.metrics.chunksBuffered.Inc(1, sessionID)
+	s.metrics.chunksBuffered.Inc(1, topic)
 case "buffer_timeout":
```
Issue 2 — "buffer_timeout" event is wired here but never fired anywhere.
```shell
$ grep -rn '"buffer_timeout"' --include='*.go' .
banyand/queue/sub/server.go:457   # <- only this case statement
```
checkBufferTimeout (chunked_sync.go:434) is the natural producer but returns the error without recording the metric. Worse, checkBufferTimeout itself only has a test caller (chunked_sync_test.go:280); no production code path appears to invoke it.
Since this PR is exactly about cleaning up queue_sub chunk-ordering observability, please pick one:
- Wire it up: add `s.updateChunkOrderMetrics("buffer_timeout", session.metadata.Topic)` inside `checkBufferTimeout`, and invoke `checkBufferTimeout` from a real code path (e.g. `processChunkWithReordering` or a periodic janitor).
- Delete the dead code: remove the `bufferTimeouts` field, the `factory.NewCounter("buffer_timeouts", …)` registration, and this `case` branch.
Leaving it half-wired ships a buffer_timeouts metric to operators that is silently always zero.
```diff
+### Bug Fixes
+
+- Use `topic` instead of `session_id` as the Prometheus label on liaison `queue_sub` chunk-ordering counters to avoid unbounded metric cardinality.
```
Issue 3 — docs and Grafana panel were ripped out by the revert doc commit (1a7d987), but the PR description still claims they ship.
After upgrade, operators with dashboards or alerts grouped by session_id will silently see empty results, and the orphaned high-cardinality *_session_id series will linger in Prometheus storage until retention expires.
Either restore the observability.md and grafana-cluster.json updates, or extend this changelog entry with an explicit upgrade note (e.g. "operators must update dashboards keyed on session_id; existing per-session series will age out with retention").
I checked docs/operation/grafana-cluster.json and searched for session_id and those metric names (like out_of_order_chunks_received, chunks_buffered, buffer_timeouts, etc.), but couldn’t find anything.
In docs/operation/observability.md, there’s only one mention of banyandb_queue_sub_total_msg_sent_err (line 69). There’s no session_id, nor any of the chunk-ordering metrics introduced in this change.
That’s why I reverted the docs for now and plan to include the original metrics as well as the additional metrics introduced later in the final PR.
Issue 4 — no regression test added.
The PR template's Add a unit test to verify that the fix works checkbox is still unchecked. The existing fixtures in chunked_sync_test.go (e.g. test-session-1 at line 307, test-session-2 at line 364) already construct sessions with metadata.Topic set, so a small test that wires a mock counter into s.metrics, drives an out-of-order chunk, and asserts that the emitted label equals the topic (and that no per-session series is created) would be straightforward.
Without it, a future revert to req.SessionId would silently restore the cardinality blow-up this PR is meant to prevent.
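The requested test could look roughly like the sketch below. The `fakeCounter` type, the `recordOutOfOrder` helper, and the topic name are illustrative; they stand in for wiring a mock into `s.metrics` and driving a real out-of-order chunk:

```go
package main

import "fmt"

// fakeCounter records every label value it is incremented with, so a test
// can assert exactly which metric series a code path would create.
type fakeCounter struct {
	labels []string
}

func (c *fakeCounter) Inc(delta int, label string) {
	c.labels = append(c.labels, label)
}

// recordOutOfOrder mimics the fixed call site: the counter is labeled by
// the session's topic, never by its per-sync session ID.
func recordOutOfOrder(c *fakeCounter, topic string) {
	c.Inc(1, topic)
}

func main() {
	c := &fakeCounter{}
	recordOutOfOrder(c, "measure-write")

	// Exactly one series, keyed by topic; no per-session UUID ever appears.
	fmt.Println(len(c.labels), c.labels[0])
}
```

A real test would additionally assert that the session's UUID string appears nowhere in the registry output, which is what guards against a silent revert to `req.SessionId`.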
Use `topic` instead of `session_id` as the Prometheus label on liaison `queue_sub` chunk-ordering counters to avoid unbounded metric cardinality. Recorded in the `CHANGES` log.