
fix: Use topic instead of session_id as the Prometheus label #1093

Open
OmCheeLin wants to merge 11 commits into apache:main from OmCheeLin:queue-obs-pr1

Conversation

OmCheeLin (Contributor) commented Apr 26, 2026

Use topic instead of session_id as the Prometheus label on liaison queue_sub chunk-ordering counters to avoid unbounded metric cardinality.

session_id is a per-sync UUID, so labeling by it causes unbounded Prometheus cardinality. Use the sync metadata topic at each updateChunkOrderMetrics call site instead.

- observability.md: document the counters and example PromQL grouped by topic
- grafana-cluster.json: add a row and timeseries panel for ordering event rates by topic

Made-with: Cursor
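A minimal sketch of the label swap at one call site, assuming the helper and field names that appear in the review hunks below (updateChunkOrderMetrics, session.metadata.Topic); the exact change is in the diff threads further down.

// Sketch only: label chunk-ordering events by the bounded sync topic rather
// than the per-sync session UUID.
if req.ChunkIndex > buffer.expectedIndex {
	// before: s.updateChunkOrderMetrics("out_of_order_received", req.SessionId)
	s.updateChunkOrderMetrics("out_of_order_received", session.metadata.Topic)
}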
OmCheeLin marked this pull request as draft April 26, 2026 15:05
OmCheeLin marked this pull request as ready for review April 26, 2026 22:24
Copilot AI left a comment

Pull request overview

This PR addresses unbounded Prometheus metric cardinality for liaison queue_sub chunk-ordering counters by switching the label from per-sync session_id to bounded topic, and updates operator-facing docs/dashboards accordingly.

Changes:

  • Update queue_sub chunk-ordering counters to use the topic label instead of session_id.
  • Document the chunk-ordering metrics and provide example PromQL queries.
  • Add a Grafana dashboard row/panel to visualize chunk-ordering counters by topic, and record the change in CHANGES.md.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Summary per file:

| File | Description |
|------|-------------|
| banyand/queue/sub/server.go | Switch chunk-ordering counters to label by topic; update the metric increment helper accordingly. |
| banyand/queue/sub/chunked_sync.go | Emit chunk-ordering metric increments using the session metadata Topic. |
| docs/operation/observability.md | Add a new section describing queue_sub chunk-ordering counters and example PromQL. |
| docs/operation/grafana-cluster.json | Add a Grafana row/panel for chunk-ordering counters grouped by topic. |
| CHANGES.md | Add a bug-fix entry describing the label cardinality change. |


Comment thread docs/operation/observability.md Outdated
| Metric | Description |
|--------|-------------|
| `banyandb_queue_sub_out_of_order_chunks_received` | Out-of-order chunk observed (before buffer/gap handling). |
| `banyandb_queue_sub_chunks_buffered` | Chunk stored in reorder buffer. |
| `banyandb_queue_sub_buffer_timeouts` | Reorder buffer timed out waiting for missing chunks. |
Copilot AI commented Apr 27, 2026

This doc entry describes the banyandb_queue_sub_buffer_timeouts counter, but the current implementation never increments/emits that metric (no production call sites). Either implement the buffer-timeout metric emission or remove this line to avoid documenting a non-functional signal.

Suggested change
| `banyandb_queue_sub_buffer_timeouts` | Reorder buffer timed out waiting for missing chunks. |

Comment thread docs/operation/grafana-cluster.json Outdated
Comment on lines +2657 to +2669
"expr": "sum(rate(banyandb_queue_sub_buffer_timeouts{job=~\"$job\",pod=~\"$pod\"}[$__rate_interval])) by (topic)",
"hide": false,
"instant": false,
"legendFormat": "buffer_timeout {{topic}}",
"range": true,
"refId": "C"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
Copilot AI commented Apr 27, 2026

This panel queries banyandb_queue_sub_buffer_timeouts, but the current code path does not increment/emit that counter (no production call sites for the buffer_timeout event). Consider removing this target or implementing the missing metric emission so the panel is not permanently empty.

Suggested change
"expr": "sum(rate(banyandb_queue_sub_buffer_timeouts{job=~\"$job\",pod=~\"$pod\"}[$__rate_interval])) by (topic)",
"hide": false,
"instant": false,
"legendFormat": "buffer_timeout {{topic}}",
"range": true,
"refId": "C"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",

OmCheeLin (Contributor, Author):

I’ve reverted the doc-related changes. I’ll handle the documentation in the final PR.

Comment on lines 437 to +443
 // Chunk ordering metrics
-outOfOrderChunksReceived: factory.NewCounter("out_of_order_chunks_received", "session_id"),
-chunksBuffered: factory.NewCounter("chunks_buffered", "session_id"),
-bufferTimeouts: factory.NewCounter("buffer_timeouts", "session_id"),
-largeGapsRejected: factory.NewCounter("large_gaps_rejected", "session_id"),
-bufferCapacityExceeded: factory.NewCounter("buffer_capacity_exceeded", "session_id"),
-finishSyncErr: factory.NewCounter("finish_sync_err", "session_id"),
+outOfOrderChunksReceived: factory.NewCounter("out_of_order_chunks_received", "topic"),
+chunksBuffered: factory.NewCounter("chunks_buffered", "topic"),
+bufferTimeouts: factory.NewCounter("buffer_timeouts", "topic"),
+largeGapsRejected: factory.NewCounter("large_gaps_rejected", "topic"),
+bufferCapacityExceeded: factory.NewCounter("buffer_capacity_exceeded", "topic"),
+finishSyncErr: factory.NewCounter("finish_sync_err", "topic"),
Copilot AI commented Apr 27, 2026

No unit test was added to verify that chunk-ordering counters are labeled by topic (and that session_id is not used), which is the core regression this PR is addressing. Please add a unit test that asserts the emitted metric series contain the topic label only, ideally by recording a chunk-ordering event and inspecting the metrics registry output.

Comment thread banyand/queue/sub/chunked_sync.go Outdated
Comment on lines +269 to +273
 if req.ChunkIndex > buffer.expectedIndex {
 	gap := req.ChunkIndex - buffer.expectedIndex
-	s.updateChunkOrderMetrics("out_of_order_received", req.SessionId)
+	if session.metadata != nil {
+		s.updateChunkOrderMetrics("out_of_order_received", session.metadata.Topic)
+	}
Copilot AI commented Apr 27, 2026

topic is taken directly from client-provided sync metadata and is used as a Prometheus label before the value is validated/mapped (topic validation currently happens later in processExpectedChunk). To keep metric cardinality bounded even under malformed/hostile requests, consider normalizing the label to a small known set (e.g., map via data.TopicMap and use the canonical bus.Topic string; otherwise use a constant like unknown).
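A minimal sketch of that normalization, assuming data.TopicMap (referenced above) is keyed by the canonical topic name; the helper name is illustrative.

// normalizeTopicLabel is a hypothetical helper: it keeps the Prometheus label
// within the known topic set so a malformed request cannot mint new series.
// Assumes data.TopicMap is keyed by the canonical topic string.
func normalizeTopicLabel(raw string) string {
	if _, ok := data.TopicMap[raw]; ok {
		return raw
	}
	return "unknown"
}

Call sites would then pass normalizeTopicLabel(session.metadata.Topic) instead of the raw metadata value.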

OmCheeLin (Contributor, Author):

I think this might not need to be changed, since the client is usually internal gRPC within the cluster and is considered trusted.

Comment on lines 455 to 459
case "chunk_buffered":
s.metrics.chunksBuffered.Inc(1, sessionID)
s.metrics.chunksBuffered.Inc(1, topic)
case "buffer_timeout":
s.metrics.bufferTimeouts.Inc(1, sessionID)
s.metrics.bufferTimeouts.Inc(1, topic)
case "gap_too_large":
Copilot AI commented Apr 27, 2026

The buffer_timeouts counter can be incremented via the buffer_timeout event, but there are no call sites that emit this event (and checkBufferTimeout is never invoked in production code). As a result, banyandb_queue_sub_buffer_timeouts will never be emitted/incremented; either wire buffer-timeout detection to call updateChunkOrderMetrics("buffer_timeout", topic) or remove the unused metric/event to avoid misleading operators.

OmCheeLin (Contributor, Author):

This issue already existed before and was not introduced by this PR. I will address and improve it in a follow-up PR.

codecov-commenter commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 39.13043% with 14 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.11%. Comparing base (3530dd9) to head (4628f3b).
⚠️ Report is 223 commits behind head on main.

Files with missing lines Patch % Lines
banyand/queue/sub/server.go 7.69% 12 Missing ⚠️
banyand/queue/sub/chunked_sync.go 80.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1093      +/-   ##
==========================================
+ Coverage   45.97%   51.11%   +5.13%     
==========================================
  Files         328      417      +89     
  Lines       55505    68040   +12535     
==========================================
+ Hits        25520    34779    +9259     
- Misses      27909    30294    +2385     
- Partials     2076     2967     +891     
Flag Coverage Δ
banyand 52.47% <39.13%> (?)
bydbctl 82.35% <ø> (?)
fodc 70.47% <ø> (?)
integration-distributed 98.40% <ø> (?)
integration-standalone 97.99% <ø> (?)
pkg 31.03% <ø> (?)

Flags with carried forward coverage won't be shown.


hanahmily (Contributor) left a comment

Pending review — 4 critical issues left as inline comments. Please address before merge.

Comment thread banyand/queue/sub/chunked_sync.go Outdated
 if currentSession.partCtx.Handler != nil {
 	if finishErr := currentSession.partCtx.Handler.FinishSync(); finishErr != nil {
-		s.updateChunkOrderMetrics("finish_sync_err", currentSession.sessionID)
+		if currentSession.metadata != nil {
hanahmily (Contributor) commented:

Issue 1 — defensive metadata != nil guard is dead code that silently drops metrics if the invariant ever breaks.

This guard (and the four equivalents added at lines 269, 281, 292, 304 in processChunkWithReordering) is unreachable today: a syncSession is only constructed at line 173 inside the if req.GetMetadata() != nil branch (line 157), so currentSession.metadata is never nil after creation.

The risk is in the future: if a refactor ever lets metadata become nil, this guard silently swallows exactly the chunk-ordering metrics we use to detect bad sync state — no log line, no series, just a missing signal for finish_sync_err / gap_too_large / buffer_full.

Suggest one of:

  • Drop the guard and rely on the documented invariant.
  • Or, fall back to a sentinel topic ("unknown") and emit s.log.Warn(...) so any breach surfaces; a sketch of this option follows below.
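A minimal sketch of the second option, reusing names from this file (currentSession, updateChunkOrderMetrics) and a zerolog-style warning; treat the logging call as illustrative of whatever logger server.go actually holds.

// Sketch: never drop the metric; fall back to a sentinel label and warn so a
// broken invariant stays visible instead of silently missing.
topic := "unknown"
if currentSession.metadata != nil {
	topic = currentSession.metadata.Topic
} else {
	s.log.Warn().Str("session", currentSession.sessionID).
		Msg("sync session has nil metadata; labeling chunk-ordering metric as unknown")
}
s.updateChunkOrderMetrics("finish_sync_err", topic)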

OmCheeLin (Contributor, Author):

done

case "chunk_buffered":
s.metrics.chunksBuffered.Inc(1, sessionID)
s.metrics.chunksBuffered.Inc(1, topic)
case "buffer_timeout":
hanahmily (Contributor) commented:

Issue 2 — "buffer_timeout" event is wired here but never fired anywhere.

$ grep -rn '"buffer_timeout"' --include='*.go' .
banyand/queue/sub/server.go:457   ← only this case statement

checkBufferTimeout (chunked_sync.go:434) is the natural producer but returns the error without recording the metric. Worse, checkBufferTimeout itself only has a test caller (chunked_sync_test.go:280); no production code path appears to invoke it.

Since this PR is exactly about cleaning up queue_sub chunk-ordering observability, please pick one:

  • Wire it up: add s.updateChunkOrderMetrics("buffer_timeout", session.metadata.Topic) inside checkBufferTimeout, and invoke checkBufferTimeout from a real code path (e.g. processChunkWithReordering or a periodic janitor); a sketch follows after this comment.
  • Delete the dead code: remove the bufferTimeouts field, the factory.NewCounter("buffer_timeouts", …) registration, and this case branch.

Leaving it half-wired ships a buffer_timeouts metric to operators that is silently always zero.
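A minimal sketch of the first option, assuming checkBufferTimeout can see the session and its buffer state; the receiver, parameters, and field names (lastChunkTime, chunkBufferTimeout) are illustrative rather than the actual signature in chunked_sync.go.

// Sketch: record the event before returning the error, so
// banyandb_queue_sub_buffer_timeouts actually moves when a reorder buffer gives up.
func (s *server) checkBufferTimeout(session *syncSession, buf *chunkBuffer) error {
	if time.Since(buf.lastChunkTime) <= chunkBufferTimeout {
		return nil
	}
	topic := "unknown"
	if session.metadata != nil {
		topic = session.metadata.Topic
	}
	s.updateChunkOrderMetrics("buffer_timeout", topic)
	return fmt.Errorf("timed out waiting for chunk %d", buf.expectedIndex)
}

The call would still need to come from a real code path (processChunkWithReordering or a periodic janitor), as suggested above.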

OmCheeLin (Contributor, Author):

done

Comment thread CHANGES.md

### Bug Fixes

- Use `topic` instead of `session_id` as the Prometheus label on liaison `queue_sub` chunk-ordering counters to avoid unbounded metric cardinality.
hanahmily (Contributor) commented:

Issue 3 — docs and Grafana panel were ripped out by the revert doc commit (1a7d987), but the PR description still claims they ship.

After upgrade, operators with dashboards or alerts grouped by session_id will silently see empty results, and the orphaned high-cardinality *_session_id series will linger in Prometheus storage until retention expires.

Either restore the observability.md and grafana-cluster.json updates, or extend this changelog entry with an explicit upgrade note (e.g. "operators must update dashboards keyed on session_id; existing per-session series will age out with retention").
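One possible wording for the extended entry (illustrative only):

- Use `topic` instead of `session_id` as the Prometheus label on liaison `queue_sub` chunk-ordering counters to avoid unbounded metric cardinality. Operators with dashboards or alerts keyed on `session_id` must regroup them by `topic`; existing per-session series will age out with Prometheus retention.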

OmCheeLin (Contributor, Author) commented Apr 27, 2026:

I checked docs/operation/grafana-cluster.json and searched for session_id and those metric names (like out_of_order_chunks_received, chunks_buffered, buffer_timeouts, etc.), but couldn’t find anything.

In docs/operation/observability.md, there’s only one mention of banyandb_queue_sub_total_msg_sent_err (line 69). There’s no session_id, nor any of the chunk-ordering metrics introduced in this change.

That's why I reverted the docs for now; the final PR will document both the original metrics and the additional ones introduced in this change.

Comment thread CHANGES.md

### Bug Fixes

- Use `topic` instead of `session_id` as the Prometheus label on liaison `queue_sub` chunk-ordering counters to avoid unbounded metric cardinality.
hanahmily (Contributor) commented:

Issue 4 — no regression test added.

The PR template's "Add a unit test to verify that the fix works" checkbox is still unchecked. The existing fixtures in chunked_sync_test.go (e.g. test-session-1 at line 307, test-session-2 at line 364) already construct sessions with metadata.Topic set, so a small test that wires a mock counter into s.metrics, drives an out-of-order chunk, and asserts that the emitted label equals the topic (and that no per-session series is created) would be straightforward; a rough sketch follows below.

Without it, a future revert to req.SessionId would silently restore the cardinality blow-up this PR is meant to prevent.
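A rough shape for that test, intended to live in chunked_sync_test.go next to the fixtures mentioned above; it assumes the counter fields in s.metrics are interfaces a small recording stub can satisfy, and the fixture helper and topic value are illustrative, not existing code.

// recordingCounter is a hypothetical stub standing in for the meter-backed
// counter; it only captures the label values it is handed.
type recordingCounter struct{ labels []string }

func (c *recordingCounter) Inc(delta float64, labelValues ...string) {
	c.labels = append(c.labels, labelValues...)
}

func TestChunkOrderMetricsLabeledByTopic(t *testing.T) {
	rec := &recordingCounter{}
	s := newTestServer(t) // hypothetical fixture builder for the sub server
	s.metrics.outOfOrderChunksReceived = rec

	// Exercising the metric helper directly keeps the sketch small; a fuller
	// test would push an out-of-order chunk through processChunkWithReordering.
	s.updateChunkOrderMetrics("out_of_order_received", "stream-part-sync")

	if len(rec.labels) != 1 || rec.labels[0] != "stream-part-sync" {
		t.Fatalf("expected a single topic label, got %v", rec.labels)
	}
}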

OmCheeLin (Contributor, Author):

done

OmCheeLin marked this pull request as draft April 27, 2026 12:22
OmCheeLin marked this pull request as ready for review April 27, 2026 14:07