fix: Use topic instead of session_id as the Prometheus label #1093

OmCheeLin wants to merge 11 commits into apache:main

Conversation
session_id is a per-sync UUID and causes unbounded Prometheus cardinality. Use the sync metadata topic at each updateChunkOrderMetrics call site. Made-with: Cursor
- observability.md: document counters and PromQL grouped by topic
- grafana-cluster.json: row + timeseries for ordering event rates by topic

Made-with: Cursor
Pull request overview
This PR addresses unbounded Prometheus metric cardinality for liaison queue_sub chunk-ordering counters by switching the label from per-sync session_id to bounded topic, and updates operator-facing docs/dashboards accordingly.
Changes:
- Update `queue_sub` chunk-ordering counters to use the `topic` label instead of `session_id`.
- Document the chunk-ordering metrics and provide example PromQL queries.
- Add a Grafana dashboard row/panel to visualize chunk-ordering counters by `topic`, and record the change in `CHANGES.md`.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| banyand/queue/sub/server.go | Switch chunk-ordering counters to label by topic; update metric increment helper accordingly. |
| banyand/queue/sub/chunked_sync.go | Emit chunk-ordering metric increments using the session metadata Topic. |
| docs/operation/observability.md | Add a new section describing queue_sub chunk-ordering counters and example PromQL. |
| docs/operation/grafana-cluster.json | Add a Grafana row/panel for chunk-ordering counters grouped by topic. |
| CHANGES.md | Add a bug-fix entry describing the label cardinality change. |
```
|--------|---------|
| `banyandb_queue_sub_out_of_order_chunks_received` | Out-of-order chunk observed (before buffer/gap handling). |
| `banyandb_queue_sub_chunks_buffered` | Chunk stored in reorder buffer. |
| `banyandb_queue_sub_buffer_timeouts` | Reorder buffer timed out waiting for missing chunks. |
```
This doc entry describes the banyandb_queue_sub_buffer_timeouts counter, but the current implementation never increments/emits that metric (no production call sites). Either implement the buffer-timeout metric emission or remove this line to avoid documenting a non-functional signal.
```json
"expr": "sum(rate(banyandb_queue_sub_buffer_timeouts{job=~\"$job\",pod=~\"$pod\"}[$__rate_interval])) by (topic)",
"hide": false,
"instant": false,
"legendFormat": "buffer_timeout {{topic}}",
"range": true,
"refId": "C"
},
{
  "datasource": {
    "type": "prometheus",
    "uid": "${DS_PROMETHEUS}"
  },
  "editorMode": "code",
```
This panel queries banyandb_queue_sub_buffer_timeouts, but the current code path does not increment/emit that counter (no production call sites for the buffer_timeout event). Consider removing this target or implementing the missing metric emission so the panel is not permanently empty.
I’ve reverted the doc-related changes. I’ll handle the documentation in the final PR.
```diff
 // Chunk ordering metrics
-outOfOrderChunksReceived: factory.NewCounter("out_of_order_chunks_received", "session_id"),
-chunksBuffered:           factory.NewCounter("chunks_buffered", "session_id"),
-bufferTimeouts:           factory.NewCounter("buffer_timeouts", "session_id"),
-largeGapsRejected:        factory.NewCounter("large_gaps_rejected", "session_id"),
-bufferCapacityExceeded:   factory.NewCounter("buffer_capacity_exceeded", "session_id"),
-finishSyncErr:            factory.NewCounter("finish_sync_err", "session_id"),
+outOfOrderChunksReceived: factory.NewCounter("out_of_order_chunks_received", "topic"),
+chunksBuffered:           factory.NewCounter("chunks_buffered", "topic"),
+bufferTimeouts:           factory.NewCounter("buffer_timeouts", "topic"),
+largeGapsRejected:        factory.NewCounter("large_gaps_rejected", "topic"),
+bufferCapacityExceeded:   factory.NewCounter("buffer_capacity_exceeded", "topic"),
+finishSyncErr:            factory.NewCounter("finish_sync_err", "topic"),
```
No unit test was added to verify that chunk-ordering counters are labeled by topic (and that session_id is not used), which is the core regression this PR is addressing. Please add a unit test that asserts the emitted metric series contain the topic label only, ideally by recording a chunk-ordering event and inspecting the metrics registry output.
```diff
 if req.ChunkIndex > buffer.expectedIndex {
 	gap := req.ChunkIndex - buffer.expectedIndex
-	s.updateChunkOrderMetrics("out_of_order_received", req.SessionId)
+	if session.metadata != nil {
+		s.updateChunkOrderMetrics("out_of_order_received", session.metadata.Topic)
+	}
```
topic is taken directly from client-provided sync metadata and is used as a Prometheus label before the value is validated/mapped (topic validation currently happens later in processExpectedChunk). To keep metric cardinality bounded even under malformed/hostile requests, consider normalizing the label to a small known set (e.g., map via data.TopicMap and use the canonical bus.Topic string; otherwise use a constant like unknown).
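A minimal, self-contained sketch of the normalization this comment suggests. The `knownTopics` allow-list and topic names below are illustrative stand-ins for the real `data.TopicMap`, which is not reproduced here:

```go
package main

import "fmt"

// knownTopics stands in for the bounded topic set (e.g. data.TopicMap);
// the entries here are hypothetical, not BanyanDB's actual topic names.
var knownTopics = map[string]bool{
	"stream-write":  true,
	"measure-write": true,
}

// normalizeTopicLabel maps a client-supplied topic onto the bounded label
// set, falling back to a constant sentinel so metric cardinality stays
// bounded even under malformed or hostile requests.
func normalizeTopicLabel(topic string) string {
	if knownTopics[topic] {
		return topic
	}
	return "unknown"
}

func main() {
	fmt.Println(normalizeTopicLabel("measure-write")) // known topic passes through
	fmt.Println(normalizeTopicLabel("x-arbitrary"))   // anything else collapses into one series
}
```

The key property is that the label set size is fixed at `len(knownTopics) + 1` regardless of what clients send.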
I think this might not need to be changed, since the client is usually internal gRPC within the cluster and is considered trusted.
```diff
 case "chunk_buffered":
-	s.metrics.chunksBuffered.Inc(1, sessionID)
+	s.metrics.chunksBuffered.Inc(1, topic)
 case "buffer_timeout":
-	s.metrics.bufferTimeouts.Inc(1, sessionID)
+	s.metrics.bufferTimeouts.Inc(1, topic)
 case "gap_too_large":
```
The buffer_timeouts counter can be incremented via the buffer_timeout event, but there are no call sites that emit this event (and checkBufferTimeout is never invoked in production code). As a result, banyandb_queue_sub_buffer_timeouts will never be emitted/incremented; either wire buffer-timeout detection to call updateChunkOrderMetrics("buffer_timeout", topic) or remove the unused metric/event to avoid misleading operators.
This issue already existed before and was not introduced by this PR. I will address and improve it in a follow-up PR.
Codecov Report

❌ Patch coverage is …

```
@@            Coverage Diff             @@
##             main    #1093      +/-   ##
==========================================
+ Coverage   45.97%   51.11%   +5.13%
==========================================
  Files         328      417      +89
  Lines       55505    68040   +12535
==========================================
+ Hits        25520    34779    +9259
- Misses      27909    30294    +2385
- Partials     2076     2967     +891
```

Flags with carried forward coverage won't be shown.
…-banyandb into queue-obs-pr1
hanahmily left a comment
Pending review — 4 critical issues left as inline comments. Please address before merge.
```diff
 if currentSession.partCtx.Handler != nil {
 	if finishErr := currentSession.partCtx.Handler.FinishSync(); finishErr != nil {
-		s.updateChunkOrderMetrics("finish_sync_err", currentSession.sessionID)
+		if currentSession.metadata != nil {
```
Issue 1 — defensive metadata != nil guard is dead code that silently drops metrics if the invariant ever breaks.
This guard (and the four equivalents added at lines 269, 281, 292, 304 in processChunkWithReordering) is unreachable today: a syncSession is only constructed at line 173 inside the if req.GetMetadata() != nil branch (line 157), so currentSession.metadata is never nil after creation.
The risk is in the future: if a refactor ever lets metadata become nil, this guard silently swallows exactly the chunk-ordering metrics we use to detect bad sync state — no log line, no series, just a missing signal for finish_sync_err / gap_too_large / buffer_full.
Suggest one of:

- Drop the guard and rely on the documented invariant.
- Or, fall back to a sentinel topic (`"unknown"`) and emit `s.log.Warn(...)` so any breach surfaces.
```diff
 case "chunk_buffered":
-	s.metrics.chunksBuffered.Inc(1, sessionID)
+	s.metrics.chunksBuffered.Inc(1, topic)
 case "buffer_timeout":
```
Issue 2 — "buffer_timeout" event is wired here but never fired anywhere.
```shell
$ grep -rn '"buffer_timeout"' --include='*.go' .
banyand/queue/sub/server.go:457   # <- only this case statement
```
checkBufferTimeout (chunked_sync.go:434) is the natural producer but returns the error without recording the metric. Worse, checkBufferTimeout itself only has a test caller (chunked_sync_test.go:280); no production code path appears to invoke it.
Since this PR is exactly about cleaning up queue_sub chunk-ordering observability, please pick one:
- Wire it up: add `s.updateChunkOrderMetrics("buffer_timeout", session.metadata.Topic)` inside `checkBufferTimeout`, and invoke `checkBufferTimeout` from a real code path (e.g. `processChunkWithReordering` or a periodic janitor).
- Delete the dead code: remove the `bufferTimeouts` field, the `factory.NewCounter("buffer_timeouts", …)` registration, and this `case` branch.
Leaving it half-wired ships a buffer_timeouts metric to operators that is silently always zero.
```diff
+### Bug Fixes
+
+- Use `topic` instead of `session_id` as the Prometheus label on liaison `queue_sub` chunk-ordering counters to avoid unbounded metric cardinality.
```
Issue 3 — docs and Grafana panel were ripped out by the revert doc commit (1a7d987), but the PR description still claims they ship.
After upgrade, operators with dashboards or alerts grouped by session_id will silently see empty results, and the orphaned high-cardinality *_session_id series will linger in Prometheus storage until retention expires.
Either restore the observability.md and grafana-cluster.json updates, or extend this changelog entry with an explicit upgrade note (e.g. "operators must update dashboards keyed on session_id; existing per-session series will age out with retention").
I checked docs/operation/grafana-cluster.json and searched for session_id and those metric names (like out_of_order_chunks_received, chunks_buffered, buffer_timeouts, etc.), but couldn’t find anything.
In docs/operation/observability.md, there’s only one mention of banyandb_queue_sub_total_msg_sent_err (line 69). There’s no session_id, nor any of the chunk-ordering metrics introduced in this change.
That’s why I reverted the docs for now and plan to include the original metrics as well as the additional metrics introduced later in the final PR.
Issue 4 — no regression test added.
The PR template's Add a unit test to verify that the fix works checkbox is still unchecked. The existing fixtures in chunked_sync_test.go (e.g. test-session-1 at line 307, test-session-2 at line 364) already construct sessions with metadata.Topic set, so a small test that wires a mock counter into s.metrics, drives an out-of-order chunk, and asserts that the emitted label equals the topic (and that no per-session series is created) would be straightforward.
Without it, a future revert to req.SessionId would silently restore the cardinality blow-up this PR is meant to prevent.
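The requested test could look roughly like the sketch below. The `fakeCounter` type, the `recordOutOfOrder` helper, and the topic name are illustrative; they stand in for wiring a mock into `s.metrics` and driving a real out-of-order chunk:

```go
package main

import "fmt"

// fakeCounter records every label value it is incremented with, so a test
// can assert exactly which metric series a code path would create.
type fakeCounter struct {
	labels []string
}

func (c *fakeCounter) Inc(delta int, label string) {
	c.labels = append(c.labels, label)
}

// recordOutOfOrder mimics the fixed call site: the counter is labeled by
// the session's topic, never by its per-sync session ID.
func recordOutOfOrder(c *fakeCounter, topic string) {
	c.Inc(1, topic)
}

func main() {
	c := &fakeCounter{}
	recordOutOfOrder(c, "measure-write")

	// Exactly one series, keyed by topic; no per-session UUID ever appears.
	fmt.Println(len(c.labels), c.labels[0])
}
```

A real test would additionally assert that the session's UUID string appears nowhere in the registry output, which is what guards against a silent revert to `req.SessionId`.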
Use `topic` instead of `session_id` as the Prometheus label on liaison `queue_sub` chunk-ordering counters to avoid unbounded metric cardinality. Recorded in the `CHANGES` log.