feat: add queue latency instrumentation to principal #852

drewbailey wants to merge 7 commits into argoproj-labs:main from
Conversation
📝 Walkthrough

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Producer as Producer\n(Client)
    participant Queue as Queue\n(internal/queue)
    participant Writer as EventWriter\n(internal/event)
    participant Principal as Principal\n(eventstream)
    participant Metrics as Prometheus\n(internal/metrics)

    Producer->>Queue: Enqueue Event
    Queue->>Queue: SetEnqueuedAt(event)
    Queue->>Writer: Add(event)
    Writer->>Writer: writerAddedAt = now()
    Writer->>Principal: send event (Send/Retry)
    Principal->>Metrics: ObserveEventWriterDwell(time since writerAddedAt)
    Principal->>Producer: Deliver Event
    Producer->>Principal: ACK
    Principal->>Metrics: ObserveAckRoundtrip(time since SentAt)
    Principal->>Writer: Remove ACK / update writer state
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

✅ Passed checks (2 passed)

⚠️ Warning: Review ran into problems. 🔥 Problems: Git: Failed to clone repository.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
principal/apis/eventstream/eventstream.go (1)
385-392: ⚠️ Potential issue | 🟠 Major

`SetMetrics` is only called for new EventWriters, leaving reconnected agents without `EventWriterDwell` observations.

When an agent reconnects, the existing EventWriter is reused via `UpdateTarget()` (line 387), but `SetMetrics()` is not called on this path. Since `principalMetrics` is initialized to `nil` in `NewEventWriter()`, reconnected agents will silently skip `EventWriterDwell` observations.

🐛 Proposed fix

```diff
 eventWriter := s.eventWriters.Get(c.agentName)
 if eventWriter != nil {
 	eventWriter.UpdateTarget(subs)
+	eventWriter.SetMetrics(s.metrics)
 } else {
 	eventWriter = event.NewEventWriter(subs)
 	eventWriter.SetMetrics(s.metrics)
 	s.eventWriters.Add(c.agentName, eventWriter)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@principal/apis/eventstream/eventstream.go` around lines 385 - 392, The existing EventWriter retrieved by s.eventWriters.Get(c.agentName) isn't having SetMetrics called, so reconnected agents miss EventWriterDwell observations; after retrieving eventWriter (before calling UpdateTarget) call eventWriter.SetMetrics(s.metrics) so both new (created via event.NewEventWriter) and reused writers get the metrics set (NewEventWriter currently initializes principal metrics to nil, so ensure SetMetrics is invoked on the reuse path as well).
🧹 Nitpick comments (1)
internal/event/event_writer.go (1)
501-521: Coalesced events retain the original `writerAddedAt`, which may not be the intended behavior.

When events of the same type are coalesced in `eventQueue.add()`, the tail event's `event`, `backoff`, and `retryAfter` are updated, but `writerAddedAt` is not. This means the `EventWriterDwell` metric will measure the time from when the first event of that type was added, not the latest.

If the intent is to measure dwell time from when the current event data was added, consider also updating `writerAddedAt`:

♻️ Optional fix to update writerAddedAt on coalesce

```diff
 if ev.event.Type() == tail.event.Type() {
 	tail.event = ev.event
 	tail.backoff = ev.backoff
 	tail.retryAfter = ev.retryAfter
+	tail.writerAddedAt = ev.writerAddedAt
 	tail.mu.Unlock()
 	return
 }
```

However, if the current behavior is intentional (measuring total wait time for a resource regardless of updates), please add a comment clarifying this design choice.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event/event_writer.go` around lines 501 - 521, In eventQueue.add() the coalescing branch updates tail.event, tail.backoff and tail.retryAfter but not tail.writerAddedAt, causing EventWriterDwell to measure from the original enqueue time; to fix update the tail's writerAddedAt when coalescing (e.g. set tail.writerAddedAt = ev.writerAddedAt or the current time) inside the if ev.event.Type() == tail.event.Type() block before unlocking, or if the original semantics are intentional add a clarifying comment in eventQueue.add() explaining that dwell should measure from the first enqueue rather than the most recent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@principal/apis/eventstream/eventstream.go`:
- Around line 385-392: The existing EventWriter retrieved by
s.eventWriters.Get(c.agentName) isn't having SetMetrics called, so reconnected
agents miss EventWriterDwell observations; after retrieving eventWriter (before
calling UpdateTarget) call eventWriter.SetMetrics(s.metrics) so both new
(created via event.NewEventWriter) and reused writers get the metrics set
(NewEventWriter currently initializes principal metrics to nil, so ensure
SetMetrics is invoked on the reuse path as well).
---
Nitpick comments:
In `@internal/event/event_writer.go`:
- Around line 501-521: In eventQueue.add() the coalescing branch updates
tail.event, tail.backoff and tail.retryAfter but not tail.writerAddedAt, causing
EventWriterDwell to measure from the original enqueue time; to fix update the
tail's writerAddedAt when coalescing (e.g. set tail.writerAddedAt =
ev.writerAddedAt or the current time) inside the if ev.event.Type() ==
tail.event.Type() block before unlocking, or if the original semantics are
intentional add a clarifying comment in eventQueue.add() explaining that dwell
should measure from the first enqueue rather than the most recent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a49d5687-adbf-4e07-953b-a5ec2b4d2a43
📒 Files selected for processing (5)
- internal/event/event.go
- internal/event/event_writer.go
- internal/metrics/metrics.go
- internal/queue/queue.go
- principal/apis/eventstream/eventstream.go
Would it make sense to have these metrics for the agent too, in case it is in autonomous mode?
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@agent/connection.go`:
- Around line 149-156: The ack round-trip is inflated because event.SentAt()
reflects only the first-send timestamp (stamped in
internal/event/event_writer.go) and retries reuse the same CloudEvent; update
the send path so the SentAt header is refreshed on each actual send attempt
(e.g., modify retrySentEvent or the sending routine in event_writer.go to
set/overwrite SentAt with time.Now() before each transmit) so
event.SentAt(rawEvent) in connection.go yields the most recent wire-send time
and the metrics.ObserveAckRoundtrip measurement is correct.
In `@docs/operations/metrics.md`:
- Around line 16-29: The docs table for principal metrics is missing the
repository and GPG key metric series; update the Markdown table to include the
registered metrics from internal/metrics/metrics.go such as
principal_repositories_created, principal_repositories_updated,
principal_repositories_deleted (counter) and principal_gpg_keys_created,
principal_gpg_keys_updated, principal_gpg_keys_deleted (counter) with
appropriate Type, Labels (if any), and short descriptions mirroring the other
rows so the reference matches the actual metric registrations (e.g., align with
principal_applications_* and principal_app_projects_* entries).
- Around line 35-39: The fenced diagram block containing "principal send queue
→ event writer → wire send → agent → ACK received" and the labels
"SendQueueDwell", "EventWriterDwell", "AckRoundtrip" should include a language
hint to satisfy MD040; update the triple-backtick fence to specify "text" (i.e.,
```text) so the pipeline diagram fence is labeled as plain text.
In `@internal/event/event_writer.go`:
- Around line 439-445: The dwell histogram (using
ew.outboundMetrics.ObserveEventWriterDwell with eventMsg.writerAddedAt) is being
recorded before the send, which under-reports stalls if target.Send(...) fails;
move the ObserveEventWriterDwell call so it runs only after target.Send(...)
returns successfully (i.e., after the send is accepted and after any code that
adds the event to sentEvents), preserving the existing isFireAndForget guard and
the check for eventMsg.writerAddedAt.IsZero() and nil ew.outboundMetrics; ensure
you still hold or snapshot ew.outboundMetrics as done now to avoid races.
- Around line 163-167: When adding a new eventMessage in eventQueue.add (eq.add)
the writerAddedAt timestamp is left as the older value when the queue node is
coalesced; update eventQueue.add so that when it replaces the tail's
event/backoff/retryAfter it also updates the node's writerAddedAt to the current
time (the same now used when creating the new eventMessage). Locate the code
that swaps the existing node's event/backoff/retryAfter (references:
eventQueue.add, eq.add, eventMessage.writerAddedAt) and assign writerAddedAt =
now whenever the node is updated so *_event_writer_dwell_seconds reflects the
coalesced event's actual enqueue time.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: dba72327-c261-4098-9714-e7c0f3cfc3c5
📒 Files selected for processing (7)
- agent/connection.go
- agent/connection_test.go
- docs/operations/metrics.md
- internal/event/event.go
- internal/event/event_writer.go
- internal/event/event_writer_test.go
- internal/metrics/metrics.go
✅ Files skipped from review due to trivial changes (2)
- agent/connection_test.go
- internal/event/event.go
Actionable comments posted: 1
🧹 Nitpick comments (1)
internal/event/event_writer.go (1)
79-80: Clarify `writerAddedAt` semantics.

This field is refreshed during tail coalescing at Line 532, so “immutable once set” is misleading now. Reword the comment to describe the enqueue time of the currently queued event instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event/event_writer.go` around lines 79 - 80, The comment for the EventWriter field writerAddedAt is misleading because the value is updated during tail coalescing; update the comment to say it records the enqueue time of the currently queued event (i.e., when the event was added to the writer's queue and may be refreshed during tail coalescing) rather than claiming it is immutable once set. Reference the EventWriter struct and the writerAddedAt field in the comment so readers know this timestamp represents the current queued event's enqueue time.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/event/event_writer.go`:
- Line 352: The histogram misses long stalls when an event's initial Send fails
because dwell is only observed in the retry-completion path; update the path
that marks an event as sent (SetSentAt called for sentMsg.event) to also emit
ObserveEventWriterDwell so first-success-after-retry is recorded. Specifically,
when SetSentAt(sentMsg.event) is executed for an entry still tracked in
sentEvents, compute the dwell as now minus the event's original
enqueue/first-attempt timestamp (use the same timestamp used elsewhere to
measure dwell) and call ObserveEventWriterDwell with the same labels used in the
retry-completion code; ensure you use sentMsg (and sentMsg.event) and the same
label set so metrics remain consistent.
---
Nitpick comments:
In `@internal/event/event_writer.go`:
- Around line 79-80: The comment for the EventWriter field writerAddedAt is
misleading because the value is updated during tail coalescing; update the
comment to say it records the enqueue time of the currently queued event (i.e.,
when the event was added to the writer's queue and may be refreshed during tail
coalescing) rather than claiming it is immutable once set. Reference the
EventWriter struct and the writerAddedAt field in the comment so readers know
this timestamp represents the current queued event's enqueue time.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 55b69cf2-9bd8-4b46-a698-958031ff3e08
📒 Files selected for processing (5)
- docs/operations/metrics.md
- go.mod
- internal/event/event_writer.go
- internal/event/event_writer_test.go
- principal/apis/eventstream/eventstream.go
🚧 Files skipped from review as they are similar to previous changes (2)
- principal/apis/eventstream/eventstream.go
- internal/event/event_writer_test.go
🧹 Nitpick comments (1)
agent/connection_test.go (1)
87-87: Isolate metrics instance per subtest to reduce state coupling.

Line 87 shares one `AgentMetrics` across both subtests. It works now, but it makes cases order-coupled and less future-proof if subtests become parallelized.

♻️ Suggested test-isolation refactor

```diff
 func TestAgentHopMetricsObservations(t *testing.T) {
-	agentMetrics := metrics.NewAgentMetrics()
-
 	t.Run("sender observes send queue dwell", func(t *testing.T) {
+		agentMetrics := metrics.NewAgentMetrics()
 		a, _ := newAgent(t)
 		a.metrics = agentMetrics
 		a.emitter = event.NewEventSource("test")
@@
 		before := histogramSampleCount(t, a.metrics.SendQueueDwell, event.TargetApplication.String())
@@
 	t.Run("receiver observes ack roundtrip by original resource type", func(t *testing.T) {
+		agentMetrics := metrics.NewAgentMetrics()
 		a, _ := newAgent(t)
 		a.metrics = agentMetrics
 		a.emitter = event.NewEventSource("test")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@agent/connection_test.go` at line 87, Shared AgentMetrics instance causes state coupling between subtests; instantiate a fresh metrics object inside each subtest to isolate state. Move the call to metrics.NewAgentMetrics() from the outer scope into each t.Run subtest body (or create a local agentMetrics variable at the start of each subtest) so tests don't share mutable state and can be parallelized safely; update any references in those subtests to use the local agentMetrics variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@agent/connection_test.go`:
- Line 87: Shared AgentMetrics instance causes state coupling between subtests;
instantiate a fresh metrics object inside each subtest to isolate state. Move
the call to metrics.NewAgentMetrics() from the outer scope into each t.Run
subtest body (or create a local agentMetrics variable at the start of each
subtest) so tests don't share mutable state and can be parallelized safely;
update any references in those subtests to use the local agentMetrics variable.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@           Coverage Diff            @@
##             main     #852   +/-   ##
==========================================
+ Coverage   46.32%   46.59%   +0.27%
==========================================
  Files         122      122
  Lines       17397    17536   +139
==========================================
+ Hits         8059     8171   +112
- Misses       8596     8619    +23
- Partials      742      746     +4
```

☔ View full report in Codecov by Sentry.
🧹 Nitpick comments (1)
internal/event/event_writer_test.go (1)
554-555: Remove unnecessary sleeps in metrics-only assertions.

At Line 554, Line 571, Line 587, and Line 615, the sleeps are not needed for the current assertions (`len(...)` and `seconds >= 0`). Removing them will reduce test runtime and flake risk without changing intent.

♻️ Suggested cleanup

```diff
-	time.Sleep(5 * time.Millisecond)
 	evSender.sendEvent(createResourceID(app1.ObjectMeta))
-	time.Sleep(5 * time.Millisecond)
 	evSender.sendEvent(createResourceID(app1.ObjectMeta))
-	time.Sleep(5 * time.Millisecond)
 	evSender.sendEvent(resID)
-	time.Sleep(5 * time.Millisecond)
 	evSender.sendEvent(resID)
```

Also applies to: 571-572, 587-588, 615-616
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/event/event_writer_test.go` around lines 554 - 555, Remove the unnecessary time.Sleep(5 * time.Millisecond) calls used solely for metrics-only assertions in internal/event/event_writer_test.go: locate the test cases calling time.Sleep before asserting len(...) of metrics slices or checking seconds >= 0 and delete those sleep lines (they are the only use of time.Sleep in those blocks), leaving the assertions as-is; run the tests to ensure no other timing dependency remains and remove any now-unused imports if applicable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/event/event_writer_test.go`:
- Around line 554-555: Remove the unnecessary time.Sleep(5 * time.Millisecond)
calls used solely for metrics-only assertions in
internal/event/event_writer_test.go: locate the test cases calling time.Sleep
before asserting len(...) of metrics slices or checking seconds >= 0 and delete
those sleep lines (they are the only use of time.Sleep in those blocks), leaving
the assertions as-is; run the tests to ensure no other timing dependency remains
and remove any now-unused imports if applicable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d55e22c6-41c1-439e-af9a-0cb7f38d60f1
📒 Files selected for processing (2)
- internal/event/event_writer.go
- internal/event/event_writer_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/event/event_writer.go
Add three new principal-side histograms to isolate which stage owns the observed ~10s propagation latency plateau:

- principal_send_queue_dwell_seconds: time from SendQ.Add to EventWriter.Add
- principal_event_writer_dwell_seconds: time from EventWriter.Add to wire send
- principal_ack_roundtrip_seconds: time from wire send to ACK received

Implementation:

- enqueuedAt CloudEvents extension stamped in boundedQueue.Add (single site covers all enqueue call sites throughout the codebase)
- writerAddedAt field on eventMessage set in EventWriter.Add
- EventWriter gains optional principalMetrics field via SetMetrics; nil-safe so agent-side and test call sites are unaffected
- Observations wired in sendFunc (SendQueueDwell) and recvFunc (AckRoundtrip) in the eventstream server

Signed-off-by: Drew Bailey <drew.bailey@airbnb.com>
Under heavy load or during reconnects, events can dwell longer than the previous 10s max bucket, causing all high-latency observations to accumulate in +Inf and making percentiles uncomputable. Signed-off-by: Drew Bailey <drew.bailey@airbnb.com>
… metrics

Add documentation for principal_send_queue_dwell_seconds, principal_event_writer_dwell_seconds, principal_ack_roundtrip_seconds, and the rest_client_* k8s API client metrics. Restructure the table to include a labels column and add a pipeline diagram showing what each hop-by-hop metric measures. Signed-off-by: Drew Bailey <drew.bailey@airbnb.com>
Signed-off-by: Drew Bailey <drew.bailey@airbnb.com>
Signed-off-by: Drew Bailey <drew.bailey@airbnb.com>
Signed-off-by: Drew Bailey <drew.bailey@airbnb.com>
Signed-off-by: Drew Bailey <drew.bailey@airbnb.com>
Force-pushed: 9a672ef to f456a7d
Add three new principal-side histograms to isolate which stage owns propagation latency:
Implementation:
What does this PR do / why we need it:
Which issue(s) this PR fixes:
Fixes #?
How to test changes / Special notes to the reviewer:
Checklist
Summary by CodeRabbit
New Features
Documentation
Tests