
history: add RPC P99 latency check to DeepHealthCheck #9419

Draft

laniehei wants to merge 3 commits into temporalio:main from laniehei:laniehei-workflow-lock-wait-metric

Conversation


@laniehei laniehei commented Feb 27, 2026

Summary

  • Adds CheckTypePersistenceP99Latency (check 6) to DeepHealthCheck with a configurable threshold (default 500ms)
  • Adds CheckTypeRPCP99Latency (check 7) to DeepHealthCheck with a configurable threshold (default 1000ms)
  • Adds Percentile(p float64) method to MovingWindowAvgImpl to compute percentiles over the moving window
  • Adds P99Latency() to both persistence.HealthSignalAggregator and interceptor.HealthSignalAggregator
  • Emits history_rpc_p99_latency_ms gauge on each DeepHealthCheck call

Motivation

The existing DeepHealthCheck checks average latency and error ratio. These are blind to incidents where a single pod has degraded persistence connectivity: the average latency stays low because only a small fraction of operations (e.g. shard recovery) are slow, and requests hang until the deadline rather than returning errors, so the error ratio stays near zero.

The P99 checks catch tail latency degradation that averages hide:

  • Persistence P99 fires on the affected pod during shard recovery when persistence operations are slow — detecting the root cause before downstream callers are affected
  • RPC P99 fires when inbound RPCs are hanging — a backstop for latency issues that don't go through persistence

Test plan

  • go test ./common/aggregate/... — Percentile method
  • go test ./common/persistence/... — P99Latency() on persistence aggregator
  • go test ./common/rpc/interceptor/... — P99Latency() on RPC aggregator
  • go test ./service/history/... — handler + config wiring

🤖 Generated with Claude Code

laniehei and others added 3 commits February 26, 2026 16:20
Adds a new CheckTypeRPCP99Latency (check 6) to the history service
DeepHealthCheck. The existing average latency and error ratio checks
were blind to the s-gc014 incident pattern where a single pod with
degraded Astra connectivity caused cell-wide P99 latency to spike to
~31s while average latency stayed at ~18ms.

Changes:
- common/aggregate: Add Percentile(p float64) method to MovingWindowAvgImpl
- common/rpc/interceptor: Add P99Latency() to HealthSignalAggregator interface and impl
- common/health: Add CheckTypeRPCP99Latency constant
- common/dynamicconfig: Add HealthRPCP99LatencyFailure setting (default 1000ms)
- service/history/configs: Wire HealthRPCP99LatencyFailure
- service/history/handler: Add check 6 + emit history_rpc_p99_latency_ms gauge
- docs: Add incident analysis for recurring s-gc014 P99 latency pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds CheckTypePersistenceP99Latency (check 6) alongside the existing
RPC P99 check (now check 7). During the s-gc014 incident, GetOrCreateShard
averaged 5.47s on the affected pod while all other operations were in the
ms range — the persistence P99 check would have fired on the pod itself
during shard recovery, before cross-shard callers started timing out.

Changes:
- common/persistence: Add P99Latency() to HealthSignalAggregator interface,
  impl (type assertion to *MovingWindowAvgImpl), and noop
- common/health: Add CheckTypePersistenceP99Latency constant
- common/dynamicconfig: Add HealthPersistenceP99LatencyFailure (default 500ms)
- service/history/configs: Wire HealthPersistenceP99LatencyFailure
- service/history/handler: Add check 6 (persistence P99); RPC P99 becomes check 7
- docs: Update incident analysis with persistence P99 findings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
