history: add RPC P99 latency check to DeepHealthCheck by laniehei · Pull Request #9419 · temporalio/temporal

laniehei · 2026-02-27T00:21:02Z

Summary

Adds CheckTypePersistenceP99Latency (check 6) to DeepHealthCheck with a configurable threshold (default 500ms)
Adds CheckTypeRPCP99Latency (check 7) to DeepHealthCheck with a configurable threshold (default 1000ms)
Adds Percentile(p float64) method to MovingWindowAvgImpl to compute percentiles over the moving window
Adds P99Latency() to both persistence.HealthSignalAggregator and interceptor.HealthSignalAggregator
Emits history_rpc_p99_latency_ms gauge on each DeepHealthCheck call

Motivation

The existing DeepHealthCheck checks average latency and error ratio. These are blind to incidents where a single pod has degraded persistence connectivity: the average latency stays low because only a small fraction of operations (e.g. shard recovery) are slow, and requests hang to deadline rather than returning errors so the error ratio stays near zero.

The P99 checks catch tail latency degradation that averages hide:

Persistence P99 fires on the affected pod during shard recovery when persistence operations are slow — detecting the root cause before downstream callers are affected
RPC P99 fires when inbound RPCs are hanging — a backstop for latency issues that don't go through persistence

Test plan

go test ./common/aggregate/... — Percentile method
go test ./common/persistence/... — P99Latency() on persistence aggregator
go test ./common/rpc/interceptor/... — P99Latency() on RPC aggregator
go test ./service/history/... — handler + config wiring

🤖 Generated with Claude Code

Adds a new CheckTypeRPCP99Latency (check 6) to the history service DeepHealthCheck. The existing average latency and error ratio checks were blind to the s-gc014 incident pattern where a single pod with degraded Astra connectivity caused cell-wide P99 latency to spike to ~31s while average latency stayed at ~18ms.

Changes:
- common/aggregate: Add Percentile(p float64) method to MovingWindowAvgImpl
- common/rpc/interceptor: Add P99Latency() to HealthSignalAggregator interface and impl
- common/health: Add CheckTypeRPCP99Latency constant
- common/dynamicconfig: Add HealthRPCP99LatencyFailure setting (default 1000ms)
- service/history/configs: Wire HealthRPCP99LatencyFailure
- service/history/handler: Add check 6 + emit history_rpc_p99_latency_ms gauge
- docs: Add incident analysis for recurring s-gc014 P99 latency pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds CheckTypePersistenceP99Latency (check 6) alongside the existing RPC P99 check (now check 7). During the s-gc014 incident, GetOrCreateShard averaged 5.47s on the affected pod while all other operations were in the ms range. The persistence P99 check would have fired on the pod itself during shard recovery, before cross-shard callers started timing out.

Changes:
- common/persistence: Add P99Latency() to HealthSignalAggregator interface, impl (type assertion to *MovingWindowAvgImpl), and noop
- common/health: Add CheckTypePersistenceP99Latency constant
- common/dynamicconfig: Add HealthPersistenceP99LatencyFailure (default 500ms)
- service/history/configs: Wire HealthPersistenceP99LatencyFailure
- service/history/handler: Add check 6 (persistence P99); RPC P99 becomes check 7
- docs: Update incident analysis with persistence P99 findings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

laniehei and others added 3 commits February 26, 2026 16:20

Remove internal incident doc from public repo

46b6004

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

laniehei wants to merge 3 commits into temporalio:main from laniehei:laniehei-workflow-lock-wait-metric
