history: add RPC P99 latency check to DeepHealthCheck #9419
Draft
laniehei wants to merge 3 commits into temporalio:main
Adds a new CheckTypeRPCP99Latency (check 6) to the history service DeepHealthCheck. The existing average latency and error ratio checks were blind to the s-gc014 incident pattern where a single pod with degraded Astra connectivity caused cell-wide P99 latency to spike to ~31s while average latency stayed at ~18ms.

Changes:
- common/aggregate: Add Percentile(p float64) method to MovingWindowAvgImpl
- common/rpc/interceptor: Add P99Latency() to HealthSignalAggregator interface and impl
- common/health: Add CheckTypeRPCP99Latency constant
- common/dynamicconfig: Add HealthRPCP99LatencyFailure setting (default 1000ms)
- service/history/configs: Wire HealthRPCP99LatencyFailure
- service/history/handler: Add check 6 + emit history_rpc_p99_latency_ms gauge
- docs: Add incident analysis for recurring s-gc014 P99 latency pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds CheckTypePersistenceP99Latency (check 6) alongside the existing RPC P99 check (now check 7). During the s-gc014 incident, GetOrCreateShard averaged 5.47s on the affected pod while all other operations were in the ms range. The persistence P99 check would have fired on the pod itself during shard recovery, before cross-shard callers started timing out.

Changes:
- common/persistence: Add P99Latency() to HealthSignalAggregator interface, impl (type assertion to *MovingWindowAvgImpl), and noop
- common/health: Add CheckTypePersistenceP99Latency constant
- common/dynamicconfig: Add HealthPersistenceP99LatencyFailure (default 500ms)
- service/history/configs: Wire HealthPersistenceP99LatencyFailure
- service/history/handler: Add check 6 (persistence P99); RPC P99 becomes check 7
- docs: Update incident analysis with persistence P99 findings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Add `CheckTypePersistenceP99Latency` (check 6) to `DeepHealthCheck` with a configurable threshold (default 500ms)
- Add `CheckTypeRPCP99Latency` (check 7) to `DeepHealthCheck` with a configurable threshold (default 1000ms)
- Add `Percentile(p float64)` method to `MovingWindowAvgImpl` to compute percentiles over the moving window
- Add `P99Latency()` to both `persistence.HealthSignalAggregator` and `interceptor.HealthSignalAggregator`
- Emit `history_rpc_p99_latency_ms` gauge on each `DeepHealthCheck` call

Motivation
The existing `DeepHealthCheck` checks average latency and error ratio. These are blind to incidents where a single pod has degraded persistence connectivity: the average latency stays low because only a small fraction of operations (e.g. shard recovery) are slow, and requests hang to deadline rather than returning errors, so the error ratio stays near zero.

The P99 checks catch tail latency degradation that averages hide.
Test plan
- `go test ./common/aggregate/...`: `Percentile` method
- `go test ./common/persistence/...`: `P99Latency()` on persistence aggregator
- `go test ./common/rpc/interceptor/...`: `P99Latency()` on RPC aggregator
- `go test ./service/history/...`: handler + config wiring

🤖 Generated with Claude Code