[server] Add SLO classification dimensions to record-level delay metric#2736
[server] Add SLO classification dimensions to record-level delay metric#2736sushantmane wants to merge 4 commits intolinkedin:mainfrom
Conversation
Add three new OTel dimensions to ingestion.replication.record.delay for nearline write SLO tier classification: - venice.region.locality (LOCAL/REMOTE): computed by comparing the record's source region against the server's local region - venice.write_compute.status (WRITE_COMPUTE_ENABLED/DISABLED): from store config, distinguishes full PUT vs write compute workloads - venice.chunking.status (CHUNKED/UNCHUNKED): from version config, identifies stores that handle large values All three are static base dimensions (set once per store/version), so they add zero cardinality overhead. These dimensions enable SLO queries like: "P99 leader record delay for local-colo, non-WC, non-chunked stores" without requiring external config JOINs at query time.
Rename VENICE_WRITE_COMPUTE_STATUS → VENICE_PARTIAL_UPDATE_STATUS and VeniceWriteComputeStatus → VenicePartialUpdateStatus to use the user-facing terminology (partial update) rather than the internal implementation name (write compute).
Add testRecordLevelDelaySloDimensions to NearlineE2ELatencyTest that verifies the new OTel dimensions (region locality, partial update status, chunking status) are correctly emitted on the ingestion.replication.record.delay histogram after nearline ingestion.
Replace VenicePartialUpdateStatus (partial_update_enabled/disabled) with VeniceStoreWriteType (regular_put/partial_update) for clearer SLO tier naming. Dimension renamed from venice.partial_update.status to venice.store.write_type.
There was a problem hiding this comment.
Pull request overview
Adds SLO-tier classification dimensions to the server-side OTel record-level delay metric (ingestion.replication.record.delay) so operators can slice latency/SLOs by locality, partial-update status, and chunking without joining against external store config.
Changes:
- Extend
RecordLevelDelayOtelStats/RecordLevelDelayOtelMetricEntityto emit new dimensions:venice.region.locality,venice.partial_update.status,venice.chunking.status. - Plumb local region + store metadata into
HeartbeatVersionedStatsso the delay metric can be tagged at stats creation time. - Add unit + integration/E2E tests validating the new dimensions are present on emitted histograms.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/NearlineE2ELatencyTest.java | Adds an E2E assertion that at least one server emits the record-delay histogram with expected SLO dimensions. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VenicePartialUpdateStatusTest.java | New unit test for the partial-update dimension enum wiring/value mapping. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VenicePartialUpdateStatus.java | Introduces VenicePartialUpdateStatus dimension enum. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java | Registers the new VENICE_PARTIAL_UPDATE_STATUS dimension name. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelStatsTest.java | Updates/extends tests to validate added SLO dimensions on record-delay histograms. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelMetricEntityTest.java | Updates metric-entity dimension expectations to include the new dimensions. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatVersionedStatsTest.java | Adapts tests for new constructor args and validates record-delay dimension tagging. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelStats.java | Adds locality/partial-update/chunking tagging via base dimensions + per-region base dimensions. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelMetricEntity.java | Expands the metric entity’s required dimension set to include SLO dimensions. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatVersionedStats.java | Supplies local region and store-derived flags into RecordLevelDelayOtelStats creation. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatMonitoringService.java | Passes localRegionName into HeartbeatVersionedStats to enable locality computation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Store store = metadataRepository.getStore(storeName); | ||
| boolean partialUpdateEnabled = store != null && store.isWriteComputationEnabled(); | ||
| boolean chunkingEnabled = store != null && store.isChunkingEnabled(); |
There was a problem hiding this comment.
chunkingEnabled for the SLO dimension is derived from store.isChunkingEnabled(), but the PR description states the source should be version.isChunkingEnabled(). Since chunking is a version-level attribute (and can diverge from the store-level flag for existing versions), this may misclassify the metric for CURRENT/FUTURE roles. Consider deriving chunking status from the relevant Version object(s) (e.g., current/future version) or otherwise align the implementation/description so the emitted dimension reflects the intended semantics.
| */ | ||
| @Test | ||
| public void testWriteComputeEnabledDimension() { | ||
| RecordLevelDelayOtelStats wcStats = new RecordLevelDelayOtelStats( |
There was a problem hiding this comment.
This test Javadoc mentions write_compute_status=write_compute_enabled, but the new dimension being asserted is venice.partial_update.status with values like partial_update_enabled/disabled. Please update the comment to match the actual dimension name/value semantics to avoid confusion when reading the test failures.
Problem Statement
The
ingestion.replication.record.delayOTel metric lacks dimensions needed for nearline write SLO tier classification. To measure SLO attainment per workload tier, operators must JOIN metric data with external store config — complex, fragile, and not supported natively in Grafana/PromQL.Solution
Add three new OTel base dimensions to
ingestion.replication.record.delay:venice.region.localitylocal,remoteserverConfig.getRegionName()venice.partial_update.statuspartial_update_enabled,partial_update_disabledstore.isWriteComputationEnabled()venice.chunking.statuschunked,unchunkedversion.isChunkingEnabled()All three are static base dimensions (set once per store/version, like store name), so they add zero cardinality overhead.
Region locality is computed once per region at metric state creation time. Partial update and chunking status are looked up from the store metadata repository when the
RecordLevelDelayOtelStatsis lazily created.Code changes
ingestion.replication.record.delaymetric gated byserver.record.level.timestamp.enabledandserver.per.record.otel.metrics.enabled(both default: false). No new configs introduced.Concurrency-Specific Checks
VeniceConcurrentHashMapfor per-region metric states (unchanged).How was this PR tested?
Unit tests:
VenicePartialUpdateStatusTest,RecordLevelDelayOtelStatsTest(3 new: region locality, partial update, chunking),HeartbeatVersionedStatsTest,RecordLevelDelayOtelMetricEntityTestE2E test:
NearlineE2ELatencyTest.testRecordLevelDelaySloDimensions— verifies SLO dimensions through the full nearline ingestion pipelineDoes this PR introduce any user-facing or breaking changes?