Skip to content

[server] Add SLO classification dimensions to record-level delay metric#2736

Open
sushantmane wants to merge 4 commits intolinkedin:mainfrom
sushantmane:sumane/slo-record-delay-dimensions
Open

[server] Add SLO classification dimensions to record-level delay metric#2736
sushantmane wants to merge 4 commits intolinkedin:mainfrom
sushantmane:sumane/slo-record-delay-dimensions

Conversation

@sushantmane
Copy link
Copy Markdown
Contributor

Problem Statement

The ingestion.replication.record.delay OTel metric lacks dimensions needed for nearline write SLO tier classification. To measure SLO attainment per workload tier, operators must JOIN metric data with external store config — complex, fragile, and not supported natively in Grafana/PromQL.

Solution

Add three new OTel base dimensions to ingestion.replication.record.delay:

Dimension Values Source
venice.region.locality local, remote Compare record source region vs serverConfig.getRegionName()
venice.partial_update.status partial_update_enabled, partial_update_disabled store.isWriteComputationEnabled()
venice.chunking.status chunked, unchunked version.isChunkingEnabled()

All three are static base dimensions (set once per store/version, like store name), so they add zero cardinality overhead.

Region locality is computed once per region at metric state creation time. Partial update and chunking status are looked up from the store metadata repository when the RecordLevelDelayOtelStats is lazily created.

Code changes

  • Added new code behind a config. These dimensions are part of the existing ingestion.replication.record.delay metric gated by server.record.level.timestamp.enabled and server.per.record.otel.metrics.enabled (both default: false). No new configs introduced.
  • Introduced new log lines.

Concurrency-Specific Checks

  • Code has no race conditions or thread safety issues. New dimensions are immutable base dimensions set at construction time.
  • Proper synchronization mechanisms are used where needed. No new synchronization needed.
  • No blocking calls inside critical sections.
  • Verified thread-safe collections are used. VeniceConcurrentHashMap for per-region metric states (unchanged).
  • Validated proper exception handling in multi-threaded code.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Unit tests: VenicePartialUpdateStatusTest, RecordLevelDelayOtelStatsTest (3 new: region locality, partial update, chunking), HeartbeatVersionedStatsTest, RecordLevelDelayOtelMetricEntityTest

E2E test: NearlineE2ELatencyTest.testRecordLevelDelaySloDimensions — verifies SLO dimensions through the full nearline ingestion pipeline

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.

Add three new OTel dimensions to ingestion.replication.record.delay for
nearline write SLO tier classification:

- venice.region.locality (LOCAL/REMOTE): computed by comparing the
  record's source region against the server's local region
- venice.write_compute.status (WRITE_COMPUTE_ENABLED/DISABLED): from
  store config, distinguishes full PUT vs write compute workloads
- venice.chunking.status (CHUNKED/UNCHUNKED): from version config,
  identifies stores that handle large values

All three are static base dimensions (set once per store/version), so
they add zero cardinality overhead. These dimensions enable SLO queries
like: "P99 leader record delay for local-colo, non-WC, non-chunked
stores" without requiring external config JOINs at query time.
Rename VENICE_WRITE_COMPUTE_STATUS → VENICE_PARTIAL_UPDATE_STATUS and
VeniceWriteComputeStatus → VenicePartialUpdateStatus to use the
user-facing terminology (partial update) rather than the internal
implementation name (write compute).
Add testRecordLevelDelaySloDimensions to NearlineE2ELatencyTest that
verifies the new OTel dimensions (region locality, partial update
status, chunking status) are correctly emitted on the
ingestion.replication.record.delay histogram after nearline ingestion.
Copilot AI review requested due to automatic review settings April 14, 2026 19:03
Replace VenicePartialUpdateStatus (partial_update_enabled/disabled) with
VeniceStoreWriteType (regular_put/partial_update) for clearer SLO tier
naming. Dimension renamed from venice.partial_update.status to
venice.store.write_type.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds SLO-tier classification dimensions to the server-side OTel record-level delay metric (ingestion.replication.record.delay) so operators can slice latency/SLOs by locality, partial-update status, and chunking without joining against external store config.

Changes:

  • Extend RecordLevelDelayOtelStats / RecordLevelDelayOtelMetricEntity to emit new dimensions: venice.region.locality, venice.partial_update.status, venice.chunking.status.
  • Plumb local region + store metadata into HeartbeatVersionedStats so the delay metric can be tagged at stats creation time.
  • Add unit + integration/E2E tests validating the new dimensions are present on emitted histograms.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/endToEnd/NearlineE2ELatencyTest.java Adds an E2E assertion that at least one server emits the record-delay histogram with expected SLO dimensions.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/dimensions/VenicePartialUpdateStatusTest.java New unit test for the partial-update dimension enum wiring/value mapping.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VenicePartialUpdateStatus.java Introduces VenicePartialUpdateStatus dimension enum.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/dimensions/VeniceMetricsDimensions.java Registers the new VENICE_PARTIAL_UPDATE_STATUS dimension name.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelStatsTest.java Updates/extends tests to validate added SLO dimensions on record-delay histograms.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelMetricEntityTest.java Updates metric-entity dimension expectations to include the new dimensions.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatVersionedStatsTest.java Adapts tests for new constructor args and validates record-delay dimension tagging.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelStats.java Adds locality/partial-update/chunking tagging via base dimensions + per-region base dimensions.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/RecordLevelDelayOtelMetricEntity.java Expands the metric entity’s required dimension set to include SLO dimensions.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatVersionedStats.java Supplies local region and store-derived flags into RecordLevelDelayOtelStats creation.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/stats/ingestion/heartbeat/HeartbeatMonitoringService.java Passes localRegionName into HeartbeatVersionedStats to enable locality computation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +245 to +247
Store store = metadataRepository.getStore(storeName);
boolean partialUpdateEnabled = store != null && store.isWriteComputationEnabled();
boolean chunkingEnabled = store != null && store.isChunkingEnabled();
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chunkingEnabled for the SLO dimension is derived from store.isChunkingEnabled(), but the PR description states the source should be version.isChunkingEnabled(). Since chunking is a version-level attribute (and can diverge from the store-level flag for existing versions), this may misclassify the metric for CURRENT/FUTURE roles. Consider deriving chunking status from the relevant Version object(s) (e.g., current/future version) or otherwise align the implementation/description so the emitted dimension reflects the intended semantics.

Copilot uses AI. Check for mistakes.
Comment on lines +579 to +582
*/
@Test
public void testWriteComputeEnabledDimension() {
RecordLevelDelayOtelStats wcStats = new RecordLevelDelayOtelStats(
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test Javadoc mentions write_compute_status=write_compute_enabled, but the new dimension being asserted is venice.partial_update.status with values like partial_update_enabled/disabled. Please update the comment to match the actual dimension name/value semantics to avoid confusion when reading the test failures.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants