
Specialized tsid layout for otel schemas #143955

Draft
dnhatn wants to merge 12 commits into elastic:main from dnhatn:tsid-for-otel

Conversation

@dnhatn
Member

@dnhatn dnhatn commented Mar 10, 2026

I've been working on partitioning time series using TSID prefix bytes to enable parallel rate aggregation. The current layout is:

```
byte0      = hash(dimension_names)
byte1      = hash(value_0)
byte2      = hash(value_1)
byte3      = hash(value_2)
byte4      = hash(value_3)
bytes 5–20 = 128-bit hash of all dimension names and values (uniqueness)
```

Total: 21 bytes (1 name byte + up to 4 value bytes + 16 hash bytes).
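As a rough illustration, the layout above can be sketched as follows. This is not the actual implementation: the class and method names are hypothetical, and MD5 merely stands in for whichever 128-bit hash the real code uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

// Hypothetical sketch of the current 21-byte TSID layout described above.
public class LegacyTsidSketch {
    public static byte[] buildTsid(List<String> names, List<String> values) {
        byte[] tsid = new byte[21];
        tsid[0] = hashByte(String.join(",", names));        // byte0: hash of dimension names
        for (int i = 0; i < 4 && i < values.size(); i++) {
            tsid[1 + i] = hashByte(values.get(i));          // bytes 1-4: one hash byte per value
        }
        try {
            MessageDigest md = MessageDigest.getInstance("MD5"); // 128-bit stand-in hash
            for (int i = 0; i < names.size(); i++) {
                md.update(names.get(i).getBytes(StandardCharsets.UTF_8));
                md.update(values.get(i).getBytes(StandardCharsets.UTF_8));
            }
            System.arraycopy(md.digest(), 0, tsid, 5, 16);  // bytes 5-20: uniqueness hash
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
        return tsid;
    }

    private static byte hashByte(String s) {
        return (byte) s.hashCode();                         // illustrative 1-byte hash
    }
}
```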

Partitioning on the first two prefix bytes yields only one effective partition for OTel data because all time series share the same dimension names and _metric_names_hash value. I explored several alternative layouts - with and without SimHash, various bit allocations (3+3+10, 8+8, pure SimHash, clustering bytes) - all of which led to either storage or query regressions.

After profiling the doc access patterns, the root cause turned out to be metric interleaving. When different metrics interleave in sort order, queries must skip over irrelevant documents, increasing traversal cost. The key insight is that reserving a full byte for _metric_names_hash ensures all time series of the same metric are grouped contiguously, with no interleaving. This improves both compression and query performance, since each query accesses exactly one contiguous slice of the segment.

This change introduces a specialized 16-byte tsid layout for OTel schemas (if the first dimension is _metric_names_hash):

```
byte0      = hash(_metric_names_hash value) — separates metric types
bytes 1–15 = 120-bit hash of all dimensions — uniqueness and within-metric ordering
```

Total: 16 bytes (2 longs).
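A minimal sketch of the specialized layout, under the assumption that byte 0 derives from the _metric_names_hash value and the remaining 15 bytes are taken from a 128-bit hash over all dimensions (dropping 8 of its 128 bits, leaving 120). Names are hypothetical:

```java
// Hypothetical sketch of the 16-byte OTel TSID layout.
public class OtelTsidSketch {
    public static byte[] buildOtelTsid(byte metricNamesHashByte, byte[] hash128) {
        if (hash128.length != 16) {
            throw new IllegalArgumentException("expected a 128-bit (16-byte) hash");
        }
        byte[] tsid = new byte[16];
        tsid[0] = metricNamesHashByte;                 // byte0: groups all series of one metric
        System.arraycopy(hash128, 0, tsid, 1, 15);     // bytes 1-15: 120-bit uniqueness hash
        return tsid;
    }
}
```

Because byte 0 is shared by every series of a metric and the sort is lexicographic, all documents of one metric form a contiguous run in the segment.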

Non-OTel schemas continue to use the current layout. Although we could also shrink it to 16 bytes, that is left for follow-up work.

This builds on Felix's work in #133706, bringing the OTel tsid down to a fixed 16 bytes, enabling future optimizations for hashing.

I will add an index version and tests. I'd like to open this up for discussion first. I have benchmarked this locally: no regressions in storage or query performance.

@dnhatn dnhatn changed the title Specialized TSID layout for OTel schemas Specialized tsid layout for OTel schemas Mar 10, 2026
@dnhatn dnhatn changed the title Specialized tsid layout for OTel schemas Specialized tsid layout for otel schemas Mar 10, 2026
@felixbarny
Member

> Partitioning on the first two prefix bytes yields only one effective partition for OTel data because all time series share the same dimension names and _metric_names_hash value.

I wonder if this changes once a data stream has more variety of data. But maybe _metric_names_hash is good enough for those cases as well.

We should probably also do a similar thing for Prometheus data using labels.__name__.

@dnhatn
Member Author

dnhatn commented Mar 12, 2026

tsdb-metricsgen-270m benchmarks

@dnhatn
Member Author

dnhatn commented Mar 12, 2026

tsdb benchmarks

@dnhatn
Member Author

dnhatn commented Mar 15, 2026

tsdb-metricsgen-270m benchmarks

@dnhatn
Member Author

dnhatn commented Mar 15, 2026

Buildkite benchmark this with tsdb-metricsgen-270m please

@elasticmachine
Collaborator

elasticmachine commented Mar 15, 2026

💔 Build Failed

Failed CI Steps

This build attempts two tsdb-metricsgen-270m benchmarks to evaluate performance impact of this PR. To estimate benchmark completion time inspect previous nightly runs here.


@dnhatn
Member Author

dnhatn commented Mar 15, 2026

@martijnvg @kkrik-es @felixbarny I am not sure we can change the TSID layout of a data stream where some old backing indices have the old TSID layout. Querying across backing indices with old and new layouts could return incorrect results - time-series boundaries in old indices won't match those in the new index. Only the first/last aggregation buckets at the index boundary will be affected. We could add an index-version marker to the data stream, but existing data streams would never get the benefit. I am not sure if we should accept the broken boundary aggregation buckets.

@dnhatn dnhatn requested review from kkrik-es and martijnvg March 15, 2026 05:53
@dnhatn dnhatn requested a review from felixbarny March 15, 2026 05:54
@kkrik-es
Contributor

This should only affect unwrapped time series aggs without dimension bucketing - not common, but possible.

I'm on the fence here. What kind of wins are we seeing? I wonder if we can get close with the backup plan of slicing by time.

@dnhatn
Member Author

dnhatn commented Mar 15, 2026

> What kind of wins are we seeing?

@kkrik-es Partitioning the rate by TSID slices is significantly faster and cheaper than time-based slicing (based on a local benchmark). However, if we can't implement this change, we'll need to improve the time-based slicing for the rate.
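To illustrate why prefix bytes enable cheap partitioning: since the index is sorted by TSID, assigning each TSID to a partition by its first byte gives every worker a contiguous doc range to scan. This is only a sketch of the idea; the names and the partition arithmetic are illustrative, not the PR's actual code.

```java
// Hypothetical sketch of prefix-based TSID partitioning for parallel rate aggregation.
public class TsidPartitioner {
    public static int partitionOf(byte[] tsid, int partitions) {
        int prefix = tsid[0] & 0xFF;          // unsigned first byte, 0..255
        return prefix * partitions / 256;     // maps contiguous prefix ranges to partitions
    }
}
```

With the old layout, OTel data collapses into one partition because every TSID starts with the same prefix bytes; the new layout spreads metrics across byte0 values, so the ranges become useful.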

@dnhatn
Member Author

dnhatn commented Mar 15, 2026

@kkrik-es I am also working on dynamic partitioning, which may introduce more overhead than prefixes but does not require changes to the TSID layout.

@felixbarny
Member

I think we'll have to find a way to be able to evolve the _tsid. We've done so in the past and haven't considered it a breaking change. But it's possible we were missing something.

In the past, we've made sure to add an index version gate around this so that a given backing index doesn't change the _tsid mid-game. This could be problematic as that would affect routing and therefore could lead to duplicates in different shards. But my understanding is that if you change the _tsid format in a new backing index, that should be mostly fine. It'll act as some kind of global counter reset. So there may be some impact for buckets that span across backing indices. But this shouldn't be worse than a regular counter reset.

```diff
  * @throws IllegalArgumentException if no dimensions have been added
  */
-public BytesRef buildTsid() {
+public BytesRef buildLegacyTsid() {
```
Member

We have several legacy tsid versions now (full key/values, hashed but way longer and based on index.routing_path, 17-21b hash with 1-5 clustering bytes and now a 16b hash with 1 clustering byte, the latter two based on index.dimensions). So let's try to use a more descriptive name.

```java
    return new BytesRef(hash, 0, index);
}

private BytesRef buildClusteringTsid() {
```
Member

The previous version is also doing clustering, but in a bit of a different way, so maybe also find another name for the new buildClusteringTsid. Naming is hard...

```java
 * versions of Elasticsearch must continue to route based on the
 * version on the index.
 */
assertIndexShard(fixture, Map.of("dim", Map.of("a", "a")), 7);
```
Member

Let's add a test that uses the previous values for shard routing with the index version just before the new one you added (IndexVersionUtils.randomPreviousCompatibleVersion(IndexVersions.CLUSTERING_TSID)).

This is to ensure shard routing stays consistent for existing indices (see also the code comment block above)

@dnhatn
Member Author

dnhatn commented Mar 16, 2026

> In the past, we've made sure to add an index version gate around this so that a given backing index doesn't change the _tsid mid-game. This could be problematic as that would affect routing and therefore could lead to duplicates in different shards. But my understanding is that if you change the _tsid format in a new backing index, that should be mostly fine. It'll act as some kind of global counter reset. So there may be some impact for buckets that span across backing indices. But this shouldn't be worse than a regular counter reset.

Thanks Felix. This is what I've done in this PR - gating the new layout with a new index version. However, there is one problematic bucket at the boundary between the last backing index on the old layout and the first backing index on the new layout.

```
backing-index-0: 00:00 -> 02:30
backing-index-1: 02:30 -> 04:30
-- upgrade --
backing-index-2: 04:30 -> 06:30
```

Querying with TBUCKET(1h) across these backing indices, the buckets at 00:00, 01:00, 02:00, 03:00, 05:00, and 06:00 all work correctly. The only problematic bucket is 04:00: it spans backing-index-1 [04:00 -> 04:30) and backing-index-2 [04:30 -> 05:00), but because the _tsid layout differs between them, the same series produces different (tsid, time-bucket) pairs and is treated as two separate time series instead of one. So, for example, SUM(LAST_OVER_TIME(x)) BY TBUCKET(1h) may double-count that one bucket. This is a one-time issue at the upgrade boundary.
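The boundary issue above can be demonstrated with a toy grouping: the same logical series hashes to tsid "A" under the old layout and "B" under the new one, so the 04:00 bucket is keyed twice. This sketch, with made-up tsids and sample data, just shows the double-keying:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: group samples by (tsid, time-bucket), as a per-series
// aggregation would before combining results.
public class BoundaryBucketSketch {
    public static Map<String, List<Double>> group(List<String[]> samples) {
        Map<String, List<Double>> groups = new TreeMap<>();
        for (String[] s : samples) {                      // s = { tsid, bucket, value }
            groups.computeIfAbsent(s[0] + "|" + s[1], k -> new ArrayList<>())
                  .add(Double.parseDouble(s[2]));
        }
        return groups;
    }
}
```

For a sample at 04:10 in the old index ({"A", "04:00", "1.0"}) and one at 04:40 in the new index ({"B", "04:00", "1.0"}), grouping yields two keys, A|04:00 and B|04:00, so a SUM over the per-series LAST_OVER_TIME counts the value twice in that bucket.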

> I think we'll have to find a way to be able to evolve the _tsid.

++ I think we should formalize this.

@felixbarny
Member

Thanks, that's a good example. I think we should declare this as an acceptable behavior. But I agree we need to formalize it and potentially document it.
