
Specialized tsid layout for otel schemas #143955

Draft
dnhatn wants to merge 12 commits into elastic:main from dnhatn:tsid-for-otel

Conversation

@dnhatn
Member

@dnhatn dnhatn commented Mar 10, 2026

I've been working on partitioning time series using TSID prefix bytes to enable parallel rate aggregation. The current layout is:

```
byte0      = hash(dimension_names)
byte1      = hash(value_0)
byte2      = hash(value_1)
byte3      = hash(value_2)
byte4      = hash(value_3)
bytes 5–20 = 128-bit hash of all dimension names and values (uniqueness)
```

Total: 21 bytes (1 name byte + up to 4 value bytes + 16 hash bytes).
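As a rough illustration, the layout above can be sketched as follows. This is not the actual implementation: the class and method names are hypothetical, and MD5 merely stands in for whichever 128-bit hash the real code uses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

// Hypothetical sketch of the current 21-byte TSID layout described above.
public class LegacyTsidSketch {
    public static byte[] buildTsid(List<String> names, List<String> values) {
        byte[] tsid = new byte[21];
        tsid[0] = hashByte(String.join(",", names));        // byte0: hash of dimension names
        for (int i = 0; i < 4 && i < values.size(); i++) {
            tsid[1 + i] = hashByte(values.get(i));          // bytes 1-4: one hash byte per value
        }
        try {
            MessageDigest md = MessageDigest.getInstance("MD5"); // 128-bit stand-in hash
            for (int i = 0; i < names.size(); i++) {
                md.update(names.get(i).getBytes(StandardCharsets.UTF_8));
                md.update(values.get(i).getBytes(StandardCharsets.UTF_8));
            }
            System.arraycopy(md.digest(), 0, tsid, 5, 16);  // bytes 5-20: uniqueness hash
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new AssertionError(e);
        }
        return tsid;
    }

    private static byte hashByte(String s) {
        return (byte) s.hashCode();                         // illustrative 1-byte hash
    }
}
```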

Partitioning on the first two prefix bytes yields only one effective partition for OTel data because all time series share the same dimension names and _metric_names_hash value. I explored several alternative layouts - with and without SimHash, various bit allocations (3+3+10, 8+8, pure SimHash, clustering bytes) - all of which led to either storage or query regressions.

After profiling the doc access patterns, the root cause turned out to be metric interleaving. When different metrics interleave in sort order, queries must skip over irrelevant documents, increasing traversal cost. The key insight is that reserving a full byte for _metric_names_hash ensures all time series of the same metric are grouped contiguously, with no interleaving. This improves both compression and query performance, since each query accesses exactly one contiguous slice of the segment.

This change introduces a specialized 16-byte tsid layout for OTel schemas (if the first dimension is _metric_names_hash):

```
byte0      = hash(_metric_names_hash value) — separates metric types
bytes 1–15 = 120-bit hash of all dimensions — uniqueness and within-metric ordering
```

Total: 16 bytes (2 longs).
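A minimal sketch of the specialized layout, under the assumption that byte 0 derives from the _metric_names_hash value and the remaining 15 bytes are taken from a 128-bit hash over all dimensions (dropping 8 of its 128 bits, leaving 120). Names are hypothetical:

```java
// Hypothetical sketch of the 16-byte OTel TSID layout.
public class OtelTsidSketch {
    public static byte[] buildOtelTsid(byte metricNamesHashByte, byte[] hash128) {
        if (hash128.length != 16) {
            throw new IllegalArgumentException("expected a 128-bit (16-byte) hash");
        }
        byte[] tsid = new byte[16];
        tsid[0] = metricNamesHashByte;                 // byte0: groups all series of one metric
        System.arraycopy(hash128, 0, tsid, 1, 15);     // bytes 1-15: 120-bit uniqueness hash
        return tsid;
    }
}
```

Because byte 0 is shared by every series of a metric and the sort is lexicographic, all documents of one metric form a contiguous run in the segment.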

Non-OTel schemas continue to use the current layout. Although we could also shrink it to 16 bytes, that is left for follow-up work.

This builds on Felix's work in #133706, bringing the OTel tsid down to a fixed 16 bytes, enabling future optimizations for hashing.

I will add an index version and tests. I'd like to open this up for discussion first. I have benchmarked this locally: no regressions in storage or query performance.

@dnhatn dnhatn changed the title Specialized TSID layout for OTel schemas Specialized tsid layout for OTel schemas Mar 10, 2026
@dnhatn dnhatn changed the title Specialized tsid layout for OTel schemas Specialized tsid layout for otel schemas Mar 10, 2026
@felixbarny
Member

> Partitioning on the first two prefix bytes yields only one effective partition for OTel data because all time series share the same dimension names and _metric_names_hash value.

I wonder if this changes once a data stream has more variety of data. But maybe _metric_names_hash is good enough for those cases as well.

We should probably also do a similar thing for Prometheus data using labels.__name__.

@dnhatn
Member Author

dnhatn commented Mar 12, 2026

tsdb-metricsgen-270m benchmarks

@dnhatn
Member Author

dnhatn commented Mar 12, 2026

tsdb benchmarks

@dnhatn
Member Author

dnhatn commented Mar 15, 2026

tsdb-metricsgen-270m benchmarks

@dnhatn
Member Author

dnhatn commented Mar 15, 2026

Buildkite benchmark this with tsdb-metricsgen-270m please

@elasticmachine
Collaborator

elasticmachine commented Mar 15, 2026

💔 Build Failed

Failed CI Steps

This build attempts two tsdb-metricsgen-270m benchmarks to evaluate performance impact of this PR. To estimate benchmark completion time inspect previous nightly runs here.


@dnhatn
Member Author

dnhatn commented Mar 15, 2026

@martijnvg @kkrik-es @felixbarny I am not sure we can change the TSID layout of a data stream where some old backing indices have the old TSID layout. Querying across backing indices with old and new layouts could return incorrect results - time-series boundaries in old indices won't match those in the new index. Only the first/last aggregation buckets at the index boundary will be affected. We could add an index-version marker to the data stream, but existing data streams would never get the benefit. I am not sure if we should accept the broken boundary aggregation buckets.

@dnhatn dnhatn requested review from kkrik-es and martijnvg March 15, 2026 05:53
@dnhatn dnhatn requested a review from felixbarny March 15, 2026 05:54
@kkrik-es
Contributor

This should only affect unwrapped time series aggs without dimension bucketing - not common, but possible.

I'm on the fence here. What kind of wins are we seeing? I wonder if we can get close with the backup plan of slicing by time.

@dnhatn
Member Author

dnhatn commented Mar 15, 2026

> What kind of wins are we seeing?

@kkrik-es Partitioning the rate by TSID slices is significantly faster and cheaper than time-based slicing (based on a local benchmark). However, if we can't implement this change, we'll need to improve the time-based slicing for the rate.
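To illustrate why prefix bytes enable cheap partitioning: since the index is sorted by TSID, assigning each TSID to a partition by its first byte gives every worker a contiguous doc range to scan. This is only a sketch of the idea; the names and the partition arithmetic are illustrative, not the PR's actual code.

```java
// Hypothetical sketch of prefix-based TSID partitioning for parallel rate aggregation.
public class TsidPartitioner {
    public static int partitionOf(byte[] tsid, int partitions) {
        int prefix = tsid[0] & 0xFF;          // unsigned first byte, 0..255
        return prefix * partitions / 256;     // maps contiguous prefix ranges to partitions
    }
}
```

With the old layout, OTel data collapses into one partition because every TSID starts with the same prefix bytes; the new layout spreads metrics across byte0 values, so the ranges become useful.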

@dnhatn
Member Author

dnhatn commented Mar 15, 2026

@kkrik-es I am also working on dynamic partitioning, which may introduce more overhead than prefixes but does not require changes to the TSID layout.

@felixbarny
Member

I think we'll have to find a way to be able to evolve the _tsid. We've done so in the past and haven't considered it a breaking change. But it's possible we were missing something.

In the past, we've made sure to add an index version gate around this so that a given backing index doesn't change the _tsid mid-game. This could be problematic as that would affect routing and therefore could lead to duplicates in different shards. But my understanding is that if you change the _tsid format in a new backing index, that should be mostly fine. It'll act as some kind of global counter reset. So there may be some impact for buckets that span across backing indices. But this shouldn't be worse than a regular counter reset.

```diff
  * @throws IllegalArgumentException if no dimensions have been added
  */
-public BytesRef buildTsid() {
+public BytesRef buildLegacyTsid() {
```
Member

We have several legacy tsid versions now (full key/values, hashed but way longer and based on index.routing_path, 17-21b hash with 1-5 clustering bytes and now a 16b hash with 1 clustering byte, the latter two based on index.dimensions). So let's try to use a more descriptive name.

```java
    return new BytesRef(hash, 0, index);
}

private BytesRef buildClusteringTsid() {
```
Member

The previous version is also doing clustering, but in a bit of a different way, so maybe also find another name for the new buildClusteringTsid. Naming is hard...

```java
 * versions of Elasticsearch must continue to route based on the
 * version on the index.
 */
assertIndexShard(fixture, Map.of("dim", Map.of("a", "a")), 7);
```
Member

Let's add a test that uses the previous values for shard routing with the index version just before the new one you added (IndexVersionUtils.randomPreviousCompatibleVersion(IndexVersions.CLUSTERING_TSID)).

This is to ensure shard routing stays consistent for existing indices (see also the code comment block above)

@dnhatn
Member Author

dnhatn commented Mar 16, 2026

> In the past, we've made sure to add an index version gate around this so that a given backing index doesn't change the _tsid mid-game. This could be problematic as that would affect routing and therefore could lead to duplicates in different shards. But my understanding is that if you change the _tsid format in a new backing index, that should be mostly fine. It'll act as some kind of global counter reset. So there may be some impact for buckets that span across backing indices. But this shouldn't be worse than a regular counter reset.

Thanks Felix. This is what I've done in this PR - gating the new layout with a new index version. However, there is one problematic bucket at the boundary between the last backing index on the old layout and the first backing index on the new layout.

```
backing-index-0: 00:00 -> 02:30
backing-index-1: 02:30 -> 04:30
-- upgrade --
backing-index-2: 04:30 -> 06:30
```

Querying with TBUCKET(1h) across these backing indices, the buckets at 00:00, 01:00, 02:00, 03:00, 05:00, and 06:00 all work correctly. The only problematic bucket is 04:00: it spans backing-index-1 [04:00 -> 04:30) and backing-index-2 [04:30 -> 05:00), but because the _tsid layout differs between them, the same series produces different (tsid, time-bucket) pairs and is treated as two separate time series instead of one. So, for example, SUM(LAST_OVER_TIME(x)) BY TBUCKET(1h) may double-count that one bucket. This is a one-time issue at the upgrade boundary.
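The boundary issue above can be demonstrated with a toy grouping: the same logical series hashes to tsid "A" under the old layout and "B" under the new one, so the 04:00 bucket is keyed twice. This sketch, with made-up tsids and sample data, just shows the double-keying:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: group samples by (tsid, time-bucket), as a per-series
// aggregation would before combining results.
public class BoundaryBucketSketch {
    public static Map<String, List<Double>> group(List<String[]> samples) {
        Map<String, List<Double>> groups = new TreeMap<>();
        for (String[] s : samples) {                      // s = { tsid, bucket, value }
            groups.computeIfAbsent(s[0] + "|" + s[1], k -> new ArrayList<>())
                  .add(Double.parseDouble(s[2]));
        }
        return groups;
    }
}
```

For a sample at 04:10 in the old index ({"A", "04:00", "1.0"}) and one at 04:40 in the new index ({"B", "04:00", "1.0"}), grouping yields two keys, A|04:00 and B|04:00, so a SUM over the per-series LAST_OVER_TIME counts the value twice in that bucket.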

> I think we'll have to find a way to be able to evolve the _tsid.

++ I think we should formalize this.

@felixbarny
Member

Thanks, that's a good example. I think we should declare this as an acceptable behavior. But I agree we need to formalize it and potentially document it.
