Specialized tsid layout for otel schemas #143955
dnhatn wants to merge 12 commits into elastic:main
Conversation
I wonder if this changes once a data stream has more variety of data. We should probably also do a similar thing for Prometheus data.
tsdb-metricsgen-270m benchmarks
tsdb benchmarks
Buildkite benchmark this with tsdb-metricsgen-270m please
💔 Build Failed
Failed CI Steps
This build attempts two tsdb-metricsgen-270m benchmarks to evaluate the performance impact of this PR. To estimate benchmark completion time, inspect previous nightly runs.
@martijnvg @kkrik-es @felixbarny I am not sure we can change the TSID layout of a data stream where some old backing indices use the old TSID layout. Querying across backing indices with old and new layouts could return incorrect results: time-series boundaries in old indices won't match those in the new index. Only the first/last aggregation buckets at the index boundary will be affected. We could add an index-version marker to the data stream, but existing data streams would never get the benefit. I am not sure if we should accept the broken boundary aggregation buckets.
This should only affect unwrapped time series aggs without dimension bucketing - not common but possible. I'm on the fence here. What kind of wins are we seeing? I wonder if we can get close with the backup plan of slicing by time.
@kkrik-es Partitioning the rate by TSID slices is significantly faster and cheaper than time-based slicing (based on a local benchmark). However, if we can't implement this change, we'll need to improve the time-based slicing for the rate. |
@kkrik-es I am also working on dynamic partitioning, which may introduce more overhead than prefixes but does not require changes to the TSID layout. |
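As a rough illustration of the prefix-based partitioning idea (all names and the byte position below are hypothetical, not the actual implementation): slicing on a uniformly distributed tsid byte splits the series space into balanced partitions that can be aggregated in parallel.

```java
import java.util.Arrays;

// Hypothetical sketch of partitioning time series for parallel rate
// aggregation by slicing on a tsid prefix byte. With a layout that groups
// series per metric, the byte after the metric byte is roughly uniform,
// so slicing on it yields balanced partitions.
public final class TsidPartitioner {
    // Maps a tsid to one of numPartitions slices using byte 1 as the key.
    static int partition(byte[] tsid, int numPartitions) {
        return (tsid[1] & 0xFF) * numPartitions / 256;
    }

    public static void main(String[] args) {
        int[] counts = new int[4];
        for (int i = 0; i < 256; i++) {
            byte[] tsid = new byte[] { 0x2A, (byte) i };
            counts[partition(tsid, 4)]++;
        }
        // Each of the 4 partitions receives 64 of the 256 possible byte values.
        System.out.println(Arrays.toString(counts));
    }
}
```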
I think we'll have to find a way to be able to evolve the _tsid. We've done so in the past and haven't considered it a breaking change. But it's possible we were missing something. In the past, we've made sure to add an index version gate around this so that a given backing index doesn't change the _tsid mid-game. This could be problematic as that would affect routing and therefore could lead to duplicates in different shards. But my understanding is that if you change the _tsid format in a new backing index, that should be mostly fine. It'll act as some kind of global counter reset. So there may be some impact for buckets that span across backing indices. But this shouldn't be worse than a regular counter reset.
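A minimal sketch of that gating idea (the version constant and names are made up for illustration, not Elasticsearch code): the format is chosen once from the version the backing index was created with, so a given index never changes its _tsid mid-game.

```java
// Illustrative only: choose the tsid format per backing index from its
// creation version, so the layout never changes within an index. The
// version constant and enum names here are assumptions.
public final class TsidFormatGate {
    enum TsidFormat { LEGACY, CLUSTERING }

    // Hypothetical gate version for the new layout.
    static final int CLUSTERING_TSID_VERSION = 9_01_00_99;

    static TsidFormat formatFor(int indexCreatedVersion) {
        // Existing indices keep their original layout; only indices created
        // on or after the gate version use the new one.
        return indexCreatedVersion >= CLUSTERING_TSID_VERSION
            ? TsidFormat.CLUSTERING
            : TsidFormat.LEGACY;
    }

    public static void main(String[] args) {
        System.out.println(formatFor(8_00_00_99)); // old index -> LEGACY
        System.out.println(formatFor(9_02_00_99)); // new index -> CLUSTERING
    }
}
```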
 * @throws IllegalArgumentException if no dimensions have been added
 */
- public BytesRef buildTsid() {
+ public BytesRef buildLegacyTsid() {
We have several legacy tsid versions now (full key/values, hashed but way longer and based on index.routing_path, 17-21b hash with 1-5 clustering bytes and now a 16b hash with 1 clustering byte, the latter two based on index.dimensions). So let's try to use a more descriptive name.
    return new BytesRef(hash, 0, index);
}

+ private BytesRef buildClusteringTsid() {
The previous version is also doing clustering, but in a bit of a different way, so maybe also find another name for the new buildClusteringTsid. Naming is hard...
 * versions of Elasticsearch must continue to route based on the
 * version on the index.
 */
assertIndexShard(fixture, Map.of("dim", Map.of("a", "a")), 7);
Let's add a test that uses the previous shard-routing values with the index version just before the new one you added (IndexVersionUtils.randomPreviousCompatibleVersion(IndexVersions.CLUSTERING_TSID)).
This is to ensure shard routing stays consistent for existing indices (see also the code comment block above).
Thanks Felix. This is what I've done in this PR - gating the new layout with a new index version. However, there is one problematic bucket at the boundary between the last backing index on the old layout and the first backing index on the new layout. Querying with
++ I think we should formalize this.
Thanks, that's a good example. I think we should declare this as an acceptable behavior. But I agree we need to formalize it and potentially document it.
I've been working on partitioning time series using TSID prefix bytes to enable parallel rate aggregation. The current layout is:

Partitioning on the first two prefix bytes yields only one effective partition for OTel data, because all time series share the same dimension names and `_metric_names_hash` value. I explored several alternative layouts - with and without SimHash, various bit allocations (3+3+10, 8+8, pure SimHash, clustering bytes) - all leading to either storage or query regressions.

After profiling the doc access patterns, the root cause turned out to be metric interleaving: when different metrics interleave in sort order, queries must skip over irrelevant documents, increasing traversal cost. The key insight is that reserving a full byte for `_metric_names_hash` ensures all time series of the same metric are grouped contiguously, with no interleaving. This improves both compression and query performance, since each query accesses exactly one contiguous slice of the segment.

This change introduces a specialized 16-byte tsid layout for OTel schemas (used when the first dimension is `_metric_names_hash`). Non-OTel schemas continue to use the current layout. Although we could also make it 16 bytes, that is follow-up work.
This builds on Felix's work in #133706, bringing the OTel tsid down to a fixed 16 bytes, enabling future optimizations for hashing.
I will add an index version and tests, but I'd like to open this up for discussion first. I have benchmarked this locally - no regressions in storage or query performance.