[Accurate counter PoC_v2] Aggregate counter downsampling using first value and reset docs. #142280

Closed
gmarouli wants to merge 24 commits into elastic:main from gmarouli:aggregate-counter-first-and-reset-docs


Conversation

@gmarouli
Contributor

gmarouli commented Feb 11, 2026

This is an alternative approach to #140360.

In this approach we aim to improve the accuracy of the aggregate counter with the following changes:

  • The downsampled document records the first value of the counter instead of the last. We expect this to improve accuracy because the first value should be closer to the start of the bucket.
  • If we detect a reset, we add two more documents: the last value before the reset and the reset value. Both keep their original timestamps.

Our hypothesis is that with these two changes, we can have a more accurate counter estimation without a big performance regression, assuming that reset events are rare and usually affect all counters at the same moment.

This PR is built on top of the refactoring in #140357 because that refactoring makes this change much easier.
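The two changes described above can be sketched roughly as follows. This is only an illustration of the idea, not the code in this PR: the `Doc` type, the `downsample_counter` function, the `kind` labels, and the choice to stamp regular documents with the bucket start are all assumptions made for the example. A reset is assumed to be detectable as a drop in the counter value within one time series.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Doc:
    kind: str        # "first", "last_before_reset" or "reset" (hypothetical labels)
    timestamp: int   # epoch seconds (assumed unit)
    value: float


def downsample_counter(samples: Iterable[Tuple[int, float]], interval: int) -> List[Doc]:
    """Downsample one counter time series, sorted by ascending timestamp.

    Emits the first value of every bucket and, around each reset, the last
    value before the reset and the reset value with their original timestamps.
    """
    docs: List[Doc] = []
    prev: Optional[Tuple[int, float]] = None
    prev_emitted = False
    current_bucket: Optional[int] = None

    for ts, value in samples:
        bucket = ts - ts % interval
        emitted = False
        if prev is not None and value < prev[1]:
            # The counter dropped: treat it as a reset.
            if not prev_emitted:
                docs.append(Doc("last_before_reset", prev[0], prev[1]))
            # In this sketch the reset value also stands in for the first
            # value if the reset happens to open a new bucket.
            docs.append(Doc("reset", ts, value))
            emitted = True
        elif bucket != current_bucket:
            # First value seen in a new bucket, stamped with the bucket start.
            docs.append(Doc("first", bucket, value))
            emitted = True
        current_bucket = bucket
        prev = (ts, value)
        prev_emitted = emitted
    return docs


if __name__ == "__main__":
    series = [(0, 10), (20, 15), (40, 30), (60, 35), (70, 40), (80, 5), (100, 8), (120, 12)]
    for doc in downsample_counter(series, interval=60):
        print(doc)
```

With a 60-second interval, the sample series yields one document per bucket plus two extra documents for the reset between timestamps 70 and 80. When resets are rare, the extra documents stay a small fraction of the output, which is the assumption the hypothesis relies on.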

@gmarouli
Contributor Author

Buildkite benchmark this with tsdb please

@gmarouli changed the title from "[PoC] Aggregate counter downsampling using first value and reset docs." to "[Accurate counter PoC_v2] Aggregate counter downsampling using first value and reset docs." on Feb 12, 2026
@gmarouli
Contributor Author

Buildkite benchmark this with tsdb please

@gmarouli
Contributor Author

TSDB Benchmark

Aggregate

# Baseline
Shard [[tsdb][0]] successfully sent [116633696], received source doc [7089492], indexed downsampled doc [7089492], failed [0], took [3.7m]
Shard [[tsdb][0]] successfully sent [116633696], received source doc [229256], indexed downsampled doc [229256], failed [0], took [1.8m]
Shard [[tsdb][0]] successfully sent [116633696], received source doc [116859], indexed downsampled doc [116859], failed [0], took [1.7m]

# Contender
Shard [[tsdb][0]] successfully sent [116633696], received source doc [7090709], indexed downsampled doc [7090709], failed [0], took [4.2m]
Shard [[tsdb][0]] successfully sent [116633696], received source doc [231486], indexed downsampled doc [231486], failed [0], took [2.1m]
Shard [[tsdb][0]] successfully sent [116633696], received source doc [119308], indexed downsampled doc [119308], failed [0], took [2m]

Last value

# Baseline
Shard [[tsdb][0]] successfully sent [116633696], received source doc [7089492], indexed downsampled doc [7089492], failed [0], took [4.1m]
Shard [[tsdb][0]] successfully sent [116633696], received source doc [229256], indexed downsampled doc [229256], failed [0], took [2.1m]
Shard [[tsdb][0]] successfully sent [116633696], received source doc [116859], indexed downsampled doc [116859], failed [0], took [2m]

# Contender
Shard [[tsdb][0]] successfully sent [116633696], received source doc [7089492], indexed downsampled doc [7089492], failed [0], took [4m]
Shard [[tsdb][0]] successfully sent [116633696], received source doc [229256], indexed downsampled doc [229256], failed [0], took [2m]
Shard [[tsdb][0]] successfully sent [116633696], received source doc [116859], indexed downsampled doc [116859], failed [0], took [1.9m]
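To read the overhead off these log lines, a small script along the following lines could be used. It is a sketch under stated assumptions: the `parse_shard_line` and `compare` helpers are hypothetical names, and the regular expression assumes the duration is always reported in minutes, as it is in the lines above.

```python
import re

# Matches the tail of a "Shard [...] successfully sent ..." line.
# Assumes the duration is reported in minutes, e.g. "took [4.2m]" or "took [2m]".
SHARD_LINE = re.compile(
    r"received source doc \[(\d+)\], "
    r"indexed downsampled doc \[(\d+)\], "
    r"failed \[(\d+)\], took \[([\d.]+)m\]"
)


def parse_shard_line(line: str) -> dict:
    """Extract document counts and duration (minutes) from one log line."""
    match = SHARD_LINE.search(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    received, indexed, failed, minutes = match.groups()
    return {
        "received": int(received),
        "indexed": int(indexed),
        "failed": int(failed),
        "minutes": float(minutes),
    }


def compare(baseline_line: str, contender_line: str) -> dict:
    """Report how many extra documents and minutes the contender needed."""
    base = parse_shard_line(baseline_line)
    cont = parse_shard_line(contender_line)
    extra_docs = cont["indexed"] - base["indexed"]
    return {
        "extra_docs": extra_docs,
        "extra_docs_pct": 100 * extra_docs / base["indexed"],
        "extra_minutes": round(cont["minutes"] - base["minutes"], 1),
    }


if __name__ == "__main__":
    # First "Aggregate" pair from the benchmark above.
    baseline = (
        "Shard [[tsdb][0]] successfully sent [116633696], received source doc [7089492], "
        "indexed downsampled doc [7089492], failed [0], took [3.7m]"
    )
    contender = (
        "Shard [[tsdb][0]] successfully sent [116633696], received source doc [7090709], "
        "indexed downsampled doc [7090709], failed [0], took [4.2m]"
    )
    print(compare(baseline, contender))
```

For the first "Aggregate" pair, this reports 1217 extra downsampled documents (under 0.02% of the baseline count) and 0.5 extra minutes.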

@gmarouli
Contributor Author

Buildkite benchmark this with tsdb please

@elasticmachine
Collaborator

elasticmachine commented Feb 13, 2026

💚 Build Succeeded

This build ran two tsdb benchmarks to evaluate performance impact of this PR.

@gmarouli
Contributor Author

TSDB Benchmark

Aggregate

# Baseline
Shard [[tsdb][0]] successfully sent [116633696], received source doc [7089492], indexed downsampled doc [7089492], failed [0], took [4.1m]
Shard [[tsdb][0]] successfully sent [116633696], received source doc [229256], indexed downsampled doc [229256], failed [0], took [2m]
Shard [[tsdb][0]] successfully sent [116633696], received source doc [116859], indexed downsampled doc [116859], failed [0], took [1.9m]

# Contender
Shard [[tsdb][0]] successfully sent [116633696], received source doc [7090709], indexed downsampled doc [7090709], failed [0], took [4.5m]
Shard [[tsdb][0]] successfully sent [116633696], received source doc [231486], indexed downsampled doc [231486], failed [0], took [2.4m]
Shard [[tsdb][0]] successfully sent [116633696], received source doc [119308], indexed downsampled doc [119308], failed [0], took [2.2m]

Last value

# Baseline
Shard [[tsdb][0]] successfully sent [116633696], received source doc [7089492], indexed downsampled doc [7089492], failed [0], took [4.1m]
Shard [[tsdb][0]] successfully sent [116633696], received source doc [229256], indexed downsampled doc [229256], failed [0], took [2.1m]
Shard [[tsdb][0]] successfully sent [116633696], received source doc [116859], indexed downsampled doc [116859], failed [0], took [2m]

# Contender
Shard [[tsdb][0]] successfully sent [116633696], received source doc [7089492], indexed downsampled doc [7089492], failed [0], took [3.9m]
Shard [[tsdb][0]] successfully sent [116633696], received source doc [229256], indexed downsampled doc [229256], failed [0], took [2m]
Shard [[tsdb][0]] successfully sent [116633696], received source doc [116859], indexed downsampled doc [116859], failed [0], took [1.8m]

@gmarouli
Contributor Author

gmarouli commented Mar 2, 2026

PoC ended, #143381 is the result.

@gmarouli closed this on Mar 2, 2026
gmarouli added a commit that referenced this pull request Mar 13, 2026
In this PR we aim to improve the accuracy of the aggregate counter by the following changes:

- The downsampled document will record the first and not the last value of the counter. This should improve accuracy because the first value is closer to the start of the bucket than the last value.
- If we detect a reset, we track extra documents: the last value before the reset and, optionally, the value after the reset. These documents preserve their original timestamps.

Our hypothesis is that with these two changes, we can have a more accurate counter estimation without a big performance regression (verified in #142280), assuming that reset events are rare and usually affect all counters at the same moment.

Closes #136178
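A toy calculation may help show why the reset documents matter for the estimate. The sketch below is not taken from the PR: `counter_increase` is a hypothetical helper, the numbers are made up for illustration, and it assumes the common convention that a drop in value means the counter restarted from zero.

```python
from typing import Iterable


def counter_increase(values: Iterable[float]) -> float:
    """Total increase of a counter given its values in time order.

    Assumption: a drop means the counter restarted from zero, so the
    post-reset value itself counts as increase.
    """
    total = 0.0
    prev = None
    for value in values:
        if prev is not None:
            total += value - prev if value >= prev else value
        prev = value
    return total


if __name__ == "__main__":
    # Illustrative raw readings with a reset between 40 and 5.
    raw = [10, 15, 30, 35, 40, 5, 8, 12]
    # Only the first value of each bucket is kept.
    first_only = [10, 35, 12]
    # First values plus the last value before the reset and the reset value.
    with_reset_docs = [10, 35, 40, 5, 12]

    print(counter_increase(raw))              # 42.0
    print(counter_increase(first_only))       # 37.0
    print(counter_increase(with_reset_docs))  # 42.0
```

In this made-up series, keeping only the first value per bucket loses the growth from 35 to 40 that happened just before the reset, while the two extra documents recover the same total as the raw data.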