Use sketches-ddsketch fork with Java-compatible binary encoding #2842

Merged
trinity-1686a merged 12 commits into main from congxie/replaceHll
Feb 20, 2026
@congx4
Collaborator

@congx4 congx4 commented Feb 18, 2026

Description

Reference the quickwit-oss/rust-sketches-ddsketch fork (via git + rev in Cargo.toml) which adds native Java-compatible binary serialization for DDSketch. This enables Rust applications to produce DDSketch bytes that Java consumers can directly deserialize and merge via DDSketchWithExactSummaryStatistics.decode() from the sketches-java library.

Why?

The upstream sketches-ddsketch Rust crate only supports serde-based serialization (JSON), while Java's sketches-java library uses a custom binary wire format. For distributed aggregation pipelines where Rust search nodes produce intermediate DDSketch results consumed by Java query orchestrators, binary compatibility is required.

What changed

  • Removed the vendored sketches-ddsketch/ directory from the tantivy workspace
  • Changed the sketches-ddsketch dependency from path = "./sketches-ddsketch" to git + rev pointing to quickwit-oss/rust-sketches-ddsketch@555caf1
  • The fork contains:
    • encoding.rs: Java-compatible binary encode/decode using signed/unsigned varint and VarDouble encoding
    • Cross-language golden-byte tests verifying binary compatibility with Java's DDSketch
    • Config/Store/DDSketch modifications to support encode()/decode() methods
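
The signed/unsigned varint encoding mentioned in the list above can be sketched roughly as follows. This is a generic illustration (7 payload bits per byte, plus zigzag mapping for signed values), not the fork's `encoding.rs`; the function names are hypothetical, and the real wire format may differ in details such as the maximum number of bytes per value.

```rust
/// Appends `value` as an unsigned varint: 7 payload bits per byte, low bits
/// first, with the high bit set on every byte except the last one.
pub fn encode_unsigned_varint(mut value: u64, out: &mut Vec<u8>) {
    while value >= 0x80 {
        out.push((value as u8 & 0x7f) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Reads one unsigned varint from the front of `input`.
/// Returns the value and the remaining bytes, or `None` on truncated or
/// over-long input.
pub fn decode_unsigned_varint(input: &[u8]) -> Option<(u64, &[u8])> {
    let mut value = 0u64;
    let mut shift = 0u32;
    let mut rest = input;
    while let Some((&byte, tail)) = rest.split_first() {
        rest = tail;
        if shift >= 64 {
            return None; // more continuation bytes than a u64 can hold
        }
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some((value, rest));
        }
        shift += 7;
    }
    None // ran out of bytes before the terminating byte
}

/// Zigzag mapping: interleaves negative and positive integers so that values
/// of small magnitude produce short varints (0, -1, 1, -2 map to 0, 1, 2, 3).
pub fn zigzag_encode(value: i64) -> u64 {
    ((value << 1) ^ (value >> 63)) as u64
}

/// Inverse of `zigzag_encode`.
pub fn zigzag_decode(value: u64) -> i64 {
    ((value >> 1) as i64) ^ -((value & 1) as i64)
}
```

A signed value would be written as `encode_unsigned_varint(zigzag_encode(v), &mut buf)` and read back by decoding the varint and then applying `zigzag_decode`.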

Testing

  • All existing tantivy aggregation tests pass (percentiles, cardinality, etc.)
  • All sketches-ddsketch tests pass including cross-language golden-byte tests

congx4 and others added 7 commits February 18, 2026 11:36
Fork sketches-ddsketch as a workspace member to add native Java binary
serialization (to_java_bytes/from_java_bytes) for DDSketch. This enables
pomsky to return raw DDSketch bytes that event-query can deserialize via
DDSketchWithExactSummaryStatistics.decode().

Key changes:
- Vendor sketches-ddsketch crate with encoding.rs implementing VarEncoding,
  flag bytes, and INDEX_DELTAS_AND_COUNTS store format
- Align Config::key() to floor-based indexing matching Java's LogarithmicMapping
- Add PercentilesCollector::to_sketch_bytes() for pomsky integration
- Cross-language golden byte tests verified byte-identical with Java output
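
To illustrate the floor-based indexing mentioned above, here is a minimal sketch of a logarithmic mapping. `LogMapping` is a hypothetical stand-in, not the fork's `Config`; it assumes the usual DDSketch definition gamma = (1 + alpha) / (1 - alpha) and no index offset, and it only handles positive values.

```rust
/// Hypothetical logarithmic index mapping with relative accuracy `alpha`.
pub struct LogMapping {
    alpha: f64,
    ln_gamma: f64,
}

impl LogMapping {
    pub fn new(alpha: f64) -> Self {
        let gamma = (1.0 + alpha) / (1.0 - alpha);
        LogMapping { alpha, ln_gamma: gamma.ln() }
    }

    /// Floor-based bucket index: bucket `k` covers [gamma^k, gamma^(k+1)).
    pub fn key(&self, value: f64) -> i32 {
        (value.ln() / self.ln_gamma).floor() as i32
    }

    /// Representative value of bucket `key`: the lower bound gamma^key
    /// scaled by (1 + alpha), which keeps the relative error within alpha
    /// at both ends of the bucket.
    pub fn value(&self, key: i32) -> f64 {
        (f64::from(key) * self.ln_gamma).exp() * (1.0 + self.alpha)
    }
}
```

With a ceil-based index the same value would land one bucket higher whenever it is not exactly on a boundary, so two implementations must agree on the rounding direction before their bucket indices can be merged.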

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Reference the exact Java source files in DataDog/sketches-java for
Config::new(), Config::key(), Config::value(), Config::from_gamma(),
and Store::add_count() so readers can verify the alignment.

Co-authored-by: Cursor <cursoragent@cursor.com>
- manual_range_contains: use !(0.0..=1.0).contains(&q)
- identity_op: simplify (0 << 2) | FLAG_TYPE to just FLAG_TYPE
- manual_clamp: use .clamp(0, 8) instead of .max(0).min(8)
- manual_repeat_n: use repeat_n() instead of repeat().take()
- cast_abs_to_unsigned: use .unsigned_abs() instead of .abs() as usize

Co-authored-by: Cursor <cursoragent@cursor.com>
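
For readers unfamiliar with these lints, a few of the rewrites listed above look roughly like this in isolation. The function names are hypothetical and chosen only for illustration; they are not taken from the crate.

```rust
/// manual_range_contains: a range check instead of `q < 0.0 || q > 1.0`.
/// NaN is not contained in any range, so it is reported as invalid too.
pub fn is_invalid_quantile(q: f64) -> bool {
    !(0.0..=1.0).contains(&q)
}

/// manual_clamp: `.clamp(0, 8)` instead of `.max(0).min(8)`.
pub fn clamp_len(n: i32) -> i32 {
    n.clamp(0, 8)
}

/// cast_abs_to_unsigned: `.unsigned_abs()` instead of `.abs() as usize`.
/// Unlike `.abs()`, this cannot overflow for `isize::MIN`.
pub fn offset_distance(offset: isize) -> usize {
    offset.unsigned_abs()
}
```
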
- Replace bare constants with FlagType and BinEncodingMode enums
- Use const fn for flag byte construction instead of raw bit ops
- Replace if-else chain with nested match in decode_from_java_bytes
- Use split_first() in read_byte for idiomatic slice consumption
- Use split_at in read_f64_le to avoid TryInto on edition 2018
- Use u64::from(next) instead of `next as u64` casts
- Extract assert_golden, assert_quantiles_match, bytes_to_hex helpers
  to reduce duplication across golden byte tests
- Fix edition-2018 assert! format string compatibility
- Clean up is_valid_flag_byte with let-else and match

Co-authored-by: Cursor <cursoragent@cursor.com>
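
A rough sketch of what the flag-byte and reader refactors above might look like. The `(sub_flag << 2) | type` layout follows the expression quoted in the earlier lint commit; the enum variants, their numeric values, and the function signatures are assumptions made for illustration and should be checked against the fork and sketches-java.

```rust
/// Illustrative flag types occupying the low two bits of a flag byte.
/// The variant names and values here are assumptions, not the fork's.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum FlagType {
    SketchFeatures = 0b00,
    PositiveStore = 0b01,
    IndexMapping = 0b10,
    NegativeStore = 0b11,
}

/// Builds a flag byte: the sub-flag sits above the two type bits.
pub const fn flag_byte(flag_type: FlagType, sub_flag: u8) -> u8 {
    (sub_flag << 2) | flag_type as u8
}

/// Consumes one byte from the front of `input` using `split_first`.
pub fn read_byte(input: &mut &[u8]) -> Option<u8> {
    let (&first, rest) = input.split_first()?;
    *input = rest;
    Some(first)
}

/// Consumes eight bytes as a little-endian f64. Copying into a fixed array
/// after `split_at` avoids needing `TryInto`, which is not in the
/// edition 2018 prelude.
pub fn read_f64_le(input: &mut &[u8]) -> Option<f64> {
    if input.len() < 8 {
        return None;
    }
    let (head, rest) = input.split_at(8);
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(head);
    *input = rest;
    Some(f64::from_le_bytes(bytes))
}
```

Passing `&mut &[u8]` lets each reader advance the caller's slice in place, so a decoder can chain reads without tracking an offset.
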
- Replace approximate PI/E constants with non-famous value in test
- Fix reversed empty range (2048..0) → (0..2048).rev() in store test

Co-authored-by: Cursor <cursoragent@cursor.com>
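
The range fix above matters because a Rust range whose start exceeds its end is simply empty, so a loop over it never runs. A small sketch with hypothetical helper names:

```rust
/// Counts the items of `n..0`. For any positive `n` the range is empty,
/// so a test looping over it would silently do nothing.
pub fn reversed_bounds_count(n: i32) -> usize {
    (n..0).count()
}

/// Iterates from `n - 1` down to `0` by reversing an ascending range.
pub fn descending_keys(n: i32) -> Vec<i32> {
    (0..n).rev().collect()
}
```
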
use crate::ddsketch::DDSketch;
use crate::store::Store;

// ---------------------------------------------------------------------------
Collaborator Author

This file was generated by Opus, which also generated tests to check that it is correct. See the PR description for more details.

@PSeitz
Collaborator

PSeitz commented Feb 19, 2026

I don't think this should be in the tantivy repo.

The changes should be in your sketches-ddsketch fork, where we try to upstream it.

@fulmicoton-dd
Collaborator

@congx4 I am not sure @PSeitz's comment was clear. Can you just point to your repo with {git =, rev=}?

@congx4
Collaborator Author

congx4 commented Feb 19, 2026

@congx4 I am not sure @PSeitz comment was clear. Can you just point to your repo with {git =, rev=}

Thanks, but it looks like we don't have permission to fork it under Datadog, so I will fork it under quickwit-oss instead.

Move the vendored sketches-ddsketch crate (with Java-compatible binary
encoding) to its own repo at quickwit-oss/rust-sketches-ddsketch and
reference it via git+rev in Cargo.toml.

Co-authored-by: Cursor <cursoragent@cursor.com>
@congx4 congx4 changed the title from "Vendor sketches-ddsketch with Java-compatible binary encoding" to "Use sketches-ddsketch fork with Java-compatible binary encoding" Feb 19, 2026
Collaborator

@fulmicoton-dd fulmicoton-dd left a comment

Approved, but please have a look at the comments and see if they make sense

congx4 and others added 3 commits February 19, 2026 13:21
Address review feedback: replace assert_eq! with assert_nearly_equals!
for float values that go through JSON serialization roundtrips, which
can introduce minor precision differences.

Co-authored-by: Cursor <cursoragent@cursor.com>
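
The idea behind swapping exact equality for an approximate comparison can be sketched as below. `nearly_equal` is a hypothetical helper written for illustration; it is not the `assert_nearly_equals!` macro named in the commit message, whose tolerance is not shown in this conversation.

```rust
/// Returns true when `a` and `b` differ by at most `rel_tol` relative to
/// the larger of their magnitudes. Exact equality short-circuits so that
/// equal values (including two zeros) always compare as nearly equal.
pub fn nearly_equal(a: f64, b: f64, rel_tol: f64) -> bool {
    if a == b {
        return true;
    }
    let scale = a.abs().max(b.abs());
    (a - b).abs() <= rel_tol * scale
}
```

For example, `0.1 + 0.2` is not bit-identical to `0.3` in IEEE 754 arithmetic, so an exact `assert_eq!` on the two fails while `nearly_equal(0.1 + 0.2, 0.3, 1e-9)` holds.
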
…ctor

Replace the derived Serialize/Deserialize on PercentilesCollector with
custom impls that use DDSketch's Java-compatible binary encoding
(encode_to_java_bytes / decode_from_java_bytes). This removes the need
for the use_serde feature on sketches-ddsketch entirely.

Also restore original float test values and use assert_nearly_equals!
for all float comparisons in percentile tests, since DDSketch quantile
estimates can have minor precision differences across platforms.

Co-authored-by: Cursor <cursoragent@cursor.com>
Keep use_serde on sketches-ddsketch so DDSketch derives
Serialize/Deserialize, removing the need for custom impls
on PercentilesCollector.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@congx4 congx4 force-pushed the congxie/replaceHll branch 2 times, most recently from 9f764cb to 18fedd9 February 20, 2026 02:07
@trinity-1686a trinity-1686a merged commit d0c5ffb into main Feb 20, 2026
12 checks passed
@trinity-1686a trinity-1686a deleted the congxie/replaceHll branch February 20, 2026 15:56
4 participants