dtenedor (Contributor) commented Dec 3, 2025

What changes were proposed in this pull request?

This PR adds comprehensive documentation for Spark SQL's sketch-based approximate functions powered by the Apache DataSketches library. The new documentation page (sql-ref-sketch-aggregates.md) covers:

Function Reference:

  • HyperLogLog (HLL) Sketch Functions: hll_sketch_agg, hll_union_agg, hll_sketch_estimate, hll_union
  • Theta Sketch Functions: theta_sketch_agg, theta_union_agg, theta_intersection_agg, theta_sketch_estimate, theta_union, theta_intersection, theta_difference
  • KLL Quantile Sketch Functions: kll_sketch_agg_*, kll_sketch_to_string_*, kll_sketch_get_n_*, kll_sketch_merge_*, kll_sketch_get_quantile_*, kll_sketch_get_rank_*
  • Approximate Top-K Functions: approx_top_k_accumulate, approx_top_k_combine, approx_top_k_estimate

Best Practices:

  • Guidance on choosing between HLL and Theta sketches
  • Accuracy vs. memory trade-offs for each sketch type
  • Tips for storing and reusing sketches

Common Use Cases and Examples:

  • Tracking daily unique users with HLL sketches (ETL workflow)
  • Computing percentiles over time with KLL sketches
  • Set operations with Theta sketches (intersection, difference for cohort analysis)
  • Finding trending items with Top-K sketches
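For reviewers who want a feel for the cohort-analysis use case, here is a minimal sketch, not taken from the new page. It assumes the Theta functions listed above accept a string column and take their optional arguments at default values; the `visits` view and its `channel` and `user_id` columns are hypothetical.

```sql
-- Hypothetical sample data: which users visited through which channel.
CREATE OR REPLACE TEMP VIEW visits AS
SELECT * FROM VALUES
  ('web', 'alice'), ('web', 'bob'), ('web', 'carol'),
  ('mobile', 'bob'), ('mobile', 'carol'), ('mobile', 'dave')
AS t(channel, user_id);

-- Build one Theta sketch per channel, then combine the two sketches
-- with set operations before asking for a cardinality estimate.
WITH per_channel AS (
  SELECT channel, theta_sketch_agg(user_id) AS users_sketch
  FROM visits
  GROUP BY channel
)
SELECT
  -- Users seen on both channels.
  theta_sketch_estimate(
    theta_intersection(w.users_sketch, m.users_sketch)) AS both_channels,
  -- Users seen on web but not on mobile.
  theta_sketch_estimate(
    theta_difference(w.users_sketch, m.users_sketch)) AS web_only
FROM per_channel w
CROSS JOIN per_channel m
WHERE w.channel = 'web' AND m.channel = 'mobile';
```

The set operations run on the sketches rather than on the raw rows, which is what allows cohorts to be compared after the underlying events have been aggregated away.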

The PR also adds links to this new documentation page from:

  • sql-ref-functions.md (under Aggregate-like Functions)
  • sql-ref.md (under Functions section)
  • _data/menu-sql.yaml (navigation menu)

Why are the changes needed?

Spark SQL has added several sketch-based approximate functions using the Apache DataSketches library (HLL sketches in 3.5.0, Theta/KLL/Top-K sketches in 4.1.0), but there was no comprehensive documentation explaining:

  • How to use these functions together in practical ETL workflows
  • How to store sketches and merge them across multiple data batches
  • Best practices for choosing the right sketch type and tuning accuracy parameters
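The store-and-merge pattern in the second point can be sketched roughly as follows. This is an illustration under assumptions, not an excerpt from the new page: it uses `hll_sketch_agg`, `hll_union_agg`, and `hll_sketch_estimate` with their default precision, and the `events` and `daily_user_sketches` views and their columns are hypothetical names.

```sql
-- Hypothetical raw events; in a real pipeline this would be a batch of input data.
CREATE OR REPLACE TEMP VIEW events AS
SELECT * FROM VALUES
  (DATE'2025-12-01', 'alice'), (DATE'2025-12-01', 'bob'),
  (DATE'2025-12-02', 'bob'),   (DATE'2025-12-02', 'carol'),
  (DATE'2025-12-03', 'alice'), (DATE'2025-12-03', 'dave')
AS t(event_date, user_id);

-- Step 1: reduce each day to a single binary sketch.
-- A temp view keeps the example self-contained; an ETL job would
-- write these rows to a persisted table instead.
CREATE OR REPLACE TEMP VIEW daily_user_sketches AS
SELECT event_date, hll_sketch_agg(user_id) AS users_sketch
FROM events
GROUP BY event_date;

-- Step 2: merge the stored daily sketches and estimate distinct users
-- over the whole date range without rescanning the raw events.
SELECT hll_sketch_estimate(hll_union_agg(users_sketch)) AS approx_unique_users
FROM daily_user_sketches
WHERE event_date BETWEEN DATE'2025-12-01' AND DATE'2025-12-03';
```

Because each day is stored as a sketch rather than a count, users who appear on several days are not double-counted when the days are merged.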

This documentation fills that gap by showing how the sketch functions fit together in end-to-end Spark SQL workflows.

Does this PR introduce any user-facing change?

Yes, this PR adds a new user-facing documentation page and links to it from existing pages. No code changes are included.

How was this patch tested?

Documentation-only change. The examples were verified against the existing function implementations and test cases in the codebase.

Was this patch authored or co-authored using generative AI tooling?

Yes, code assistance with claude-4.5-opus-high in combination with manual editing by the author.

github-actions bot added the DOCS label Dec 3, 2025
dtenedor (Contributor, Author) commented Dec 3, 2025

cc @cboumalh I added some Spark documentation for the new DataSketches-based aggregate functions we have so far. We can keep extending it as we add new functions later.