[SPARK-54576][SQL] Add documentation for new Datasketches-based aggregate functions #53297
+942
−0
What changes were proposed in this pull request?
This PR adds comprehensive documentation for Spark SQL's sketch-based approximate functions powered by the Apache DataSketches library. The new documentation page (sql-ref-sketch-aggregates.md) covers:
Function Reference:
hll_sketch_agg, hll_union_agg, hll_sketch_estimate, hll_union
theta_sketch_agg, theta_union_agg, theta_intersection_agg, theta_sketch_estimate, theta_union, theta_intersection, theta_difference
kll_sketch_agg_*, kll_sketch_to_string_*, kll_sketch_get_n_*, kll_sketch_merge_*, kll_sketch_get_quantile_*, kll_sketch_get_rank_*
approx_top_k_accumulate, approx_top_k_combine, approx_top_k_estimate
Best Practices:
Common Use Cases and Examples:
The PR also adds links to this new documentation page from:
sql-ref-functions.md (under Aggregate-like Functions)
sql-ref.md (under Functions section)
_data/menu-sql.yaml (navigation menu)
Why are the changes needed?
Spark SQL has added several sketch-based approximate functions using the Apache DataSketches library (HLL sketches in 3.5.0, Theta/KLL/Top-K sketches in 4.1.0), but there was no comprehensive documentation explaining:
This documentation fills that gap and helps users understand the full power of sketch-based analytics in Spark SQL.
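For readers unfamiliar with the pattern behind names such as theta_sketch_agg, theta_union and theta_sketch_estimate, the following is a minimal pure-Python sketch of the aggregate / union / estimate workflow. It is a toy K-minimum-values estimator, not the Apache DataSketches algorithm and not Spark's implementation; the function names sketch_agg, sketch_union and sketch_estimate and the size K = 256 are hypothetical and chosen only for illustration.

```python
import hashlib

K = 256  # number of smallest hash values kept per sketch (illustrative assumption)


def _hash01(value):
    """Map a value to a pseudo-uniform float in [0, 1) via SHA-256."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def sketch_agg(values, k=K):
    """Build a sketch: the k smallest distinct hash values, sorted ascending.

    For brevity this hashes everything and then truncates; a real
    implementation would keep only k values in memory while streaming.
    """
    return sorted({_hash01(v) for v in values})[:k]


def sketch_union(a, b, k=K):
    """Merge two sketches without rescanning the raw data."""
    return sorted(set(a) | set(b))[:k]


def sketch_estimate(sketch, k=K):
    """Estimate the number of distinct values summarized by a sketch."""
    if len(sketch) < k:
        return float(len(sketch))  # fewer than k distinct values seen: exact count
    return (k - 1) / sketch[-1]    # K-minimum-values estimator


# Two overlapping batches, sketched separately and merged afterwards.
day1 = sketch_agg(range(0, 60_000))
day2 = sketch_agg(range(40_000, 100_000))
merged = sketch_union(day1, day2)
print(round(sketch_estimate(merged)))  # approximate; the true distinct count is 100000
```

The property this illustrates is mergeability: sketches built over separate partitions or time windows can be combined later, and the merged sketch is the same one that a single pass over all the data would have produced.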
Does this PR introduce any user-facing change?
Yes, this PR adds new documentation pages that are user-facing. No code changes are included.
How was this patch tested?
Documentation-only change. The examples were verified against the existing function implementations and test cases in the codebase.
Was this patch authored or co-authored using generative AI tooling?
Yes, code assistance with claude-4.5-opus-high in combination with manual editing by the author.