[SPARK-52798] [SQL] Add function approx_top_k_combine #51505
+769
−60
What changes were proposed in this pull request?
This PR adds a SQL function, approx_top_k_combine, an aggregation function that merges multiple sketches into a single sketch.
Syntax
approx_top_k_combine(expr[, maxItemsTracked])
Arguments
expr: An expression of sketch structs.
maxItemsTracked: An optional INTEGER literal. If maxItemsTracked is specified, its value is used for the newly generated combined sketch. If maxItemsTracked is not specified, all input sketches must have the same maxItemsTracked, and the output sketch uses that same value.
Returns
This function returns a STRUCT with four fields: sketch, itemDataType, maxItemsTracked, and typeCode. The return value has exactly the same structure as that of approx_top_k_accumulate.
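The maxItemsTracked rule described under Arguments can be sketched in a few lines of plain Python. This is only an illustrative model, not the Spark implementation: the helper name resolve_max_items_tracked, the use of ValueError, and the treatment of an empty input are assumptions made for this sketch, and the real function may report errors differently.

```python
from typing import Optional, Sequence


def resolve_max_items_tracked(
    input_sizes: Sequence[int],
    max_items_tracked: Optional[int] = None,
) -> int:
    """Pick the maxItemsTracked value for a combined sketch.

    input_sizes holds the maxItemsTracked of each input sketch.
    Hypothetical helper: it models only the rule stated in the PR text.
    """
    # Case 1: the caller passed an explicit value, so it is used as-is.
    if max_items_tracked is not None:
        return max_items_tracked

    # Case 2: no explicit value, so every input must agree on one size.
    distinct = set(input_sizes)
    if not distinct:
        # Assumption of this sketch: an empty input is rejected.
        raise ValueError("no input sketches to combine")
    if len(distinct) > 1:
        raise ValueError(
            "input sketches have different maxItemsTracked values: "
            f"{sorted(distinct)}; specify maxItemsTracked explicitly"
        )
    return distinct.pop()
```

Under these assumptions, inputs that all track 100 items combine into a sketch tracking 100 items, while inputs of sizes 100 and 200 are accepted only when an explicit maxItemsTracked is given.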
Why are the changes needed?
It is a useful sibling function for approx_top_k queries.
Does this PR introduce any user-facing change?
Yes, this PR introduces a new user-facing SQL function. See the user examples below.
How was this patch tested?
Unit tests for end-to-end SQL queries and invalid input for expressions.
Was this patch authored or co-authored using generative AI tooling?