-
Notifications
You must be signed in to change notification settings - Fork 28.8k
[SPARK-52407][SQL] Add support for Theta Sketch #51298
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[SPARK-52407][SQL] Add support for Theta Sketch #51298
Conversation
b301ac8
to
e9a2cbe
Compare
I know there is probably a lot missing in the PR before it's ready to merge. Please let me know! |
…n examples in ExpressionDescription and unit tests
…and python builtin.py
Below are some of the results I came up with. This run was using spark shell on a c6a.2xlarge instance.
Results comparison by group100000 entries (size 20):
100000 entries (size 12):
500000 entries (size 20):
500000 entries (size 12):
1000000 entries (size 20):
1000000 entries (size 12):
5000000 entries (size 20):
5000000 entries (size 12):
10000000 entries (size 20):
10000000 entries (size 12):
20000000 entries (size 20):
20000000 entries (size 12):
Summary and commentsTheta is competitive. It wins or is very close in many cases. Where HLL dominates is with larger sketch sizes, consistently 3-5x faster than Theta. Theta's major weakness is that it becomes significantly slower at higher precision, but on the plus side it is very accurate. Choose Theta sketches when: You need set operations (intersection, difference, union), size 12 is acceptable for your accuracy needs, and you prioritize functionality over raw speed. https://gist.github.com/cboumalh/81ae7e1362a35d2a9968184fa1a52e0d |
LGTM, but I don't have permission to approve it. @dongjoon-hyun Can you please help find someone to review this PR? Thanks! |
I can help review this, I reviewed the original HLL sketch aggregate function contributions by Ryan Berti earlier. |
...lyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/thetasketchesExpressions.scala
Outdated
Show resolved
Hide resolved
...main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/thetasketchesAggregates.scala
Show resolved
Hide resolved
...main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/thetasketchesAggregates.scala
Outdated
Show resolved
Hide resolved
...main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/thetasketchesAggregates.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ThetaSketchUtils.scala
Outdated
Show resolved
Hide resolved
sql/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good, thanks again for putting in all the work to make this implementation available for Spark!!
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/ThetaSketchUtilsSuite.scala
Outdated
Show resolved
Hide resolved
...lyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/thetasketchesExpressions.scala
Outdated
Show resolved
Hide resolved
...main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/thetasketchesAggregates.scala
Outdated
Show resolved
Hide resolved
...main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/thetasketchesAggregates.scala
Outdated
Show resolved
Hide resolved
...main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/thetasketchesAggregates.scala
Outdated
Show resolved
Hide resolved
...main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/thetasketchesAggregates.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking almost done, just a few comments left and we can merge it soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed everything again and the functionality looks clear and thoroughly-tested. We can leave the PR open for another day or so if anyone else has any concerns, then proceed to merge it.
Sounds good! Thanks a lot for the thorough review. I’ll be available in case others want to chime in, and I’m happy to address any last-minute feedback. Really appreciate your help moving this forward and excited to make this feature available for the Spark community. |
...main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/thetasketchesAggregates.scala
Outdated
Show resolved
Hide resolved
...main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/thetasketchesAggregates.scala
Outdated
Show resolved
Hide resolved
...c/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ThetasketchesAggSuite.scala
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cboumalh Thank you for addressing the review comments!
One more quick round of review. Please take a look.
I intend to do another pass.
What changes were proposed in this pull request?
This PR's goal is to provide 7 new functions which utilize Theta Sketches.
The functions are:
Why are the changes needed?
Today, Spark supports HyperLogLog. HLL only supports the union set operation. Theta on the other hand allows us to find the difference and intersection between two sketches.
Does this PR introduce any user-facing change?
Yes, this PR introduces three new aggregate functions:
theta_sketch_agg
theta_intersection_agg
theta_union_agg
And four new scalar functions:
theta_sketch_estimate
theta_intersection
theta_difference
theta_union
How was this patch tested?
I've included some tests in the test suites. I also tested it locally on my machine.
Was this patch authored or co-authored using generative AI tooling?
No
Jira
https://issues.apache.org/jira/browse/SPARK-52407