Skip to content

Conversation

cboumalh
Copy link

@cboumalh cboumalh commented Jun 27, 2025

What changes were proposed in this pull request?

This PR's goal is to provide 7 new functions which utilize Theta Sketches.
The functions are:

theta_sketch_agg(sketch_col)
theta_union(sketch1, sketch2)
theta_union_agg(sketch_col)
theta_difference(sketch1, sketch2)
theta_intersection(sketch1, sketch2)
theta_intersection_agg(sketch_col)
theta_sketch_estimate(sketch)

Why are the changes needed?

Today, Spark supports HyperLogLog. HLL only supports the union set operation. Theta on the other hand allows us to find the difference and intersection between two sketches.

Does this PR introduce any user-facing change?

Yes, this PR introduces three new aggregate functions:

theta_sketch_agg
theta_intersection_agg
theta_union_agg

And four new scalar functions:

theta_sketch_estimate
theta_intersection
theta_difference
theta_union

How was this patch tested?

I've included some tests in the test suites. I also tested it locally on my machine.

Was this patch authored or co-authored using generative AI tooling?

No

Jira

https://issues.apache.org/jira/browse/SPARK-52407

@github-actions github-actions bot added the SQL label Jun 27, 2025
@cboumalh cboumalh force-pushed the SPARK-52407_add_datasketches_thetasketch branch from b301ac8 to e9a2cbe Compare June 27, 2025 01:12
@cboumalh
Copy link
Author

cboumalh commented Jun 27, 2025

I know there is probably a lot missing in the PR before it's ready to merge. Please let me know!
@mkaravel @RyanBerti

@cboumalh cboumalh marked this pull request as draft June 27, 2025 01:16
@cboumalh cboumalh marked this pull request as ready for review July 16, 2025 17:22
@cboumalh
Copy link
Author

cboumalh commented Jul 21, 2025

Below are some of the results I came up with. This run was using spark shell on a c6a.2xlarge instance.

Sketch Type Data Type Size Unique Count Estimate Error % Time (ms)
Theta Integer 12 20000000 19706898 1.47 315
Theta String 12 20000000 20158220 0.79 679
HLL Integer 12 20000000 20084065 0.42 297
HLL String 12 20000000 20413618 2.07 628
Theta Integer 20 20000000 20001631 0.01 1313
Theta String 20 20000000 20005869 0.03 1884
HLL Integer 20 20000000 20029729 0.15 366
HLL String 20 20000000 20012835 0.06 694
Theta Integer 12 10000000 10101680 1.02 154
Theta String 12 10000000 9984868 0.15 361
HLL Integer 12 10000000 10146710 1.47 131
HLL String 12 10000000 10129938 1.30 296
Theta Integer 20 10000000 9995024 0.05 924
Theta String 20 10000000 9994589 0.05 1161
HLL Integer 20 10000000 9995662 0.04 208
HLL String 20 10000000 10004646 0.05 379
Theta Integer 12 5000000 5068868 1.38 103
Theta String 12 5000000 5026543 0.53 202
HLL Integer 12 5000000 5057766 1.16 128
HLL String 12 5000000 4906794 1.86 207
Theta Integer 20 5000000 4999593 0.01 592
Theta String 20 5000000 5001495 0.03 693
HLL Integer 20 5000000 4996974 0.06 175
HLL String 20 5000000 5004207 0.08 284
Theta Integer 12 1000000 1003909 0.39 75
Theta String 12 1000000 990056 0.99 91
HLL Integer 12 1000000 976680 2.33 56
HLL String 12 1000000 1002235 0.22 92
Theta Integer 20 1000000 1000000 0.00 227
Theta String 20 1000000 1000000 0.00 227
HLL Integer 20 1000000 999921 0.01 113
HLL String 20 1000000 1000244 0.02 138
Theta Integer 12 500000 493167 1.37 78
Theta String 12 500000 491379 1.72 112
HLL Integer 12 500000 502392 0.48 107
HLL String 12 500000 488070 2.39 85
Theta Integer 20 500000 500000 0.00 163
Theta String 20 500000 500000 0.00 233
HLL Integer 20 500000 500639 0.13 91
HLL String 20 500000 500114 0.02 128
Theta Integer 12 100000 98046 1.95 195
Theta String 12 100000 98214 1.79 172
HLL Integer 12 100000 102878 2.88 115
HLL String 12 100000 100078 0.08 97
Theta Integer 20 100000 100000 0.00 120
Theta String 20 100000 100000 0.00 142
HLL Integer 20 100000 100013 0.01 102
HLL String 20 100000 100013 0.01 93

Results comparison by group

100000 entries (size 20):

  • Integer: Theta slower by 1.18x, Error diff: 0.01%
  • String: Theta slower by 1.53x, Error diff: 0.01%

100000 entries (size 12):

  • Integer: Theta slower by 1.70x, Error diff: 0.92%
  • String: Theta slower by 1.77x, Error diff: 1.71%

500000 entries (size 20):

  • Integer: Theta slower by 1.79x, Error diff: 0.13%
  • String: Theta slower by 1.82x, Error diff: 0.02%

500000 entries (size 12):

  • Integer: Theta faster by 1.37x, Error diff: 0.89%
  • String: Theta slower by 1.32x, Error diff: 0.66%

1000000 entries (size 20):

  • Integer: Theta slower by 2.01x, Error diff: 0.01%
  • String: Theta slower by 1.64x, Error diff: 0.02%

1000000 entries (size 12):

  • Integer: Theta slower by 1.34x, Error diff: 1.94%
  • String: Theta faster by 1.01x, Error diff: 0.77%

5000000 entries (size 20):

  • Integer: Theta slower by 3.38x, Error diff: 0.05%
  • String: Theta slower by 2.44x, Error diff: 0.05%

5000000 entries (size 12):

  • Integer: Theta faster by 1.24x, Error diff: 0.22%
  • String: Theta faster by 1.02x, Error diff: 1.33%

10000000 entries (size 20):

  • Integer: Theta slower by 4.44x, Error diff: 0.01%
  • String: Theta slower by 3.06x, Error diff: 0.01%

10000000 entries (size 12):

  • Integer: Theta slower by 1.18x, Error diff: 0.45%
  • String: Theta slower by 1.22x, Error diff: 1.15%

20000000 entries (size 20):

  • Integer: Theta slower by 3.59x, Error diff: 0.14%
  • String: Theta slower by 2.71x, Error diff: 0.03%

20000000 entries (size 12):

  • Integer: Theta slower by 1.06x, Error diff: 1.05%
  • String: Theta slower by 1.08x, Error diff: 1.28%

Summary and comments

Theta is competitive. It wins or is very close in many cases. Where HLL dominates is with larger sketch sizes, consistently 3-5x faster than Theta. Theta's major weakness is that it becomes significantly slower at higher precision, but on the plus side it is very accurate. Choose Theta sketches when: You need set operations (intersection, difference, union), size 12 is acceptable for your accuracy needs, and you prioritize functionality over raw speed.

https://gist.github.com/cboumalh/81ae7e1362a35d2a9968184fa1a52e0d

@xiongbo-sjtu
Copy link

LGTM, but I don't have permission to approve it.

@dongjoon-hyun Can you please help find someone to review this PR? Thanks!

Chris Boumalhab added 2 commits August 10, 2025 13:41
@dtenedor
Copy link
Contributor

I can help review this, I reviewed the original HLL sketch aggregate function contributions by Ryan Berti earlier.

Copy link
Contributor

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good, thanks again for putting in all the work to make this implementation available for Spark!!

Copy link
Contributor

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking almost done, just a few comments left and we can merge it soon.

Copy link
Contributor

@dtenedor dtenedor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed everything again and the functionality looks clear and thoroughly-tested. We can leave the PR open for another day or so if anyone else has any concerns, then proceed to merge it.

@cboumalh
Copy link
Author

cboumalh commented Sep 4, 2025

Sounds good! Thanks a lot for the thorough review. I’ll be available in case others want to chime in, and I’m happy to address any last-minute feedback. Really appreciate your help moving this forward and excited to make this feature available for the Spark community.

Chris Boumalhab added 4 commits September 9, 2025 00:02
Copy link
Contributor

@mkaravel mkaravel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cboumalh Thank you for addressing the review comments!

One more quick round of review. Please take a look.
I intend to do another pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants