[SPARK-52828][SQL] Make hashing for collated strings collation agnostic #51521

Open
wants to merge 5 commits into
base: master
from

Conversation

uros-db
Contributor

@uros-db uros-db commented Jul 16, 2025

What changes were proposed in this pull request?

We change the behavior of the Murmur3Hash and XxHash64 catalyst expressions to be collation agnostic (i.e. collation-unaware). Also, we introduce two new internal catalyst expressions: CollationAwareMurmur3Hash and CollationAwareXxHash64, which are collation aware and take the collation of the string into consideration when hashing collated strings.

Furthermore, wherever a hash expression must be collation aware, we replace Murmur3Hash and XxHash64 with CollationAwareMurmur3Hash and CollationAwareXxHash64. This is necessary, for example, for hash partitioning, where strings that are equal under a collation must hash to the same value. Moreover, we change how the internal HiveHash expression hashes collated strings so that it is consistent with the other hashing expressions (HiveHash is meant to always be collation aware).
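
To illustrate the distinction, here is a minimal Python sketch, not Spark's implementation. It assumes a toy collation key (lowercasing stands in for a case-insensitive collation such as UTF8_LCASE) and uses BLAKE2b as a stand-in for Murmur3/xxHash64. The function names `collation_key`, `agnostic_hash`, `aware_hash`, and `partition_for` are hypothetical.

```python
import hashlib


def collation_key(s, collation):
    # Toy stand-in for a real collation key; only two collations are modeled here.
    if collation == "UTF8_LCASE":
        return s.lower()
    return s  # treated as binary: the string is its own key


def agnostic_hash(s):
    # Collation agnostic: hashes the raw UTF-8 bytes, ignoring any collation.
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def aware_hash(s, collation):
    # Collation aware: hashes the collation key, so strings that are
    # equal under the collation produce the same hash value.
    return agnostic_hash(collation_key(s, collation))


def partition_for(s, collation, num_partitions):
    # Hash partitioning needs the collation-aware hash so that equal keys
    # (under the collation) are routed to the same partition.
    return aware_hash(s, collation) % num_partitions
```

With these definitions, `"Spark"` and `"SPARK"` get different collation-agnostic hashes, but `partition_for` sends both to the same partition under the case-insensitive toy collation.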

Finally, we add a kill switch (the SQL config COLLATION_AGNOSTIC_HASHING_ENABLED) that restores the previous behavior of Murmur3Hash and XxHash64 as user-facing expressions. The kill switch has no effect on the new collation-aware hashing expressions or on the HiveHash expression, which are internal and must follow the new collation-aware behavior.
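
The intended effect of the kill switch can be sketched as follows. This is an illustrative Python model under stated assumptions, not the Spark code: the flag parameter `agnostic_enabled`, its default of `True`, and the helper names are hypothetical. Lowercasing stands in for a case-insensitive collation and BLAKE2b for the real hash functions.

```python
import hashlib


def _hash_bytes(data):
    # Stand-in 64-bit hash; Spark uses Murmur3 / xxHash64 instead.
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def _collation_key(s, collation):
    # Toy collation key: lowercase for the case-insensitive collation, else unchanged.
    return s.lower() if collation == "UTF8_LCASE" else s


def user_facing_hash(s, collation, agnostic_enabled=True):
    # Models the user-facing expressions: with the flag on, the collation is ignored;
    # with the flag off, the previous collation-aware behavior is restored.
    if agnostic_enabled:
        return _hash_bytes(s.encode("utf-8"))
    return _hash_bytes(_collation_key(s, collation).encode("utf-8"))


def internal_aware_hash(s, collation):
    # Models the internal collation-aware expressions: the flag is never consulted.
    return _hash_bytes(_collation_key(s, collation).encode("utf-8"))
```

In this model, turning the flag off makes `user_facing_hash` agree with `internal_aware_hash`, while the internal function behaves the same either way.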

Why are the changes needed?

The Murmur3Hash and XxHash64 catalyst expressions, when applied to collated strings, currently always take the collation of the string into account; that is, they are collation aware. This is not the correct behavior: these expressions should be collation agnostic by default.

Does this PR introduce any user-facing change?

Yes, see the detailed explanation above.

How was this patch tested?

Updated existing tests in the relevant suites: CollationFactorySuite, DistributionSuite, and HashExpressionsSuite. Also verified that CollationSuite passes.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jul 16, 2025
@uros-db
Contributor Author

uros-db commented Jul 16, 2025

@mkaravel @cloud-fan Please review.

@uros-db uros-db requested a review from cloud-fan July 17, 2025 15:05
2 participants