[SPARK-52828][SQL] Make hashing for collated strings collation agnostic #51521
+324
−70
What changes were proposed in this pull request?
We change the behavior of the Murmur3Hash and XxHash64 catalyst expressions to be collation agnostic (i.e. collation-unaware). We also introduce two new internal catalyst expressions, CollationAwareMurmur3Hash and CollationAwareXxHash64, which are collation aware and take the collation of the string into account when hashing collated strings.
Furthermore, we replace Murmur3Hash and XxHash64 with CollationAwareMurmur3Hash and CollationAwareXxHash64 in expressions where hashing should be collation aware. This is necessary, for example, when we do hash partitioning. Moreover, we change how the internal HiveHash expression hashes collated strings so that it is consistent with the rest of the hashing expressions (the HiveHash expression is meant to always be collation aware).
Finally, we add a kill switch (the SQL config COLLATION_AGNOSTIC_HASHING_ENABLED) that allows recovering the previous behavior of Murmur3Hash and XxHash64 as user-facing expressions. The kill switch has no effect on the new collation-aware hashing expressions or on the HiveHash expression, which are internal and need to follow the new collation-aware behavior.
Why are the changes needed?
The Murmur3Hash and XxHash64 catalyst expressions, when applied to collated strings, currently always take the collation of the string into account; that is, they are collation aware. This is not the correct behavior: these expressions should be collation agnostic by default.
Does this PR introduce any user-facing change?
Yes, see the detailed explanation above.
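As a rough illustration of the distinction between collation-agnostic and collation-aware hashing, here is a toy Python model. It is not Spark code: `zlib.crc32` stands in for Murmur3/xxHash64, lowercasing stands in for the key of a case-insensitive collation, and all function names are hypothetical.

```python
import zlib


def toy_collation_key(s: str) -> bytes:
    # Assumption: a case-insensitive collation modeled as lowercase + UTF-8.
    # Real collation keys are more involved than this.
    return s.lower().encode("utf-8")


def agnostic_hash(s: str) -> int:
    # Collation agnostic: hash the raw UTF-8 bytes and ignore the collation.
    return zlib.crc32(s.encode("utf-8"))


def aware_hash(s: str) -> int:
    # Collation aware: hash the collation key, so strings that compare equal
    # under the collation also hash to the same value.
    return zlib.crc32(toy_collation_key(s))


def toy_partition(s: str, num_partitions: int) -> int:
    # Hash partitioning must be collation aware; otherwise rows that are
    # equal under the collation could land in different partitions.
    return aware_hash(s) % num_partitions


if __name__ == "__main__":
    # Different raw bytes, so the agnostic hashes differ.
    print(agnostic_hash("abc") == agnostic_hash("ABC"))
    # Same collation key, so the aware hashes match.
    print(aware_hash("abc") == aware_hash("ABC"))
    # Collation-equal strings always share a partition.
    print(toy_partition("abc", 8) == toy_partition("ABC", 8))
```

In this model, the user-facing expressions would behave like `agnostic_hash` after the change, while internal uses such as hash partitioning would behave like `aware_hash`.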
How was this patch tested?
Updated existing tests in the relevant suites: CollationFactorySuite, DistributionSuite, and HashExpressionsSuite. Also verified that CollationSuite passes.
Was this patch authored or co-authored using generative AI tooling?
No.