-
Notifications
You must be signed in to change notification settings - Fork 976
[Variant] Revisit VariantMetadata and Object equality #7961
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Variant] Revisit VariantMetadata and Object equality #7961
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While reviewing this I was thinking maybe we should revisit equality
I think what we are doing is trying to make Variant::eq
to compare if the Variants are logically equal (that is represent the same data), rather than comparing if the Varaints are physically equal (that is represented with the same bytes)
If we are trying to make Variant::eq compare logically, I think we shouldn't be comparing VariantMetadata
at all (as in remove this)
So comparing two VariantObjects should just be an exercise in comparing the fields individually.
dbg!(m.iter().collect::<Vec<_>>()); | ||
let v2 = Variant::new_with_metadata(m, &v); | ||
|
||
dbg!(v1.as_object().unwrap().iter().collect::<Vec<_>>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we still need the dbg!
?
&& self.validated == other.validated; | ||
|
||
let other_field_names: HashSet<&'m str> = HashSet::from_iter(other.iter()); | ||
self.is_empty() == other.is_empty() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is likely to be quite slow for nested variants as it will continually re-compare the same metadata. It is probably ok for now (and was not made worse by this PR)
I agree that whatever we do should not be merely physical byte comparisons... but what does logical equality even mean? As in, if two variant objects compare logically equal, what can I do with that newfound knowledge? What goes wrong if I treat two variant objects as "equal" when they are not? etc. Also, it seems like there are several pitfalls in the spec that we'll have to worry about when trying to define logical equivalence
That last part is my biggest worry -- if it's not safe to physically swap the bytes of two logically equivalent variant objects (because the field ids could go out of sync), what use is logical equivalence? I suspect this is why the Delta spec states that "variant is not a comparable data type" and does not support variant in any context that requires comparisons (partition columns, clustering, etc). I couldn't find any Databricks documentation specifying the behavior of variant comparisons.
I tend to agree that metadata dictionaries are not, by themselves, comparable. Any attempt to make them comparable would have to be partially physical and partly logical:
That way, two metadata dictionaries only compare equal if they contain the same strings and they assign the same field ids to those strings. Such a logical comparison makes it safe to swap the bytes of one metadata dictionary with the bytes of another that compares logically equal, e.g. to improve parquet dictionary encoding of the field. But I'm not sure that would happen often enough to be worth optimizing for? Especially because (for unordered metadata at least) one would likely want the ability to replace a metadata dictionary with a different one that provides a superset of field names (with matching field ids in the common part). |
In my mind, the biggest usecase for
This is a good point, but I suggest we make any 'equivalence class' based comparison explicit (like
Yes, let's do that
I think this is a function of how we are defining equality -- I think I would probably try and define equality so it did not include the contents of the dictionary (aka does not rely on the actual values of the |
So I really think it is important to be able to compare the logical value the Variant encodes for the purpose of tests. You can see almost all tests do this, and as we move into shredding it will be important to be able to compare the contents As you point out above, it is not 100% clear what the other uses of equality should be used for |
Hi, I think this all makes sense. Here's a plan of what I think we want to do:
For a later task, it would be interesting to define equivalence rules for variants using the equivalence classes that the spec states |
I think @scovich's comment about metadata equality is super interesting. I will think more about this I could imagine having such logical comparison could be useful. @alamb and I were discussing encoding a single metadata dictionary per parquet file. This would only be possible if we know whether every row in the metadata column have the "same" metadata dictionary |
8e88fdc
to
b3f5077
Compare
Interesting, I wonder if we scope this custom Since objects check whether all field entries bijectively map to another objects' field entries, it is quite easy to run into panics or even mutate the field entries. |
FWIW I think the major usecase for swapping varaint bytes is likely the creation of shredded values (copying the non shredded parts) In this case, I think what we should do is compare the VariantMetadata (by pointer) and if it is the same we can simply copy the bytes. Maybe we even have a special / optimized method that does this that is used by the shredding code |
de71b95
to
c40eda9
Compare
c40eda9
to
41f7d52
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @friendlymatthew -- I had a suggestion on how to make this code more efficient, but I think we can do it as a follow on PR as well
@@ -127,7 +125,7 @@ impl VariantMetadataHeader { | |||
/// | |||
/// [`Variant`]: crate::Variant | |||
/// [Variant Spec]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#metadata-encoding | |||
#[derive(Debug, Clone)] | |||
#[derive(Debug, Clone, PartialEq)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
// v2 is not sorted | ||
assert!(!v2.metadata().unwrap().is_sorted()); | ||
|
||
// object metadata are not the same |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not quite correct (doesn't perform a symmetric diff). Fortunately, fixing that bug also simplifies the implementation quite a bit.
@@ -263,7 +264,7 @@ impl<'m, 'v> VariantObject<'m, 'v> { | |||
let next_field_name = self.metadata.get(field_id)?; | |||
|
|||
if let Some(current_name) = current_field_name { | |||
if next_field_name <= current_name { | |||
if next_field_name < current_name { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug fix, right?
let other_fields: HashMap<&str, Variant> = HashMap::from_iter(other.iter()); | ||
|
||
for (field_name, variant) in self.iter() { | ||
match other_fields.get(field_name as &str) { | ||
Some(other_variant) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need a hash map at all -- IFF the two are valid and logically equal, they will have the same field names in the same order, because the spec requires the object fields to be sorted lexicographically. It should be enough to co-iterate over the two sets of field+value pairs after verifying the field counts match, e.g.
for ((name_a, value_a), (name_b, value_b)) in self.iter().zip(other.iter()) {
if name_a != name_b || value_a != value_b {
return false;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@friendlymatthew are you willing to make this change? I can give it a shot too if you prefer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I pushed the change in 26dd44b to keep this PR moving
I am not sure we need to perform a symmetric diff given there is a check for |
Good point. We can still get rid of the hash table by co-iterating tho? |
Done! |
🚀 |
Rationale for this change
If a variant has an unsorted dictionary, you can't assume fields are unique nor ordered by name. This PR updates the logical equality check among
VariantMetadata
to properly handle this case.validated
andis_fully_validated
flags doesn't need to be part of PartialEq #7952It also fixes a bug in #7934 where we do a uniqueness check when probing an unsorted dictionary