[Variant] Revisit VariantMetadata and Object equality #7961

friendlymatthew · 2025-07-18T11:46:14Z

Rationale for this change

If a variant has an unsorted dictionary, you can't assume that field names are unique or ordered by name. This PR updates the logical equality check for VariantMetadata to handle this case properly.

Closes validated and is_fully_validated flags doesn't need to be part of PartialEq #7952

It also fixes a bug in #7934, where we do a uniqueness check when probing an unsorted dictionary.

friendlymatthew · 2025-07-18T11:47:27Z

cc @alamb @scovich

parquet-variant/src/variant/metadata.rs

alamb

While reviewing this I was thinking maybe we should revisit equality

I think what we are trying to do is make Variant::eq compare whether the Variants are logically equal (that is, represent the same data), rather than whether the Variants are physically equal (that is, are represented with the same bytes)

If we are trying to make Variant::eq compare logically, I think we shouldn't be comparing VariantMetadata at all (as in remove this)

https://github.com/apache/arrow-rs/blob/8e88fdc943ea94acbc7ff0a44fe94f3b19636c7b/parquet-variant/src/variant/object.rs#L417-L416

So comparing two VariantObjects should just be an exercise in comparing the fields individually.

alamb · 2025-07-18T13:06:27Z

parquet-variant/src/variant/object.rs

+        dbg!(m.iter().collect::<Vec<_>>());
+        let v2 = Variant::new_with_metadata(m, &v);
+
+        dbg!(v1.as_object().unwrap().iter().collect::<Vec<_>>());


do we still need the dbg!?

alamb · 2025-07-18T13:08:17Z

parquet-variant/src/variant/metadata.rs

-            && self.validated == other.validated;
-
-        let other_field_names: HashSet<&'m str> = HashSet::from_iter(other.iter());
+        self.is_empty() == other.is_empty()


This is likely to be quite slow for nested variants as it will continually re-compare the same metadata. It is probably ok for now (and was not made worse by this PR)

scovich · 2025-07-18T16:12:24Z

While reviewing this I was thinking maybe we should revisit equality

I think what we are trying to do is make Variant::eq compare whether the Variants are logically equal (that is, represent the same data), rather than whether the Variants are physically equal (that is, are represented with the same bytes)

I agree that whatever we do should not be merely physical byte comparisons... but what does logical equality even mean? As in, if two variant objects compare logically equal, what can I do with that newfound knowledge? What goes wrong if I treat two variant objects as "equal" when they are not? etc.

Also, it seems like there are several pitfalls in the spec that we'll have to worry about when trying to define logical equivalence

Primitive values -- the variant spec defines a notion of equivalence class that would almost certainly consider logical equality as a "user expression":

User expressions operating on an int8 value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.

... but int, float and double are all in different equivalence classes, so by a strict reading of the spec 1f32 is not equal to 1f64 is not equal to 1i8 -- even tho they are exactly the same value (???)
Short string -- The spec requires that

operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.

... so we'd have to add that special case to any logical comparison
Array -- is_large and field_offset_size_minus_one header fields both change the physical layout of the bytes that follow, but do not impact logical equivalence. Further, one must (recursively) logically compare each element as well, because equivalent objects can have physically different bytes.
Object -- Same considerations as Array, but also with field_id_size_minus_one. Additionally, field ids become highly problematic:
- One could claim that two objects are the same if their field ids are the same. But that's only true if they both were encoded using the same metadata dictionary!
- One could claim that two objects are not the same if their field ids differ. But that wouldn't necessarily be true if they were encoded with different metadata dictionaries. It might not even be true when both were encoded with the same (unordered) metadata dictionary, because the same string could appear multiple times with different field ids.
- One could do actual string comparisons, probing the objects' respective metadata dictionaries. But to what end? Even if they compare logically equal, it's not safe to treat them as equivalent (e.g. by copying a field's bytes from one variant object to another), because the field ids would need to be rewritten -- recursively -- or they become garbage.
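To make the field id pitfalls above concrete, here is a minimal sketch. It assumes a toy model in which a metadata dictionary is a slice of names indexed by field id and an object is a list of `(field_id, value)` pairs. The `resolve` helper is hypothetical and is not part of the parquet-variant crate.

```rust
/// Hypothetical helper: resolve each field id to its name through the given
/// dictionary, then sort by name so two objects can be compared positionally.
/// `dict[id]` is the name assigned to field id `id`.
fn resolve<'d>(dict: &[&'d str], fields: &[(usize, i64)]) -> Vec<(&'d str, i64)> {
    let mut named: Vec<(&'d str, i64)> = fields
        .iter()
        .map(|&(id, value)| (dict[id], value))
        .collect();
    named.sort();
    named
}

/// Comparing raw field ids: only meaningful when both objects were encoded
/// against the same dictionary.
fn ids_equal(a: &[(usize, i64)], b: &[(usize, i64)]) -> bool {
    a == b
}

/// Comparing resolved names: independent of how each dictionary assigns ids.
fn names_equal(
    dict_a: &[&str],
    a: &[(usize, i64)],
    dict_b: &[&str],
    b: &[(usize, i64)],
) -> bool {
    resolve(dict_a, a) == resolve(dict_b, b)
}
```

With dictionaries `["a", "b"]` and `["b", "a"]`, the same id list `[(0, 1), (1, 2)]` means `{a: 1, b: 2}` under the first and `{b: 1, a: 2}` under the second, so `ids_equal` reports a match that `names_equal` rejects. The sketch does nothing about the byte-swapping concern; it only shows why ids alone cannot decide equality.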

That last part is my biggest worry -- if it's not safe to physically swap the bytes of two logically equivalent variant objects (because the field ids could go out of sync), what use is logical equivalence? I suspect this is why the Delta spec states that "variant is not a comparable data type" and does not support variant in any context that requires comparisons (partition columns, clustering, etc). I couldn't find any Databricks documentation specifying the behavior of variant comparisons.

If we are trying to make Variant::eq compare logically, I think we shouldn't be comparing VariantMetadata at all (as in remove this)

I tend to agree that metadata dictionaries are not, by themselves, comparable. Any attempt to make them comparable would have to be partially physical and partly logical:

- The value of sorted_strings must match. Technically, it's possible an unsorted one could be equivalent to a sorted one, if its strings happened to be sorted... that's a rare enough case I don't think it's worth optimizing for.
- offset_size_minus_one should not influence the comparison
- dictionary_size must match
- The unpacked offsets must all match (they may not be encoded using the same number of bytes)
- The bytes must all match

That way, two metadata dictionaries only compare equal if they contain the same strings and they assign the same field ids to those strings. Such a logical comparison makes it safe to swap the bytes of one metadata dictionary with the bytes of another that compares logically equal, e.g. to improve parquet dictionary encoding of the field. But I'm not sure that would happen often enough to be worth optimizing for? Especially because (for unordered metadata at least) one would likely want the ability to replace a metadata dictionary with a different one that provides a superset of field names (with matching field ids in the common part).
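The rules listed above could be sketched roughly as follows. This assumes the dictionary has already been decoded into a hypothetical `DecodedMetadata` struct whose fields mirror the header fields named in the list. It is not the crate's `VariantMetadata` type, and the decoding step is left out.

```rust
/// Hypothetical, already-decoded view of a metadata dictionary.
#[allow(dead_code)] // `offset_size_minus_one` is deliberately never compared
struct DecodedMetadata {
    sorted_strings: bool,
    /// Physical encoding detail only; must not influence the comparison.
    offset_size_minus_one: u8,
    /// Offsets unpacked to a fixed width, regardless of their encoded width.
    offsets: Vec<u32>,
    /// The concatenated string bytes.
    bytes: Vec<u8>,
}

/// True when both dictionaries contain the same strings and assign the same
/// field ids to them, ignoring how wide the offsets were encoded.
fn dictionaries_interchangeable(a: &DecodedMetadata, b: &DecodedMetadata) -> bool {
    a.sorted_strings == b.sorted_strings
        // Equal offset counts imply equal dictionary sizes.
        && a.offsets.len() == b.offsets.len()
        && a.offsets == b.offsets
        && a.bytes == b.bytes
}
```

Two dictionaries that differ only in `offset_size_minus_one` compare equal here, while two that hold the same strings in a different order do not, because the ids assigned to those strings would differ.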

alamb · 2025-07-18T17:48:03Z

I agree that whatever we do should not be merely physical byte comparisons... but what does logical equality even mean? As in, if two variant objects compare logically equal, what can I do with that newfound knowledge? What goes wrong if I treat two variant objects as "equal" when they are not? etc.

In my mind, the biggest usecase for eq is unit tests -- when we want to assert that the content of two Variants is equal -- like encoding expected results from converting from JSON to variant

the variant spec defines a notion of equivalence class that would almost certainly consider logical equality as a "user expression":

This is a good point, but I suggest we make any 'equivalence class' based comparison explicit (like Variant::class_eq or something when the need arises)

Short string -- The spec requires that

Yes, let's do that

One could claim that two objects are the same if their field ids are the same. But that's only true if they both were encoded using the same metadata dictionary!

I think this is a function of how we are defining equality -- I think I would probably try and define equality so it did not include the contents of the dictionary (aka does not rely on the actual values of the field_ids)
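As an illustration of the short string point above, a hand-written PartialEq can treat the two string encodings as one logical value while keeping other types strict. The `Value` enum below is a hypothetical stand-in, not the crate's `Variant` type.

```rust
/// Hypothetical stand-in for a variant value with two string encodings.
#[derive(Debug)]
enum Value<'a> {
    Int8(i8),
    ShortString(&'a str),
    LongString(&'a str),
}

impl PartialEq for Value<'_> {
    fn eq(&self, other: &Self) -> bool {
        use Value::*;
        match (self, other) {
            (Int8(a), Int8(b)) => a == b,
            // Both string encodings carry the same logical value, so the
            // encoding used on either side is ignored.
            (ShortString(a) | LongString(a), ShortString(b) | LongString(b)) => a == b,
            // Everything else stays strict: no cross-type equivalence here.
            _ => false,
        }
    }
}
```

Under this sketch `Value::ShortString("hello") == Value::LongString("hello")` holds, while an integer never equals a string. Any equivalence-class comparison across numeric types would live in a separate, explicit method, as suggested above.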

alamb · 2025-07-18T17:50:15Z

So I really think it is important to be able to compare the logical value the Variant encodes for the purpose of tests. You can see almost all tests do this, and as we move into shredding it will be important to be able to compare the contents

As you point out above, it is not 100% clear what the other uses of equality should be used for

friendlymatthew · 2025-07-19T09:00:51Z

Hi, I think this all makes sense. Here's a plan of what I think we want to do:

- Make VariantObject's PartialEq implementation purely logical: essentially, check whether two objects have the same field name to Variant pairings.
- Refactor VariantMetadata's PartialEq implementation to follow [Variant] Revisit VariantMetadata and Object equality #7961 (comment)

For a later task, it would be interesting to define equivalence rules for variants using the equivalence classes that the spec states

friendlymatthew · 2025-07-19T09:06:10Z

That way, two metadata dictionaries only compare equal if they contain the same strings and they assign the same field ids to those strings. Such a logical comparison makes it safe to swap the bytes of one metadata dictionary with the bytes of another that compares logically equal, e.g. to improve parquet dictionary encoding of the field. But I'm not sure that would happen often enough to be worth optimizing for? Especially because (for unordered metadata at least) one would likely want the ability to replace a metadata dictionary with a different one that provides a superset of field names (with matching field ids in the common part).

I think @scovich's comment about metadata equality is super interesting. I will think more about this

I could imagine such a logical comparison being useful. @alamb and I were discussing encoding a single metadata dictionary per parquet file. This would only be possible if we know that every row in the metadata column has the "same" metadata dictionary

friendlymatthew · 2025-07-19T09:29:39Z

That last part is my biggest worry -- if it's not safe to physically swap the bytes of two logically equivalent variant objects (because the field ids could go out of sync), what use is logical equivalence?

Interesting, I wonder if we should scope this custom PartialEq implementation on VariantObject purely to tests.

Since objects check whether all field entries bijectively map to another object's field entries, it is quite easy to run into panics or even mutate the field entries.

alamb · 2025-07-19T11:18:42Z

That last part is my biggest worry -- if it's not safe to physically swap the bytes of two logically equivalent variant objects (because the field ids could go out of sync), what use is logical equivalence?

Interesting, I wonder if we should scope this custom PartialEq implementation on VariantObject purely to tests.

FWIW I think the major usecase for swapping variant bytes is likely the creation of shredded values (copying the non shredded parts)

In this case, I think what we should do is compare the VariantMetadata (by pointer) and if it is the same we can simply copy the bytes. Maybe we even have a special / optimized method that does this that is used by the shredding code
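A minimal sketch of that idea, assuming the metadata is held as a byte slice: `std::ptr::eq` on two slices compares both the address and the length, so it only succeeds when both sides view the very same buffer. The helper names below are hypothetical and do not exist in the crate.

```rust
/// Hypothetical helper: true only when both slices point at the same memory
/// with the same length (identity, not content equality).
fn same_buffer(a: &[u8], b: &[u8]) -> bool {
    std::ptr::eq(a, b)
}

/// Hypothetical fast path: copy a field's bytes verbatim only when source and
/// destination share one metadata buffer, so the field ids stay valid.
/// Returns false and copies nothing otherwise.
fn copy_if_same_metadata(
    src_metadata: &[u8],
    dst_metadata: &[u8],
    field_bytes: &[u8],
    out: &mut Vec<u8>,
) -> bool {
    if same_buffer(src_metadata, dst_metadata) {
        out.extend_from_slice(field_bytes);
        true
    } else {
        false
    }
}
```

The check is conservative: two separately allocated dictionaries with identical contents fail it, and the caller would then have to fall back to a slower path that rewrites field ids.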

alamb

Thanks @friendlymatthew -- I had a suggestion on how to make this code more efficient, but I think we can do it as a follow on PR as well

alamb · 2025-07-21T15:11:58Z

parquet-variant/src/variant/metadata.rs

@@ -127,7 +125,7 @@ impl VariantMetadataHeader {
 ///
 /// [`Variant`]: crate::Variant
 /// [Variant Spec]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#metadata-encoding
-#[derive(Debug, Clone)]
+#[derive(Debug, Clone, PartialEq)]


parquet-variant/src/variant/object.rs

alamb · 2025-07-21T15:16:12Z

parquet-variant/src/variant/object.rs

        // v2 is not sorted
        assert!(!v2.metadata().unwrap().is_sorted());

+        // object metadata are not the same


scovich

Still not quite correct (doesn't perform a symmetric diff). Fortunately, fixing that bug also simplifies the implementation quite a bit.

scovich · 2025-07-21T19:53:58Z

parquet-variant/src/variant/object.rs

@@ -263,7 +264,7 @@ impl<'m, 'v> VariantObject<'m, 'v> {
                    let next_field_name = self.metadata.get(field_id)?;

                    if let Some(current_name) = current_field_name {
-                        if next_field_name <= current_name {
+                        if next_field_name < current_name {


Bug fix, right?

scovich · 2025-07-21T20:00:52Z

parquet-variant/src/variant/object.rs

        let other_fields: HashMap<&str, Variant> = HashMap::from_iter(other.iter());

        for (field_name, variant) in self.iter() {
            match other_fields.get(field_name as &str) {
                Some(other_variant) => {


We don't need a hash map at all -- IFF the two are valid and logically equal, they will have the same field names in the same order, because the spec requires the object fields to be sorted lexicographically. It should be enough to co-iterate over the two sets of field+value pairs after verifying the field counts match, e.g.

for ((name_a, value_a), (name_b, value_b)) in self.iter().zip(other.iter()) {
    if name_a != name_b || value_a != value_b {
        return false;
    }
}

@friendlymatthew are you willing to make this change? I can give it a shot too if you prefer

I pushed the change in 26dd44b to keep this PR moving
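For readers following along, the co-iteration approach can be sketched in isolation like this. It assumes a hypothetical `Object` whose fields are already resolved to names and sorted lexicographically; the real VariantObject resolves names through its metadata and holds Variant values rather than integers.

```rust
/// Hypothetical stand-in for a variant object: (name, value) pairs that are
/// already sorted lexicographically by name.
struct Object<'a> {
    fields: Vec<(&'a str, i64)>,
}

impl PartialEq for Object<'_> {
    fn eq(&self, other: &Self) -> bool {
        // `zip` stops at the shorter side, so without this check an object
        // would compare equal to any longer object that it is a prefix of.
        if self.fields.len() != other.fields.len() {
            return false;
        }
        // Both sides are sorted by name, so equal objects line up pairwise.
        self.fields
            .iter()
            .zip(other.fields.iter())
            .all(|((name_a, value_a), (name_b, value_b))| {
                name_a == name_b && value_a == value_b
            })
    }
}
```

No hash map is needed because the ordering guarantee does the matching work. The sketch relies on that guarantee and would give wrong answers for objects whose fields are not sorted.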

alamb · 2025-07-21T23:04:53Z

Still not quite correct (doesn't perform a symmetric diff). Fortunately, fixing that bug also simplifies the implementation quite a bit.

I am not sure we need to perform a symmetric diff given there is a check for len and then each element is verified 🤔

scovich · 2025-07-22T02:25:27Z

Still not quite correct (doesn't perform a symmetric diff). Fortunately, fixing that bug also simplifies the implementation quite a bit.

I am not sure we need to perform a symmetric diff given there is a check for len and then each element is verified 🤔

Good point. We can still get rid of the hash table by co-iterating tho?

alamb · 2025-07-22T21:56:38Z

Still not quite correct (doesn't perform a symmetric diff). Fortunately, fixing that bug also simplifies the implementation quite a bit.

I am not sure we need to perform a symmetric diff given there is a check for len and then each element is verified 🤔

Good point. We can still get rid of the hash table by co-iterating tho?

Done!

alamb · 2025-07-22T22:04:49Z

🚀

github-actions bot added the parquet Changes to the parquet crate label Jul 18, 2025

friendlymatthew mentioned this pull request Jul 18, 2025

[Variant] Avoid collecting offset iterator #7934

Merged

friendlymatthew commented Jul 18, 2025

alamb mentioned this pull request Jul 18, 2025

[Variant] Impl PartialEq for VariantObject #7943

Merged

alamb reviewed Jul 18, 2025

Revisit VariantMetadata and Object equality

b3f5077

friendlymatthew force-pushed the friendlymatthew/metadata-eq branch from 8e88fdc to b3f5077 Compare July 19, 2025 09:07

friendlymatthew force-pushed the friendlymatthew/metadata-eq branch from de71b95 to c40eda9 Compare July 20, 2025 10:20

Logically compare objects, remove custom partial eq impl for metadata

41f7d52

friendlymatthew force-pushed the friendlymatthew/metadata-eq branch from c40eda9 to 41f7d52 Compare July 20, 2025 10:22

alamb approved these changes Jul 21, 2025

Eagerly return false during equality cmp

dc32619

scovich reviewed Jul 21, 2025

alamb mentioned this pull request Jul 22, 2025

Partial eq variant no validation #7957

Open

alamb added 2 commits July 22, 2025 17:54

Avoid HashMap on equality check

26dd44b

Merge remote-tracking branch 'apache/main' into friendlymatthew/metad…

9b4ec17

…ata-eq

alamb merged commit f39461c into apache:main Jul 22, 2025
12 checks passed
