Skip to content

[Variant] Revisit VariantMetadata and Object equality #7961

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jul 22, 2025

Conversation

friendlymatthew
Copy link
Contributor

@friendlymatthew friendlymatthew commented Jul 18, 2025

Rationale for this change

If a variant has an unsorted dictionary, you can't assume fields are unique nor ordered by name. This PR updates the logical equality check among VariantMetadata to properly handle this case.

It also fixes a bug in #7934 where we do a uniqueness check when probing an unsorted dictionary

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 18, 2025
@friendlymatthew
Copy link
Contributor Author

cc @alamb @scovich

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While reviewing this I was thinking maybe we should revisit equality

I think what we are doing is trying to make Variant::eq to compare if the Variants are logically equal (that is represent the same data), rather than comparing if the Varaints are physically equal (that is represented with the same bytes)

If we are trying to make Variant::eq compare logically, I think we shouldn't be comparing VariantMetadata at all (as in remove this)

https://github.com/apache/arrow-rs/blob/8e88fdc943ea94acbc7ff0a44fe94f3b19636c7b/parquet-variant/src/variant/object.rs#L417-L416

So comparing two VariantObjects should just be an exercise in comparing the fields individually.

dbg!(m.iter().collect::<Vec<_>>());
let v2 = Variant::new_with_metadata(m, &v);

dbg!(v1.as_object().unwrap().iter().collect::<Vec<_>>());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need the dbg!?

&& self.validated == other.validated;

let other_field_names: HashSet<&'m str> = HashSet::from_iter(other.iter());
self.is_empty() == other.is_empty()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is likely to be quite slow for nested variants as it will continually re-compare the same metadata. It is probably ok for now (and was not made worse by this PR)

@scovich
Copy link
Contributor

scovich commented Jul 18, 2025

While reviewing this I was thinking maybe we should revisit equality

I think what we are doing is trying to make Variant::eq to compare if the Variants are logically equal (that is represent the same data), rather than comparing if the Varaints are physically equal (that is represented with the same bytes)

I agree that whatever we do should not be merely physical byte comparisons... but what does logical equality even mean? As in, if two variant objects compare logically equal, what can I do with that newfound knowledge? What goes wrong if I treat two variant objects as "equal" when they are not? etc.

Also, it seems like there are several pitfalls in the spec that we'll have to worry about when trying to define logical equivalence

  • Primitive values -- the variant spec defines a notion of equivalence class that would almost certainly consider logical equality as a "user expression":

    User expressions operating on an int8 value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.

    ... but int, float and double are all in different equivalence classes, so by a strict reading of the spec 1f32 is not equal to 1f64 is not equal to 1i8 -- even tho they are exactly the same value (???)

  • Short string -- The spec requires that

    operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.

    ... so we'd have to add that special case to any logical comparison

  • Array -- is_large and field_offset_size_minus_one header fields both change the physical layout of the bytes that follow, but do not impact logical equivalence. Further, one must (recursively) logically compare each element as well, because equivalent objects can have physically different bytes.

  • Object -- Same considerations as Array, but also with field_id_size_minus_one. Additionally, field ids become highly problematic:

    • One could claim that two objects are the same if their field ids are the same. But that's only true if they both were encoded using the same metadata dictionary!
    • One could claim that two objects are not the same if their field ids differ. But that wouldn't necessarily be true if they were encoded with different metadata dictionaries. It might not even be true when both were encoded with the same (unordered) metadata dictionary, because the same string could appear multiple times with different field ids.
    • One could do actual string comparisons, probing the objects' respective metadata dictionaries. But to what end? Even if they compare logically equal, it's not safe to treat them as equivalent (e.g. by copying a field's bytes from variant object to another), because the field ids would need to be rewritten -- recursively -- or they become garbage.

That last part is my biggest worry -- if it's not safe to physically swap the bytes of two logically equivalent variant objects (because the field ids could go out of sync), what use is logical equivalence? I suspect this is why the Delta spec states that "variant is not a comparable data type" and does not support variant in any context that requires comparisons (partition columns, clustering, etc). I couldn't find any Databricks documentation specifying the behavior of variant comparisons.

If we are trying to make Variant::eq compare logically, I think we shouldn't be comparing VariantMetadata at all (as in remove this)

I tend to agree that metadata dictionaries are not, by themselves, comparable. Any attempt to make them comparable would have to be partially physical and partly logical:

  • value of sorted_strings must match. Technically, it's possible an unsorted one could be equivalent to a sorted one, if its strings happened to be sorted... that's a rare enough case I don't think it's worth optimizing for.
  • offset_size_minus_one should not influence the comparison
  • dictionary_size must match
  • the unpacked offsets must all match (they may not be encoded using the same number of bytes)
  • the bytes must all match

That way, two metadata dictionaries only compare equal if they contain the same strings and they assign the same field ids to those strings. Such a logical comparison makes it safe to swap the bytes of one metadata dictionary with the bytes of another that compares logically equal, e.g. to improve parquet dictionary encoding of the field. But I'm not sure that would happen often enough to be worth optimizing for? Especially because (for unordered metadata at least) one would likely want the ability to replace a metadata dictionary with a different one that provides a superset of field names (with matching field ids in the common part).

@alamb
Copy link
Contributor

alamb commented Jul 18, 2025

I agree that whatever we do should not be merely physical byte comparisons... but what does logical equality even mean? As in, if two variant objects compare logically equal, what can I do with that newfound knowledge? What goes wrong if I treat two variant objects as "equal" when they are not? etc.

In my mind, the biggest usecase for eq is unit tests -- when we want to assert that the content of two Variants is equal -- like encoding expected results from converting from JSON to variant

the variant spec defines a notion of equivalence class that would almost certainly consider logical equality as a "user expression":

This is a good point, but I suggest we make any 'equivalence class' based comparison explicit (like Variant::class_eq or something when the need arises)

Short string -- The spec requires that

Yes, let's do that

One could claim that two objects are the same if their field ids are the same. But that's only true if they both were encoded using the same metadata dictionary!

I think this is a function of how we are defining equality -- I think I would probably try and define equality so it did not include the contents of the dictionary (aka does not rely on the actual values of the field_ids)

@alamb
Copy link
Contributor

alamb commented Jul 18, 2025

So I really think it is important to be able to compare the logical value the Variant encodes for the purpose of tests. You can see almost all tests do this, and as we move into shredding it will be important to be able to compare the contents

As you point out above, it is not 100% clear what the other uses of equality should be used for

@friendlymatthew
Copy link
Contributor Author

friendlymatthew commented Jul 19, 2025

Hi, I think this all makes sense. Here's a plan of what I think we want to do:

For a later task, it would be interesting to define equivalence rules for variants using the equivalence classes that the spec states

@friendlymatthew
Copy link
Contributor Author

That way, two metadata dictionaries only compare equal if they contain the same strings and they assign the same field ids to those strings. Such a logical comparison makes it safe to swap the bytes of one metadata dictionary with the bytes of another that compares logically equal, e.g. to improve parquet dictionary encoding of the field. But I'm not sure that would happen often enough to be worth optimizing for? Especially because (for unordered metadata at least) one would likely want the ability to replace a metadata dictionary with a different one that provides a superset of field names (with matching field ids in the common part).

I think @scovich's comment about metadata equality is super interesting. I will think more about this

I could imagine having such logical comparison could be useful. @alamb and I were discussing encoding a single metadata dictionary per parquet file. This would only be possible if we know whether every row in the metadata column have the "same" metadata dictionary

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/metadata-eq branch from 8e88fdc to b3f5077 Compare July 19, 2025 09:07
@friendlymatthew
Copy link
Contributor Author

That last part is my biggest worry -- if it's not safe to physically swap the bytes of two logically equivalent variant objects (because the field ids could go out of sync), what use is logical equivalence?

Interesting, I wonder if we scope this custom PartialEq implementation on VariantObject to purely for tests.

Since objects check whether all field entries bijectively map to another objects' field entries, it is quite easy to run into panics or even mutate the field entries.

@alamb
Copy link
Contributor

alamb commented Jul 19, 2025

That last part is my biggest worry -- if it's not safe to physically swap the bytes of two logically equivalent variant objects (because the field ids could go out of sync), what use is logical equivalence?

Interesting, I wonder if we scope this custom PartialEq implementation on VariantObject to purely for tests.

FWIW I think the major usecase for swapping varaint bytes is likely the creation of shredded values (copying the non shredded parts)

In this case, I think what we should do is compare the VariantMetadata (by pointer) and if it is the same we can simply copy the bytes. Maybe we even have a special / optimized method that does this that is used by the shredding code

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/metadata-eq branch from de71b95 to c40eda9 Compare July 20, 2025 10:20
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/metadata-eq branch from c40eda9 to 41f7d52 Compare July 20, 2025 10:22
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @friendlymatthew -- I had a suggestion on how to make this code more efficient, but I think we can do it as a follow on PR as well

@@ -127,7 +125,7 @@ impl VariantMetadataHeader {
///
/// [`Variant`]: crate::Variant
/// [Variant Spec]: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#metadata-encoding
#[derive(Debug, Clone)]
#[derive(Debug, Clone, PartialEq)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

// v2 is not sorted
assert!(!v2.metadata().unwrap().is_sorted());

// object metadata are not the same
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not quite correct (doesn't perform a symmetric diff). Fortunately, fixing that bug also simplifies the implementation quite a bit.

@@ -263,7 +264,7 @@ impl<'m, 'v> VariantObject<'m, 'v> {
let next_field_name = self.metadata.get(field_id)?;

if let Some(current_name) = current_field_name {
if next_field_name <= current_name {
if next_field_name < current_name {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug fix, right?

Comment on lines 420 to 424
let other_fields: HashMap<&str, Variant> = HashMap::from_iter(other.iter());

for (field_name, variant) in self.iter() {
match other_fields.get(field_name as &str) {
Some(other_variant) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need a hash map at all -- IFF the two are valid and logically equal, they will have the same field names in the same order, because the spec requires the object fields to be sorted lexicographically. It should be enough to co-iterate over the two sets of field+value pairs after verifying the field counts match, e.g.

for ((name_a, value_a), (name_b, value_b)) in self.iter().zip(other.iter()) {
    if name_a != name_b || value_a != value_b {
        return false;
    }
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@friendlymatthew are you willing to make this change? I can give it a shot too if you prefer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed the change in 26dd44b to keep this PR moving

@alamb
Copy link
Contributor

alamb commented Jul 21, 2025

Still not quite correct (doesn't perform a symmetric diff). Fortunately, fixing that bug also simplifies the implementation quite a bit.

I am not sure we need to perform a symmetric diff given there is a check for len and then each element is verified 🤔

@scovich
Copy link
Contributor

scovich commented Jul 22, 2025

Still not quite correct (doesn't perform a symmetric diff). Fortunately, fixing that bug also simplifies the implementation quite a bit.

I am not sure we need to perform a symmetric diff given there is a check for len and then each element is verified 🤔

Good point. We can still get rid of the hash table by co-iterating tho?

@alamb
Copy link
Contributor

alamb commented Jul 22, 2025

Still not quite correct (doesn't perform a symmetric diff). Fortunately, fixing that bug also simplifies the implementation quite a bit.

I am not sure we need to perform a symmetric diff given there is a check for len and then each element is verified 🤔

Good point. We can still get rid of the hash table by co-iterating tho?

Done!

@alamb alamb merged commit f39461c into apache:main Jul 22, 2025
12 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 22, 2025

🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

validated and is_fully_validated flags doesn't need to be part of PartialEq
3 participants