Skip to content

[Variant] WIP Tests for variant_get of shredded variants #7965

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jul 18, 2025

Which issue does this PR close?

Rationale for this change

In order to make progress on accessing shredded variants, I think we need some tests showing how they are used and constructed. I am hoping some tests will help us along in figuring out how to implement this efficiently

I am hoping someone can use these tests to start implement efficient variant_get for shredded values

What changes are included in this PR?

Implement varaint_get tests based on the examples in the arrow variant doc from @zeroshade (see apache/arrow#46908)

Are these changes tested?

They are only tests

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 18, 2025
@alamb
Copy link
Contributor Author

alamb commented Jul 18, 2025

FYI @Samyak2 @scovich @friendlymatthew and @klion26 and @carpecodeum as I think you are interested in this feature

Comment on lines +282 to +289
let struct_array = StructArrayBuilder::new()
.with_field("metadata", Arc::new(metadata))
.with_field("typed_value", Arc::new(typed_value))
.build();

Arc::new(
VariantArray::try_new(Arc::new(struct_array)).expect("should create variant array"),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understood of the spec, the value field needs to be present but it can have nulls. Did I understand it incorrectly? If not, this struct would be an invalid variant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reading of https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/ was that value was optional in the case the variant shredded "perfectly"

Screenshot 2025-07-18 at 3 04 18 PM

I may also be misreading it.

Maybe @zeroshade has some time to coment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shredding spec definitely says both columns are optional:

Both value and typed_value are optional fields used together to encode a single value.

Additionally:

Values in the two fields must be interpreted according to the following table:

value typed_value Meaning
null null The value is missing; only valid for shredded object fields
non-null null The value is present and may be any type, including null
null non-null The value is present and is the shredded type
non-null non-null The value is present and is a partially shredded object

As always in parquet, a physically missing column is interpreted as all-null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC @cashmand in case I totally misunderstood something

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, the intent is that a missing value column should be treated as equivalent to a present-but-all-null value column.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, both are optional fields, but it's required to have at least one of them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thanks all!

@@ -177,4 +192,209 @@ mod test {
r#"{"inner_field": 1234}"#,
);
}

/// Shredding: extract a value as a VariantArray
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR just adds tests here -- to show what I think the output of variant_get should be for different variant cases.

@@ -146,7 +146,7 @@ impl VariantArrayBuilder {
}

/// Append the [`Variant`] to the builder as the next row
pub fn append_variant(&mut self, variant: Variant) {
pub fn append_variant<'m, 'v>(&mut self, variant: impl Into<Variant<'m, 'v>>) {
// TODO make this more efficient by avoiding the intermediate buffers
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we implement the approach @scovich suggested about invoking primitive append_xxx methods directly instead of using intermediate VariantBuilder?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something like

pub fn append_variant<'m, 'v>(&mut self, variant: impl Into<Variant<'m, 'v>>) {
    let variant = variant.into();
    match variant {
        Variant::Int32(v) => self.append_int32(v),
        Variant::String(v) => self.append_string(v),
        Variant::Object(obj) => {
            // Build object directly into our buffers without intermediate VariantBuilder
            self.append_object_start();
            for field in obj.iter() {
                self.append_variant(field.value()); // recursive
            }
            self.append_object_end();
        }
        // ... other variants
    }
}

Copy link
Contributor Author

@alamb alamb Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are concerned about the intermediate buffers, I think after the following two PRs are merged:

There will be a way to append variant values to the final output buffer directly without copying to intermediate buffers

Is there another potential part of your question?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, will follow these PRs then

assert_eq!(result.len(), 3);

// Expect the values are the same as the original values
assert_eq!(result.value(0), Variant::Int32(1));
Copy link
Member

@klion26 klion26 Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the GetOptions points to the leaf node, will the result here be equal to the result in test get_variant_perfectly_shredded_int32_as_int32

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants