[Variant] Define shredding schema for VariantArrayBuilder #7921
Conversation
// if !value_field.is_nullable() {
//     return Err(ArrowError::InvalidArgumentError(
//         "Expected value field to be nullable".to_string(),
//     ));
// }
I did not see anything in the shredding spec that explicitly states `value` cannot be nullable. Same thing with `typed_value` below.
Spec says that both can be nullable:

required group measurement (VARIANT) {
  required binary metadata;
  optional binary value;
  optional int64 typed_value;
}
What I don't know is what should happen with `metadata` in a variant that is shredded as a deeply nested struct, where one or more of those struct fields happen to be variant-typed (which in turn can also be shredded further, and which can also contain variant fields of their own)?
v: VARIANT {
  metadata: BINARY,
  value: BINARY,
  typed_value: {
    a: STRUCT {
      b: STRUCT {
        c: STRUCT {
          w: VARIANT {
            metadata: BINARY, <<--- ???
            value: BINARY,
            typed_value: STRUCT {
              x: STRUCT {
                y: STRUCT {
                  z: STRUCT {
                    u: VARIANT {
                      metadata: BINARY, <<--- ???
                      value: BINARY,
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
The spec says that
Variant metadata is stored in the top-level Variant group in a binary metadata column regardless of whether the Variant value is shredded.
All value columns within the Variant must use the same metadata. All field names of a Variant, whether shredded or not, must be present in the metadata.
I'm pretty sure that means `w` and `u` must not have `metadata` columns -- because they are still "inside" `v`.

Even if one tried to store path names of `u` inside all three `metadata` columns, the field ids would disagree unless we forced `u.metadata` and `w.metadata` to be copies of `v.metadata`. Easy enough to do that in arrow-rs (all array data are anyway Arc), but what about the actual parquet file??
I'm not sure the spec is 100% clear on this one, unfortunately. Maybe we need to ask the parquet-variant folks for clarification and/or refinement of the spec's wording?
There should only be one `metadata` field, at the top level. So a `metadata` field at `w` is not allowed.

Also, the intermediate `a.b.c` fields are not valid for a shredded variant. Every shredded field needs a new `typed_value` and/or `value` field directly under it. So the path in parquet for `v:a.b.c.w.x.y.z` should be `a.typed_value.b.typed_value.c.typed_value.w.typed_value.x.typed_value.y.typed_value.z.typed_value.u.value`.
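The path-building rule above can be sketched with a tiny standalone helper (hypothetical, not part of the PR or arrow-rs): every intermediate shredded field descends through a `typed_value` group, and the unshredded leaf is read from its `value` column.

```rust
// Hypothetical helper illustrating the physical-path rule for shredded
// variant fields: each logical field adds a `typed_value` hop, except the
// final (unshredded) field, which is stored as variant binary in `value`.
fn physical_path(logical_fields: &[&str]) -> String {
    if logical_fields.is_empty() {
        return String::new();
    }
    let last = logical_fields.len() - 1;
    let mut parts: Vec<String> = Vec::new();
    for (i, field) in logical_fields.iter().enumerate() {
        parts.push((*field).to_string());
        // intermediate fields are shredded further; the leaf falls back to `value`
        parts.push(if i == last { "value" } else { "typed_value" }.to_string());
    }
    parts.join(".")
}

fn main() {
    // reproduces the v:a.b.c.w.x.y.z.u example from the comment above
    println!("{}", physical_path(&["a", "b", "c", "w", "x", "y", "z", "u"]));
}
```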
The relevant paragraph from the spec for my second comment:
Each shredded field in the typed_value group is represented as a required group that contains optional value and typed_value fields.
The value field stores the value as Variant-encoded binary when the typed_value cannot represent the field. This layout enables readers to skip data based on the field statistics for value and typed_value.
The typed_value field may be omitted when not shredding fields as a specific type.
One more slightly pedantic clarification: the types of `w` and `u` in parquet are not VARIANT (i.e. only `v` is annotated with the Variant logical type). `w` and `u` are just shredded fields that happen to not have a `typed_value` field, only `value`.
if metadata_field.is_nullable() {
    return Err(ArrowError::InvalidArgumentError(
        "Invalid VariantArray: metadata field can not be nullable".to_string(),
    ));
}
I make sure to check metadata is not nullable. But I wonder if we should remove this. You could imagine a user wanting to use the same metadata throughout the entire building process?
I think for variant columns nested inside a shredded variant, we must not have a `metadata` column? See https://github.com/apache/arrow-rs/pull/7921/files#r2204684924 above.
Yes, that is how I understood it as well. `validate_shredded_schema` is called at the top level; if nested schemas exist, we recursively call `validate_value_and_typed_value`. This way, we only validate the metadata column once, at the top level.
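The recursion described here can be sketched with simplified stand-in types (these are not the arrow-rs `Fields`; the names and shapes below are assumptions for illustration): the top-level entry point checks `metadata` exactly once, and the recursive helper rejects any nested `metadata` column.

```rust
// Simplified model of the validation split described above (not the PR's
// actual types): metadata is validated once at the top level only, and
// value/typed_value groups are validated recursively.
#[derive(Debug)]
enum Field {
    Metadata,
    Value,
    TypedValue(Vec<Field>), // a shredded group containing nested fields
}

fn validate_shredded_schema(fields: &[Field]) -> Result<(), String> {
    // exactly one metadata column is allowed, and only at the top level
    let n_meta = fields.iter().filter(|f| matches!(f, Field::Metadata)).count();
    if n_meta != 1 {
        return Err("expected exactly one top-level metadata column".to_string());
    }
    validate_value_and_typed_value(fields, true)
}

fn validate_value_and_typed_value(fields: &[Field], top_level: bool) -> Result<(), String> {
    for field in fields {
        match field {
            Field::Metadata if !top_level => {
                return Err("nested metadata column is not allowed".to_string());
            }
            Field::TypedValue(children) => validate_value_and_typed_value(children, false)?,
            _ => {}
        }
    }
    Ok(())
}

fn main() {
    let ok = vec![Field::Metadata, Field::Value, Field::TypedValue(vec![Field::Value])];
    println!("{:?}", validate_shredded_schema(&ok));
}
```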
// this is directly mapped from the spec's parquet physical types
// note, there are more data types we can support
// but for the sake of simplicity, I chose the smallest subset
match typed_value_field.data_type() {
    DataType::Boolean
    | DataType::Int32
    | DataType::Int64
    | DataType::Float32
    | DataType::Float64
    | DataType::BinaryView => {}
    DataType::Union(union_fields, _) => {
        union_fields
            .iter()
            .map(|(_, f)| f.clone())
            .try_for_each(|f| {
                let DataType::Struct(fields) = f.data_type().clone() else {
                    return Err(ArrowError::InvalidArgumentError(
                        "Expected struct".to_string(),
                    ));
                };

                validate_value_and_typed_value(&fields, false)
            })?;
    }
    foreign => {
        return Err(ArrowError::NotYetImplemented(format!(
            "Unsupported VariantArray 'typed_value' field, got {foreign}"
        )))
    }
}
I don't love this, but I treat the field `DataType`s as the parquet physical types defined in the specification: https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types.

I'm curious to get your thoughts, maybe we should stick with the Variant type mapping?

One reason why the current logic isn't the best is that when we go to reconstruct variants, certain variant types like `int8` will get cast to a `DataType::Int32`. This means when we go to encode the values back to variant, we won't know their original types.
I don't think we need to store (logical) int32 in memory just because parquet physically encodes them that way? When reading an int8 column from normal parquet, doesn't it come back as an int8 PrimitiveArray?
// Create union of different element types
let union_fields = UnionFields::new(
    vec![0, 1],
    vec![
        Field::new("string_element", string_element, true),
        Field::new("int_element", int_element, true),
    ],
);
I don't love this, the field names are weird. However, we need a way to support a heterogeneous list of `Field`s.
I'm curious if there is a nicer way to represent a group.
let typed_value_field = Field::new(
    "typed_value",
    DataType::Union(
        UnionFields::new(
            vec![0, 1],
            vec![
                Field::new("event_type", DataType::Struct(element_group_1), true),
                Field::new("event_ts", DataType::Struct(element_group_2), true),
            ],
        ),
        UnionMode::Sparse,
    ),
    false,
);
Similar to https://github.com/apache/arrow-rs/pull/7921/files#r2203048613, but this is nicer, since we can treat field names as key names.
Didn't actually review yet, but wanted to at least respond to one comment.
}
}

if let Some(typed_value_field) = fields.iter().find(|f| f.name() == TYPED_VALUE) {
Suggested change:

- if let Some(typed_value_field) = fields.iter().find(|f| f.name() == TYPED_VALUE) {
+ if let Some(typed_value_field) = typed_value_field_res {
| DataType::Float32
| DataType::Float64
| DataType::BinaryView => {}
DataType::Union(union_fields, _) => {
My initial reaction was that I don't think variant data can represent a union type?
I guess this is a way to slightly relax strongly typed data, as long as the union members themselves are all valid variant types? And whichever union member is active becomes the only field+value of a variant object? But how would a reader of that shredded data know to read it back as a union, instead of the (sparse) struct it appears to be? Would it be better to just require a struct from the start?
I recommend we start without support for union type and then add it as we implement additional functionality
Thank you @friendlymatthew -- this is looking quite cool

In order to make progress, I would suggest we try and write up a basic 'end to end' test. Specifically, perhaps something like:

- Create a perfectly shredded Arrow array for a `uint64` value
- Implement the `get_variant` API from #7919 / @Samyak2 for that array and show how it will return `Variant::Int64`

And then we can expand those tests for the more exciting shredding variants afterwards.

I think the tests in this PR for just the schemas are a bit too theoretical -- they would be much closer to the end user if they were used for actual StructArrays in tests.
Thank you @friendlymatthew -- I think this is quite close.
Ok(())
}

/// Validates that the provided [`Fields`] conform to the Variant shredding specification.
One thing I thought of while reviewing this PR was maybe we could potentially wrap this into its own structure, like

struct VariantSchema {
    inner: Fields,
}

And then all this validation logic could be part of the constructor:

impl VariantSchema {
    fn try_new(fields: Fields) -> Result<Self> { ... }
    ...
}

The benefits of this would be:

- Now we could be sure that a validated schema was always passed to `shred_variant`
- We would then have a place to put methods on -- such as `VariantSchema::type(path: VariantPath)` for retrieving the type of a particular path, perhaps
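The newtype idea can be sketched with simplified stand-ins (the `Fields` alias and the validation rule below are placeholders, not the arrow-rs types): because all validation lives in `try_new`, merely holding a `VariantSchema` proves the fields were already checked.

```rust
// Simplified stand-in for the arrow-rs Fields type, for illustration only.
type Fields = Vec<String>;

struct VariantSchema {
    inner: Fields,
}

impl VariantSchema {
    // All validation lives in the constructor, so any VariantSchema in
    // circulation is known-valid. The rule here (a "metadata" field must
    // exist) is a placeholder for the real spec checks.
    fn try_new(fields: Fields) -> Result<Self, String> {
        if !fields.iter().any(|f| f == "metadata") {
            return Err("missing required metadata field".to_string());
        }
        Ok(Self { inner: fields })
    }

    fn fields(&self) -> &Fields {
        &self.inner
    }
}

fn main() {
    let schema = VariantSchema::try_new(vec!["metadata".to_string(), "value".to_string()])
        .expect("valid schema");
    println!("{:?}", schema.fields());
}
```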
Couple more comments, but mostly waiting for the PR to address existing comments
pub fn validate_value_and_typed_value(
    fields: &Fields,
    allow_both_null: bool,
When would we allow (or forbid) both fields being null?
My thinking was `value` and `typed_value` can both be null for objects. Per the spec:

A field's value and typed_value are set to null (missing) to indicate that the field does not exist in the variant. To encode a field that is present with a null value, the value must contain a Variant null: basic type 0 (primitive) and physical type 0 (null).
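The three states the spec distinguishes can be modeled with a small stand-in enum (hypothetical types, not the PR's API; the `0x00` byte is the Variant header for a primitive null, basic type 0 / physical type 0): a missing field leaves both columns null, while a present-but-null field must store an encoded Variant null in `value`.

```rust
// Hypothetical model of one shredded field's cell, distinguishing a field
// that is absent from one that is present with a null value.
#[derive(Debug, PartialEq)]
enum ShreddedCell {
    Missing,                // value = null, typed_value = null
    TypedValue(i64),        // shredded into the typed_value column
    VariantBinary(Vec<u8>), // fallback: variant-encoded bytes in `value`
}

fn encode_field(present: bool, shredded: Option<i64>) -> ShreddedCell {
    match (present, shredded) {
        (false, _) => ShreddedCell::Missing,
        (true, Some(v)) => ShreddedCell::TypedValue(v),
        // A present null is NOT Missing: it is stored as a Variant null,
        // basic type 0 (primitive) and physical type 0 (null) -> header 0x00.
        (true, None) => ShreddedCell::VariantBinary(vec![0x00]),
    }
}

fn main() {
    println!("{:?}", encode_field(true, None));
}
```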
Last day at the conference! Will get back on the variant machine tomorrow. @scovich
Liking the newtype for validated schema!
todo!("how does a shredded value look like?");
// ideally here, i would unpack the shredded_field
// and recursively call validate_value_and_typed_value with inside_shredded_object set to true
You already did the validation for leaf values at L110 above; maybe just finish the job there, by recursing on the `DataType::ListView` and `DataType::Struct` match arms?
match self.value_schema {
    ValueSchema::MissingValue => None,
    ValueSchema::ShreddedValue(_) => None,
    ValueSchema::Value(value_idx) => Some(value_idx),
    ValueSchema::PartiallyShredded { value_idx, .. } => Some(value_idx),
Could also consider:
use ValueSchema::*;
match self.value_schema {
    MissingValue | ShreddedValue(_) => None,
    Value(value_idx) | PartiallyShredded { value_idx, .. } => Some(value_idx),
Less redundancy... but I'm not sure it actually improves readability very much?
}

pub fn value(&self) -> Option<&FieldRef> {
    self.value_idx().map(|i| self.inner.get(i).unwrap())
I realize the `unwrap` should be safe, but is there any harm in using `flat_map` instead to eliminate the possibility of panic?

Suggested change:

- self.value_idx().map(|i| self.inner.get(i).unwrap())
+ self.value_idx().flat_map(|i| self.inner.get(i))
Downside is, if the value index were ever incorrect, we would silently fail by returning None instead of panicking. But on the other hand, if the value index were ever incorrect, we're just as likely to silently return the wrong field rather than panic. I'm not sure panicking only some of the time actually helps?
(again below)
match self.value_schema {
    ValueSchema::MissingValue => None,
    ValueSchema::Value(_) => None,
    ValueSchema::ShreddedValue(shredded_idx) => Some(shredded_idx),
    ValueSchema::PartiallyShredded {
        shredded_value_idx, ..
    } => Some(shredded_value_idx),
Similar to above, but I think this one is actually a readability improvement:
use ValueSchema::*;
match self.value_schema {
    MissingValue | Value(_) => None,
    ShreddedValue(shredded_value_idx) | PartiallyShredded { shredded_value_idx, .. } => {
        Some(shredded_value_idx)
    }
Or, maybe it just needs the `use ValueSchema::*` part, to avoid the fmt line breaks on `ShreddedValue`?
use ValueSchema::*;
match self.value_schema {
    MissingValue => None,
    Value(_) => None,
    ShreddedValue(shredded_idx) => Some(shredded_idx),
    PartiallyShredded { shredded_value_idx, .. } => Some(shredded_value_idx),
/// TODO: 1) Add extension type metadata
/// TODO: 2) Add support for shredding
Are these TODOs actually done? I didn't see anything related to 1/ in this PR, and I would think there's (a lot?) more to 2/ than just adding schema support?
Sorry, I plan on at least getting 2 done in this PR
👋 I am just checking in to see how this PR is going. I am not sure if it is correct, but I am thinking this PR is the first part of the "writing shredded variants" story.

In order to drive it forward, I wonder if it might be a good idea to try and pick some simple test cases -- for example, try to write a test that produces the output VariantArray that is manually constructed in:

So perhaps that would mean a test like:

// create a shredded schema that specifies shredding as Int32
let schema = ...; // Not sure??

// make an array builder with that schema
let mut builder = VariantArrayBuilder::new()
    .with_schema(shredded_schema);

// first row is value 34
let row0 = builder.variant_builder();
row0.append_variant(Variant::Int32(34));
row0.finish();

// second row is null
builder.append_null();

// third row is "n/a" (a string)
let row2 = builder.variant_builder();
row2.append_variant(Variant::from("n/a"));
row2.finish();

// fourth row is value 100
let row3 = builder.variant_builder();
row3.append_variant(Variant::Int32(100));
row3.finish();

// complete the array
let array = builder.finish();

// verify that the resulting array is a StructArray with the
// structure specified in https://github.com/apache/arrow-rs/pull/7965
Which issue does this PR close?

My initial PR is getting too large, so I figured it would be better to split these up.

Rationale for this change

This PR updates the `VariantArrayBuilder` to pass in the desired shredded output schema in the constructor. It also contains validation logic that defines what is a valid schema and what is not. In other words, the schema that you define ahead of time gets checked for spec compliance.