-
Notifications
You must be signed in to change notification settings - Fork 979
[Variant] WIP Tests for variant_get of shredded variants #7965
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
FYI @Samyak2 @scovich @friendlymatthew and @klion26 and @carpecodeum as I think you are interested in this feature |
let struct_array = StructArrayBuilder::new() | ||
.with_field("metadata", Arc::new(metadata)) | ||
.with_field("typed_value", Arc::new(typed_value)) | ||
.build(); | ||
|
||
Arc::new( | ||
VariantArray::try_new(Arc::new(struct_array)).expect("should create variant array"), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understood of the spec, the value
field needs to be present but it can have nulls. Did I understand it incorrectly? If not, this struct would be an invalid variant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reading of https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/ was that value
was optional in the case the variant shredded "perfectly"

I may also be misreading it.
Maybe @zeroshade has some time to coment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The shredding spec definitely says both columns are optional:
Both
value
andtyped_value
are optional fields used together to encode a single value.
Additionally:
Values in the two fields must be interpreted according to the following table:
value
typed_value
Meaning null null The value is missing; only valid for shredded object fields non-null null The value is present and may be any type, including null null non-null The value is present and is the shredded type non-null non-null The value is present and is a partially shredded object
As always in parquet, a physically missing column is interpreted as all-null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CC @cashmand in case I totally misunderstood something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, the intent is that a missing value
column should be treated as equivalent to a present-but-all-null value
column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, both are optional fields, but it's required to have at least one of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Thanks all!
@@ -177,4 +192,209 @@ mod test { | |||
r#"{"inner_field": 1234}"#, | |||
); | |||
} | |||
|
|||
/// Shredding: extract a value as a VariantArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR just adds tests here -- to show what I think the output of variant_get
should be for different variant cases.
@@ -146,7 +146,7 @@ impl VariantArrayBuilder { | |||
} | |||
|
|||
/// Append the [`Variant`] to the builder as the next row | |||
pub fn append_variant(&mut self, variant: Variant) { | |||
pub fn append_variant<'m, 'v>(&mut self, variant: impl Into<Variant<'m, 'v>>) { | |||
// TODO make this more efficient by avoiding the intermediate buffers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we implement the approach @scovich suggested about invoking primitive append_xxx methods directly instead of using intermediate VariantBuilder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something like
pub fn append_variant<'m, 'v>(&mut self, variant: impl Into<Variant<'m, 'v>>) {
let variant = variant.into();
match variant {
Variant::Int32(v) => self.append_int32(v),
Variant::String(v) => self.append_string(v),
Variant::Object(obj) => {
// Build object directly into our buffers without intermediate VariantBuilder
self.append_object_start();
for field in obj.iter() {
self.append_variant(field.value()); // recursive
}
self.append_object_end();
}
// ... other variants
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are concerned about the intermediate buffers, I think after the following two PRs are merged:
- Convert JSON to VariantArray without copying (8 - 32% faster) #7911
- [Variant] Avoid extra allocation in object builder #7935
There will be a way to append variant values to the final output buffer directly without copying to intermediate buffers
Is there another potential part of your question?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, will follow these PRs then
assert_eq!(result.len(), 3); | ||
|
||
// Expect the values are the same as the original values | ||
assert_eq!(result.value(0), Variant::Int32(1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the GetOptions
points to the leaf node, will the result
here be equal to the result
in test get_variant_perfectly_shredded_int32_as_int32
Which issue does this PR close?
variant_get
kernel for shredded variants #7941Rationale for this change
In order to make progress on accessing shredded variants, I think we need some tests showing how they are used and constructed. I am hoping some tests will help us along in figuring out how to implement this efficiently
I am hoping someone can use these tests to start implement efficient
variant_get
for shredded valuesWhat changes are included in this PR?
Implement
varaint_get
tests based on the examples in the arrow variant doc from @zeroshade (see apache/arrow#46908)Are these changes tested?
They are only tests
Are there any user-facing changes?
No