[Variant] Avoid extra allocation in object builder #7935
Conversation
@alamb Please help to review this when you have time, thanks.
parquet-variant/src/builder.rs (Outdated)

-            (state, self.validate_unique_fields)
+            let validate_unique_fields = self.validate_unique_fields;
+
+            match &mut self.parent_state {
I didn't find a better solution for this. I can change it if there is a better one.
I think we can use the pattern in this PR: klion26#1
Fixed, it becomes cleaner now.
Force-pushed 91cfb73 to 1f2bcc3
Thank you for this PR @klion26 -- I plan to review it carefully tomorrow
This commit reuses the parent buffer for the object builder, avoiding the extra allocation for the object and the subsequent buffer copy.
Force-pushed 1f2bcc3 to 6096566
parquet-variant/src/builder.rs (Outdated)

@@ -1064,20 +1084,58 @@ impl<'a> ObjectBuilder<'a> {
         key: &str,
         value: T,
     ) -> Result<(), ArrowError> {
-        // Get metadata_builder from parent state
-        let metadata_builder = self.parent_state.metadata_builder();
+        match &mut self.parent_state {
same as below
Here is a proposal of how to avoid the duplication:
Fixed
@alamb thank you! I've rebased on the main branch and hardened the nested object test.
Force-pushed 6096566 to 442c935
Thank you @klion26 -- this looks quite cool.
I think we should try to avoid the duplication when possible -- I left a suggestion on how to do that and keep the compiler happy.
Likewise, I think it would be nice to add a few more tests. Let me know if that makes sense.
    assert_eq!(inner_inner_object_d.len(), 1);
    assert_eq!(inner_inner_object_d.field_name(0).unwrap(), "cc");
    assert_eq!(inner_inner_object_d.field(0).unwrap(), Variant::from("dd"));

    assert_eq!(outer_object.field_name(1).unwrap(), "b");
    assert_eq!(outer_object.field(1).unwrap(), Variant::from(true));
}
I think we should also add tests for the rollback behavior (as in starting an ObjectBuilder but not calling finish)
Similarly, we should test a list builder rollback too.
Okay, I will add more tests for this. Currently, the list-inside-object case is only covered by the from_json test; adding a unit test for it in builder.rs is better, so I'll add that.
Not sure whether tests like test_xx_no_finish() (such as test_object_builder_to_list_builder_inner_no_finish()) are enough to cover the rollback logic? We've called drop in the test_xx_no_finish() tests; do we need to call drop in the tests covering the rollback logic as well?
After reviewing the tests, I agree that the existing tests are sufficient
arrow-rs/parquet-variant/src/builder.rs
Lines 2837 to 2918 in 0fecaa0
fn test_object_builder_to_list_builder_outer_no_finish() {
    let mut builder = VariantBuilder::new();
    let mut object_builder = builder.new_object();
    object_builder.insert("first", 1i8);
    // Create a nested list builder and finish it
    let mut nested_list_builder = object_builder.new_list("nested");
    nested_list_builder.append_value("hi");
    nested_list_builder.finish();
    // Drop the outer object builder without finishing it
    drop(object_builder);
    builder.append_value(2i8);
    // Only the second attempt should appear in the final variant
    let (metadata, value) = builder.finish();
    let metadata = VariantMetadata::try_new(&metadata).unwrap();
    assert_eq!(metadata.len(), 2);
    assert_eq!(&metadata[0], "first");
    assert_eq!(&metadata[1], "nested"); // not rolled back
    let variant = Variant::try_new_with_metadata(metadata, &value).unwrap();
    assert_eq!(variant, Variant::Int8(2));
}

#[test]
fn test_object_builder_to_object_builder_inner_no_finish() {
    let mut builder = VariantBuilder::new();
    let mut object_builder = builder.new_object();
    object_builder.insert("first", 1i8);
    // Create a nested object builder but never finish it
    let mut nested_object_builder = object_builder.new_object("nested");
    nested_object_builder.insert("name", "unknown");
    drop(nested_object_builder);
    object_builder.insert("second", 2i8);
    // The parent object should only contain the original fields
    object_builder.finish().unwrap();
    let (metadata, value) = builder.finish();
    let metadata = VariantMetadata::try_new(&metadata).unwrap();
    assert_eq!(metadata.len(), 3);
    assert_eq!(&metadata[0], "first");
    assert_eq!(&metadata[1], "name"); // not rolled back
    assert_eq!(&metadata[2], "second");
    let variant = Variant::try_new_with_metadata(metadata, &value).unwrap();
    let obj = variant.as_object().unwrap();
    assert_eq!(obj.len(), 2);
    assert_eq!(obj.get("first"), Some(Variant::Int8(1)));
    assert_eq!(obj.get("second"), Some(Variant::Int8(2)));
}

#[test]
fn test_object_builder_to_object_builder_outer_no_finish() {
    let mut builder = VariantBuilder::new();
    let mut object_builder = builder.new_object();
    object_builder.insert("first", 1i8);
    // Create a nested object builder and finish it
    let mut nested_object_builder = object_builder.new_object("nested");
    nested_object_builder.insert("name", "unknown");
    nested_object_builder.finish().unwrap();
    // Drop the outer object builder without finishing it
    drop(object_builder);
    builder.append_value(2i8);
    // Only the second attempt should appear in the final variant
    let (metadata, value) = builder.finish();
    let metadata = VariantMetadata::try_new(&metadata).unwrap();
    assert_eq!(metadata.len(), 3);
    assert_eq!(&metadata[0], "first"); // not rolled back
    assert_eq!(&metadata[1], "name"); // not rolled back
    assert_eq!(&metadata[2], "nested"); // not rolled back
    let variant = Variant::try_new_with_metadata(metadata, &value).unwrap();
    assert_eq!(variant, Variant::Int8(2));
}
I also verified test coverage with cargo llvm-cov --html test -p parquet-variant.
parquet-variant/src/builder.rs (Outdated)

@@ -999,8 +1000,17 @@ impl<'a> ListBuilder<'a> {
         let offset_size = int_size(data_size);
 
+        // Get parent's buffer
+        let offset_shift = match &self.parent_state {
It might be nice if this was a function in ParentState, something like
-    let offset_shift = match &self.parent_state {
+    let offset_shift = self.parent_state.object_start_offset();
Fixed
parquet-variant/src/builder.rs (Outdated)

    let start_offset = match &parent_state {
        ParentState::Variant { buffer, .. } => buffer.offset(),
        ParentState::List { buffer, .. } => buffer.offset(),
        ParentState::Object { buffer, .. } => buffer.offset(),
    };
-    let start_offset = match &parent_state {
-        ParentState::Variant { buffer, .. } => buffer.offset(),
-        ParentState::List { buffer, .. } => buffer.offset(),
-        ParentState::Object { buffer, .. } => buffer.offset(),
-    };
+    let start_offset = parent_state.buffer().offset();
Tried this way, but it requires making parent_state mutable. Added a function to retrieve the current offset of the buffer instead.
parquet-variant/src/builder.rs
Outdated
(state, self.validate_unique_fields) | ||
let validate_unique_fields = self.validate_unique_fields; | ||
|
||
match &mut self.parent_state { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can use the pattern in this pR: klion26#1
parquet-variant/src/builder.rs (Outdated)

    let data_size = self.buffer.offset();
    let num_fields = self.fields.len();
    let is_large = num_fields > u8::MAX as usize;
    let metadata_builder = match &self.parent_state {
it would be nice to put this into a method as well rather than an inline match statement
fixed
    let starting_offset = self.object_start_offset;

    // Shift existing data to make room for the header
As a follow-on PR we can consider avoiding this extra splice somehow (by preallocating the size or something). Future work, though.
Filed an issue (#7960) to track this.
parquet-variant/src/builder.rs (Outdated)

    buffer[header_pos..header_pos + offset_size as usize]
        .copy_from_slice(&data_size_bytes[..offset_size as usize]);

    let start_offset_shift = match &self.parent_state {
Here is another place we could use the method and avoid an inline match.
fixed
@alamb Thanks for the detailed review, will address them soon.
Force-pushed e24d8e3 to deb0782
Force-pushed deb0782 to f5b0465
🤖: Benchmark completed
This is really nice @klion26 -- thank you very much for the contribution.
It is neat to see that it already makes conversion from JSON 8% faster in some cases (see the benchmark in #7935 (comment)):
batch_json_string_to_variant random_json(2633 bytes per document) 1.00 364.7±8.44ms ? ?/sec 1.08 392.4±6.39ms ? ?/sec
FYI @scovich @viirya and @friendlymatthew
parquet-variant/src/builder.rs (Outdated)

    // Shift existing data to make room for the header
    let buffer = parent_buffer.inner_mut();
    buffer.splice(starting_offset..starting_offset, vec![0u8; header_size]);
Can avoid even this allocation:
-    buffer.splice(starting_offset..starting_offset, vec![0u8; header_size]);
+    buffer.splice(starting_offset..starting_offset, std::iter::repeat_n(0u8, header_size));
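For reference, a minimal runnable sketch of why repeat_n helps here (the buffer contents and offsets below are made up for illustration; repeat_n requires Rust 1.82+). Vec::splice accepts any IntoIterator, and repeat_n is an ExactSizeIterator, so splice knows the insertion length up front and shifts the tail once without a temporary Vec:

```rust
// Hypothetical standalone demo of splicing header-sized zeros into a buffer,
// in the style suggested above.
fn splice_zeros(mut buffer: Vec<u8>, starting_offset: usize, header_size: usize) -> Vec<u8> {
    // repeat_n reports an exact size, so no temporary Vec is allocated.
    buffer.splice(
        starting_offset..starting_offset,
        std::iter::repeat_n(0u8, header_size),
    );
    buffer
}

fn main() {
    // Make room for a 3-byte header at offset 1.
    assert_eq!(splice_zeros(vec![1, 2, 3, 4], 1, 3), [1, 0, 0, 0, 2, 3, 4]);
}
```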
Alternatively -- that small allocation apparently isn't hurting performance. What if we create the temp vec with an initial capacity of header_size and then populate it with the actual header info, field ids, and offsets before splicing it in? That way, an incorrect header_size calculation would not impact correctness:
let mut bytes_to_splice = Vec::with_capacity(header_size);
bytes_to_splice.push(header_byte);
if is_large {
bytes_to_splice.extend((num_fields as u32).to_le_bytes());
} else {
bytes_to_splice.push(num_fields as u8);
}
for &field_id in self.fields.keys() {
bytes_to_splice.extend(field_id.to_le_bytes().into_iter().take(id_size));
}
for &offset in self.fields.values() {
bytes_to_splice.extend((offset as u32).to_le_bytes().into_iter().take(offset_size));
}
bytes_to_splice.extend((data_size as u32).to_le_bytes().into_iter().take(offset_size));
buffer.splice(starting_offset..starting_offset, bytes_to_splice);
Aside: I'd be very impressed if the compiler is smart enough to optimize away the allocation in vec![0u8; header_size].into_iter(), which would potentially cause the above to run slower than the current code.
FYI, I think we can improve the performance of appending packed u32 values to a Vec<u8>. The following code:
pub fn append_bytes_brute2(dest: &mut Vec<u8>, src: u32, nbytes: NumBytes) {
let n = dest.len() + nbytes as usize;
dest.extend(src.to_le_bytes());
dest.truncate(n);
}
... unconditionally adds all four bytes and then truncates the vector to the desired length. This works because (a) it's a single machine instruction to copy the four LE bytes of a u32 to a memory location; (b) truncate drops the higher-order bytes when working with LE bytes; and (c) truncate is dirt cheap (a conditional move that doesn't break processor pipelines the way a branch would).
Result: code that should perform the same regardless of the number of bytes we encode the u32 as, with only a single branch to check for vector capacity.
The above compiles to the following assembly code:
playground::append_packed_u32:
;; set up stack frame (not relevant in practice due to inlining)
...
movq (%rdi), %rcx ; dest capacity
movq 16(%rdi), %rbx ; dest len
subq %rbx, %rcx ; calculate available capacity
movq %rbx, %rax
cmpq $3, %rcx ; if insufficient capacity...
jbe .LBB2_1 ; ... then `reserve` more
.LBB2_2:
addq %rdx, %rbx ; add nbytes to len (truncate arg)
movq 8(%rdi), %rcx ; dest data pointer
movl %esi, (%rcx,%rax) ; append all four bytes of src to the vector
addq $4, %rax ; increase len by 4 (u32 size)
cmpq %rax, %rbx ; if new len is bigger than truncate arg...
cmovbq %rbx, %rax ; ... then truncate len
movq %rax, 16(%rdi) ; write back the updated dest len
;; tear down stack frame
...
retq
.LBB2_1:
;; call prologue
...
callq alloc::raw_vec::RawVecInner<A>::reserve::do_reserve_and_handle
;; call epilogue
...
jmp .LBB2_2
The one bummer is, when adding a reserve followed by a loop over a slice of source bytes:
pub fn append_packed_u32(dest: &mut Vec<u8>, src: &[u32], nbytes: usize) {
dest.reserve((src.len() + 1) * nbytes);
for val in src {
let n = dest.len() + nbytes;
dest.extend(val.to_le_bytes());
dest.truncate(n);
}
}
... the compiler isn't smart enough to eliminate the redundant capacity check inside the loop. Oh, well.
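The extend-then-truncate trick above can be sketched as a self-contained, runnable example (the function name is hypothetical, and it assumes nbytes <= 4):

```rust
// Append the low `nbytes` LE bytes of `src` by unconditionally writing all
// four bytes and then truncating, as described above. Assumes nbytes <= 4.
fn append_packed_u32(dest: &mut Vec<u8>, src: u32, nbytes: usize) {
    let n = dest.len() + nbytes;
    dest.extend(src.to_le_bytes()); // one 4-byte store
    dest.truncate(n); // drop the unwanted high-order bytes
}

fn main() {
    let mut dest = Vec::new();
    append_packed_u32(&mut dest, 0x0001_0203, 2);
    assert_eq!(dest, [0x03, 0x02]); // only the low two LE bytes remain
    append_packed_u32(&mut dest, 0x11, 1);
    assert_eq!(dest, [0x03, 0x02, 0x11]);
}
```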
@@ -1317,7 +1414,15 @@ impl<'a> ObjectBuilder<'a> {
 /// This is to ensure that the object is always finalized before its parent builder
 /// is finalized.
 impl Drop for ObjectBuilder<'_> {
-    fn drop(&mut self) {}
+    fn drop(&mut self) {
❤️
Initial review, a bit scattered but hopefully helpful.
Impressive that it's already so much faster than the baseline, potentially with room to improve even further!
parquet-variant/src/builder.rs (Outdated)

    // Returns the starting offset of the parent's buffer if the parent is an
    // object builder, else 0. Object builders reuse the parent's buffer, so
    // `finish` needs the offset of the current variant relative to that buffer.
    fn object_start_offset(&self) -> usize {
Doesn't variant array also have a starting offset?
This feature (reusing the parent buffer) hasn't been implemented for ListBuilder yet. Will do it in a follow-up PR.
parquet-variant/src/builder.rs (Outdated)

    ParentState::Variant {
        buffer,
        metadata_builder,
    } => (buffer, metadata_builder),
    ParentState::List {
        buffer,
        metadata_builder,
        ..
    } => (buffer, metadata_builder),
    ParentState::Object {
        buffer,
        metadata_builder,
        ..
    } => (buffer, metadata_builder),
I think Rust allows this:
-    ParentState::Variant {
-        buffer,
-        metadata_builder,
-    } => (buffer, metadata_builder),
-    ParentState::List {
-        buffer,
-        metadata_builder,
-        ..
-    } => (buffer, metadata_builder),
-    ParentState::Object {
-        buffer,
-        metadata_builder,
-        ..
-    } => (buffer, metadata_builder),
+    ParentState::Variant {
+        buffer,
+        metadata_builder,
+    }
+    | ParentState::List {
+        buffer,
+        metadata_builder,
+        ..
+    }
+    | ParentState::Object {
+        buffer,
+        metadata_builder,
+        ..
+    } => (buffer, metadata_builder),
Not clear whether that's better or worse than the current code. I'm also not sure how rustfmt would format it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
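For reference, the or-pattern style discussed in this thread can be sketched with a toy stand-in enum (this is not the real ParentState, just an illustration of the pattern):

```rust
// Toy stand-in for ParentState, only to demonstrate or-patterns in match arms.
enum ParentState {
    Variant { buffer: Vec<u8> },
    List { buffer: Vec<u8>, index: usize },
    Object { buffer: Vec<u8>, key: String },
}

impl ParentState {
    // One arm covers all variants; every alternative in an or-pattern
    // must bind the same names with the same types.
    fn buffer(&self) -> &[u8] {
        match self {
            ParentState::Variant { buffer }
            | ParentState::List { buffer, .. }
            | ParentState::Object { buffer, .. } => buffer,
        }
    }
}

fn main() {
    let state = ParentState::List { buffer: vec![1, 2, 3], index: 0 };
    assert_eq!(state.buffer(), &[1, 2, 3]);
}
```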
parquet-variant/src/builder.rs (Outdated)

    match self {
        ParentState::Variant { buffer, .. } => buffer.offset(),
        ParentState::Object { buffer, .. } => buffer.offset(),
        ParentState::List { buffer, .. } => buffer.offset(),
    }
As above, can reduce redundancy by
-    match self {
-        ParentState::Variant { buffer, .. } => buffer.offset(),
-        ParentState::Object { buffer, .. } => buffer.offset(),
-        ParentState::List { buffer, .. } => buffer.offset(),
-    }
+    match self {
+        ParentState::Variant { buffer, .. }
+        | ParentState::Object { buffer, .. }
+        | ParentState::List { buffer, .. } => buffer.offset(),
+    }
Actually... we can just invoke ParentState::buffer?

-    match self {
-        ParentState::Variant { buffer, .. } => buffer.offset(),
-        ParentState::Object { buffer, .. } => buffer.offset(),
-        ParentState::List { buffer, .. } => buffer.offset(),
-    }
+    self.buffer().offset()
That would need the signature changed to buffer_current_offset(&mut self); not sure that's OK, so I changed it to the first version.
parquet-variant/src/builder.rs
Outdated
|
||
// Shift existing data to make room for the header | ||
let buffer = parent_buffer.inner_mut(); | ||
buffer.splice(starting_offset..starting_offset, vec![0u8; header_size]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can avoid even this allocation:
buffer.splice(starting_offset..starting_offset, vec![0u8; header_size]); | |
buffer.splice(starting_offset..starting_offset, std::iter::repeat_n(0u8, header_size)); |
    let header_size = 1 + // header byte
        (if is_large { 4 } else { 1 }) + // num_fields
        (num_fields * id_size as usize) + // field IDs
        ((num_fields + 1) * offset_size as usize); // field offsets + data_size
One thing I don't love about this approach is the independent calculation of the splice size vs. the bytes that actually get inserted. If they ever disagreed... badness, as the insertions underflow or overflow the splice.
Ideally, we could produce an iterator that emits the desired bytes, and the splice itself could guarantee correct behavior. But for that to work the iterator would need to provide an accurate lower bound, which rules out std::iter::from_fn. Even if we did craft a custom iterator, computing its lower bound would basically be this same calculation all over again. We could also chain together a bunch of iterators, which preserves the lower bound, but somehow I doubt that would be efficient.
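The from_fn point above is easy to check: the FromFn iterator does not override size_hint, so it always reports the maximally vague (0, None) regardless of how many items the closure will actually yield, giving splice nothing to pre-reserve with.

```rust
fn main() {
    // A from_fn iterator that will in fact yield exactly 4 bytes...
    let mut n = 0;
    let it = std::iter::from_fn(move || {
        n += 1;
        (n <= 4).then_some(0u8)
    });
    // ...still advertises no useful bounds to the consumer.
    assert_eq!(it.size_hint(), (0, None));
}
```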
Something like this might almost work:
let field_ids = self.fields.keys().flat_map(|field_id| {
(field_id as usize).to_le_bytes().into_iter().take(id_size)
});
let offsets = self.fields.values().flat_map(|offset| {
offset.to_le_bytes().into_iter().take(offset_size)
});
let num_fields = num_fields.to_le_bytes().into_iter().take(if is_large { 4 } else { 1 });
let header_and_num_fields = std::iter::once(header_byte).chain(num_fields);
let field_ids_and_offsets = field_ids.chain(offsets);
let bytes_to_splice = header_and_num_fields.chain(field_ids_and_offsets);
buffer.splice(starting_offset..starting_offset, bytes_to_splice);
... but unfortunately Iterator::flat_map does not (honestly, cannot) compute an accurate lower bound size hint, so the splice call would end up having to allocate an internal temp buffer.
I just realized -- a custom iterator that computes its size hint is still safer than the current code, because an incorrect size hint won't corrupt any data; it will just require extra allocations and/or byte shifting. There's still the question of performance, though. If it's drastically slower, the extra safety probably wouldn't be worth it.
Another possibility is to take a hybrid approach -- rely on Iterator::chain for most of the heavy lifting, but define a custom iterator for emitting offset/field_id arrays:
let num_fields = num_fields.to_le_bytes().into_iter().take(if is_large { 4 } else { 1 });
let header_and_num_fields = std::iter::once(header_byte).chain(num_fields);
let field_ids = PackedU32Iterator::new(id_size, self.fields.keys().map(|id| id.to_le_bytes()));
let offsets = PackedU32Iterator::new(offset_size, self.fields.values().map(|offset| (*offset as u32).to_le_bytes()));
let field_ids_and_offsets = field_ids.chain(offsets);
let bytes_to_splice = header_and_num_fields.chain(field_ids_and_offsets);
buffer.splice(starting_offset..starting_offset, bytes_to_splice);
PackedU32Iterator
struct PackedU32Iterator<T: Iterator<Item = [u8; 4]>> {
packed_bytes: usize,
iterator: T,
current_item: [u8; 4],
current_byte: usize, // 0..=packed_bytes
}
impl<T: Iterator<Item = [u8; 4]>> PackedU32Iterator<T> {
fn new(packed_bytes: usize, iterator: T) -> Self {
// eliminate corner cases in `next` by initializing with a fake already-consumed "first" item
Self {
packed_bytes,
iterator,
current_item: [0; 4],
current_byte: packed_bytes,
}
}
}
impl<T: Iterator<Item = [u8; 4]>> Iterator for PackedU32Iterator<T> {
type Item = u8;
fn size_hint(&self) -> (usize, Option<usize>) {
let lower = (self.packed_bytes - self.current_byte)
+ self.packed_bytes * self.iterator.size_hint().0;
(lower, None)
}
fn next(&mut self) -> Option<u8> {
if self.current_byte >= self.packed_bytes {
let Some(next_item) = self.iterator.next() else {
return None;
};
self.current_item = next_item;
self.current_byte = 0;
}
let rval = self.current_item[self.current_byte];
self.current_byte += 1;
Some(rval)
}
}
... but again, I worry that wrapping up all those for-loops inside an iterator for direct splicing will turn out to be a lot slower than the splice-then-overwrite approach the code currently takes.
Do you think it's ok to make this optimization a follow-up pr and add some benchmark for it?
Or try the other approach that populates the tmp vec with non-zero bytes before splicing it into the main buffer? That's a lot simpler than this iterator-based suggestion, and likely performs better too.
ok, will do it in a followup
I also think we can mitigate the danger of independent size calculation with testing.
I like the idea of going with the approach in this PR (which is already pretty large)
parquet-variant/src/builder.rs
Outdated
let ids = self.fields.keys().map(|id| *id as usize); | ||
parent_buffer.append_offset_array(ids, None, id_size); | ||
// Write field IDs | ||
for (&field_id, _) in &self.fields { |
for (&field_id, _) in &self.fields { | |
for field_id in self.fields.keys() { |
parquet-variant/src/builder.rs
Outdated
|
||
// Shift existing data to make room for the header | ||
let buffer = parent_buffer.inner_mut(); | ||
buffer.splice(starting_offset..starting_offset, vec![0u8; header_size]); |
Alternatively -- that small allocation apparently isn't hurting performance. What if we create the temp vec with initial capacity of header_size
and then populate it with the actual header info, field ids, and offsets before splicing it in. That way, an incorrect header_size
calculation would not impact correctness:
let mut bytes_to_splice = Vec::with_capacity(header_size);
bytes_to_splice.push(header_byte);
if is_large {
bytes_to_splice.extend((num_fields as u32).to_le_bytes());
} else {
bytes_to_splice.push(num_fields as u8);
}
for &field_id in self.fields.keys() {
bytes_to_splice.extend(field_id.to_le_bytes().into_iter().take(id_size));
}
for &offset in self.fields.values() {
bytes_to_splice.extend((offset as u32).to_le_bytes().into_iter().take(offset_size));
}
bytes_to_splice.extend((data_size as u32).to_le_bytes().into_iter().take(offset_size));
buffer.splice(starting_offset..starting_offset, bytes_to_splice);
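In miniature, splicing a pre-populated temp vec looks like this (the buffer contents and header bytes are made up for illustration):

```rust
fn main() {
    // The parent buffer already holds earlier data plus the object's field
    // values, which were appended directly as fields were inserted.
    let mut buffer: Vec<u8> = vec![0xAA, 0xBB]; // earlier sibling data
    let starting_offset = buffer.len();
    buffer.extend_from_slice(&[10, 20, 30]); // this object's value bytes

    // Build the fully populated header in a small temp vec...
    let header_byte = 0x02u8; // illustrative header byte
    let num_fields = 1u8;
    let mut bytes_to_splice = Vec::with_capacity(2);
    bytes_to_splice.push(header_byte);
    bytes_to_splice.push(num_fields);

    // ...and insert it in front of the already-written values in one shot.
    // A wrong capacity estimate would only cost a reallocation, not corruption.
    buffer.splice(starting_offset..starting_offset, bytes_to_splice);

    assert_eq!(buffer, vec![0xAA, 0xBB, 0x02, 0x01, 10, 20, 30]);
}
```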
parquet-variant/src/builder.rs
Outdated
let start_offset_shift = self.parent_state.object_start_offset(); | ||
self.parent_state | ||
.finish(starting_offset - start_offset_shift); |
Why make the caller of ParentState::finish
extract the start offset? Seems like parent state can do that more easily and reliably on its own?
Yes, moving this logic to ParentState::finish
is clearer. fixed.
parquet-variant/src/builder.rs
Outdated
if !self.has_been_finished { | ||
self.parent_state | ||
.buffer() | ||
.inner_mut() | ||
.truncate(self.object_start_offset); | ||
} |
While we're at it, we should truncate the MetadataBuilder
back to the size it had when we started, to clean up any new field ids the failed builder might have created.
It seems that in the previous PR #7865, some tests such as test_list_builder_to_object_builder_inner_no_finish expect that the metadata has not been rolled back. Do we need to roll back the MetadataBuilder here?
// The parent list should only contain the original values
list_builder.finish();
let (metadata, value) = builder.finish();
let metadata = VariantMetadata::try_new(&metadata).unwrap();
assert_eq!(metadata.len(), 1);
assert_eq!(&metadata[0], "name"); // not rolled back
I updated this into a separate commit, we can keep it or revert it easily.
IIUC, the Metadata
can be rolled back because the object has not been written successfully, but not sure if there are any cases I did not follow here.
Yes, existing tests expect the fields to persist because the original code was unable to roll back the changes. With this PR, we should be able to roll back the changes cleanly, and update the unit tests accordingly.
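A minimal sketch of that rollback-on-drop idea, with illustrative names (not the actual parquet-variant types):

```rust
// Simplified sketch of rollback-on-drop. The builder remembers how long the
// shared buffers were when it was created, and truncates back to those
// lengths unless `finish` ran.
struct ObjectBuilder<'a> {
    value_buffer: &'a mut Vec<u8>,
    metadata: &'a mut Vec<String>, // stand-in for the metadata dictionary
    value_offset_base: usize,
    metadata_offset_base: usize,
    has_been_finished: bool,
}

impl<'a> ObjectBuilder<'a> {
    fn new(value_buffer: &'a mut Vec<u8>, metadata: &'a mut Vec<String>) -> Self {
        let value_offset_base = value_buffer.len();
        let metadata_offset_base = metadata.len();
        Self {
            value_buffer,
            metadata,
            value_offset_base,
            metadata_offset_base,
            has_been_finished: false,
        }
    }

    fn insert(&mut self, field_name: &str, byte: u8) {
        // Writes go straight to the shared buffers -- no private allocation.
        self.metadata.push(field_name.to_string());
        self.value_buffer.push(byte);
    }

    fn finish(mut self) {
        self.has_been_finished = true; // keep the writes when dropped
    }
}

impl Drop for ObjectBuilder<'_> {
    fn drop(&mut self) {
        if !self.has_been_finished {
            // Roll back both the value bytes and the new dictionary entries.
            self.value_buffer.truncate(self.value_offset_base);
            self.metadata.truncate(self.metadata_offset_base);
        }
    }
}

fn main() {
    let (mut buf, mut meta) = (vec![1u8], vec!["name".to_string()]);
    {
        let mut ob = ObjectBuilder::new(&mut buf, &mut meta);
        ob.insert("age", 42);
        // dropped without finish(): everything it wrote is rolled back
    }
    assert_eq!(buf, vec![1]);
    assert_eq!(meta, vec!["name".to_string()]);

    {
        let mut ob = ObjectBuilder::new(&mut buf, &mut meta);
        ob.insert("age", 42);
        ob.finish(); // finished: the writes are kept
    }
    assert_eq!(buf, vec![1, 42]);
    assert_eq!(meta, vec!["name".to_string(), "age".to_string()]);
}
```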
parquet-variant/src/builder.rs
Outdated
|
||
// Shift existing data to make room for the header | ||
let buffer = parent_buffer.inner_mut(); | ||
buffer.splice(starting_offset..starting_offset, vec![0u8; header_size]); |
aside: I'd be very impressed if the compiler is smart enough to optimize away the allocation in vec![0u8; header_size].into_iter()
, which would potentially cause the above to run slower than the current code.
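For contrast, the PR's current shift-then-overwrite approach in miniature (values are made up): splice zero placeholders to shift the data, then write the real header into the hole:

```rust
fn main() {
    let mut buffer: Vec<u8> = vec![0xAA]; // earlier data in the parent buffer
    let starting_offset = buffer.len();
    buffer.extend_from_slice(&[10, 20]); // object values, written first

    // Shift the values right by splicing in zero placeholders...
    let header_size = 2;
    buffer.splice(starting_offset..starting_offset, vec![0u8; header_size]);

    // ...then overwrite the placeholders with the real header bytes in place.
    buffer[starting_offset] = 0x02; // illustrative header byte
    buffer[starting_offset + 1] = 0x01; // illustrative num_fields

    assert_eq!(buffer, vec![0xAA, 0x02, 0x01, 10, 20]);
}
```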
0fecaa0
to
c4d26db
Compare
@scovich Thank you very much for the detailed review and suggestion, I've addressed most of them (add comments for the rest), please take another look when you're free.
parquet-variant/src/builder.rs
Outdated
// returns the beginning offset of buffer for the parent if it is object builder, else 0. | ||
// for object builder will reuse the buffer from the parent, this is needed for `finish` | ||
// which needs the relative offset from the current variant. | ||
fn object_start_offset(&self) -> usize { |
This feature that reuses the parent buffer hasn't been implemented for ListBuilder yet. Will do it in a follow-up PR.
parquet-variant/src/builder.rs
Outdated
match self { | ||
ParentState::Variant { buffer, .. } => buffer.offset(), | ||
ParentState::Object { buffer, .. } => buffer.offset(), | ||
ParentState::List { buffer, .. } => buffer.offset(), | ||
} |
This needs the signature changed to buffer_current_offset(&mut self)
, not sure if this is ok?
changed to the first version
parquet-variant/src/builder.rs
Outdated
ParentState::Variant { | ||
buffer, | ||
metadata_builder, | ||
} => (buffer, metadata_builder), | ||
ParentState::List { | ||
buffer, | ||
metadata_builder, | ||
.. | ||
} => (buffer, metadata_builder), | ||
ParentState::Object { | ||
buffer, | ||
metadata_builder, | ||
.. | ||
} => (buffer, metadata_builder), |
addressed
}); | ||
|
||
let max_id = self.fields.iter().map(|(i, _)| *i).max().unwrap_or(0); |
Using the length of the metadata builder doesn't cause a correctness problem, but it can waste some space in the header. Added in a separate commit.
let header_size = 1 + // header byte | ||
(if is_large { 4 } else { 1 }) + // num_fields | ||
(num_fields * id_size as usize) + // field IDs | ||
((num_fields + 1) * offset_size as usize); // field offsets + data_size |
Do you think it's ok to make this optimization a follow-up pr and add some benchmark for it?
@alamb thank you very much for the review and help, happy to see these improvements. |
c4d26db
to
bdf1f2d
Compare
parquet-variant/src/builder.rs
Outdated
// as object builder has been reused the parent buffer, | ||
// we need to shift the offset by the starting offset of the parent object |
I don't get why the comment is put here; it seems ListBuilder.finish hasn't been updated.
Good catch: if we create a ListBuilder inside an ObjectBuilder, then we need to do this shift, but after the last change the comment here needs to be moved to ParentState::finish().
parquet-variant/src/builder.rs
Outdated
@@ -1028,18 +1078,29 @@ impl Drop for ListBuilder<'_> { | |||
pub struct ObjectBuilder<'a> { | |||
parent_state: ParentState<'a>, | |||
fields: IndexMap<u32, usize>, // (field_id, offset) | |||
buffer: ValueBuffer, | |||
/// the starting offset in the parent's buffer where this object starts |
Style suggestion, it'd be better just following existing doc style.
/// the starting offset in the parent's buffer where this object starts | |
/// The starting offset in the parent's buffer where this object starts |
Similar style issues on other comment/doc.
Fixed
parquet-variant/src/builder.rs
Outdated
if self.validate_unique_fields && !self.duplicate_fields.is_empty() { | ||
let metadata_builder = self.parent_state.metadata_builder(); |
Hm? Why move metadata_builder
inside?
This was intended to avoid the double mutable reference problem, but it's not a problem after the implementation has changed. I can revert this if needed.
I reverted this change in 19bb544 to keep the diff cleaner
parquet-variant/src/builder.rs
Outdated
@@ -506,6 +506,7 @@ enum ParentState<'a> { | |||
metadata_builder: &'a mut MetadataBuilder, | |||
fields: &'a mut IndexMap<u32, usize>, | |||
field_name: &'a str, | |||
object_start_offset: usize, |
object_start_offset: usize, | |
parent_offset_base: usize, |
Suggested a more understandable name (at least to me).
Fixed; object_start_offset was meant to indicate the offset where the current object starts.
parquet-variant/src/builder.rs
Outdated
@@ -1028,18 +1078,29 @@ impl Drop for ListBuilder<'_> { | |||
pub struct ObjectBuilder<'a> { | |||
parent_state: ParentState<'a>, | |||
fields: IndexMap<u32, usize>, // (field_id, offset) | |||
buffer: ValueBuffer, | |||
/// the starting offset in the parent's buffer where this object starts | |||
object_start_offset: usize, |
object_start_offset: usize, | |
parent_offset_base: usize, |
Fixed
parquet-variant/src/builder.rs
Outdated
object_start_offset: usize, | ||
/// the starting offset in the parent's metadata buffer where this object starts | ||
/// used to truncate the written fields in `drop` if the current object has not been finished | ||
object_meta_start_offset: usize, |
object_meta_start_offset: usize, | |
parent_metadata_offset_base: usize, |
Fixed
parquet-variant/src/builder.rs
Outdated
// current object starts from `object_start_offset` | ||
let data_size = current_offset - self.object_start_offset; |
Hmm, so we assume that no other object is appended at the same time? What if we create two object builders and they insert into the parent buffer in an interleaved manner? Wouldn't the base offset (object start offset) be the same for both?
Oh, I forgot that the parent state has exclusive access to the parent buffer.
parquet-variant/src/builder.rs
Outdated
// the length of the metadata's field names is a very cheap to compute the upper bound. | ||
// it will almost always be a tight upper bound as well -- it would take a pretty | ||
// carefully crafted object to use only the early field ids of a large dictionary. | ||
let max_id = metadata_builder.field_names.len(); |
Hmm, honestly I didn't get what this comment is trying to say. And is the max_id exactly the same as the length of field_names?
IIUC, metadata_builder.field_names.len() returns the size of the underlying map, which may be bigger than the actual max field_id; the point of using it here is that it avoids one pass over the field_ids to calculate the max_id.
But this change can make tests more difficult (it is harder to calculate the header size) -- such as the failed case from_json::test::test_json_to_variant_object_very_large (test code here/failed ci).
@alamb @viirya @scovich Maybe we can revert this to the original code (one pass over field_ids); what do you think? thanks. -- added a separate commit which reverted this logic to see if any other tests failed.
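For reference, the encoded width depends only on which power-of-256 bucket the chosen max_id falls in, so an overestimate like field_names.len() is safe but can occasionally pick a wider encoding than strictly needed. An illustrative helper (not the crate's actual function):

```rust
// Illustrative helper: the number of little-endian bytes needed to encode
// every id up to `max_id` (1..=4), mirroring how variant headers pick a width.
fn int_size(max_id: usize) -> u8 {
    match max_id {
        0..=0xFF => 1,
        0x100..=0xFFFF => 2,
        0x1_0000..=0xFF_FFFF => 3,
        _ => 4,
    }
}

fn main() {
    assert_eq!(int_size(0), 1);
    assert_eq!(int_size(255), 1);
    assert_eq!(int_size(256), 2);
    assert_eq!(int_size(70_000), 3);
    // An overestimated max_id can only keep the width the same or widen it,
    // never shrink it, so correctness is preserved either way.
    assert!(int_size(10) <= int_size(1_000));
}
```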
parquet-variant/src/builder.rs
Outdated
// Write header byte | ||
let header = object_header(is_large, id_size, offset_size); | ||
parent_buffer.append_header(header, is_large, num_fields); | ||
buffer[header_pos] = header; | ||
header_pos += 1; | ||
|
||
// Write field IDs (sorted order) | ||
let ids = self.fields.keys().map(|id| *id as usize); | ||
parent_buffer.append_offset_array(ids, None, id_size); | ||
// Write number of fields | ||
if is_large { | ||
buffer[header_pos..header_pos + 4].copy_from_slice(&(num_fields as u32).to_le_bytes()); | ||
header_pos += 4; | ||
} else { | ||
buffer[header_pos] = num_fields as u8; | ||
header_pos += 1; | ||
} |
This is basically what append_header did before, right? Maybe we can also extract it into a function instead of inlining it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
parquet-variant/src/builder.rs
Outdated
assert_eq!(metadata.len(), 1); | ||
assert_eq!(&metadata[0], "name"); // not rolled back | ||
assert_eq!(metadata.len(), 1); // rolled back | ||
assert_eq!(&metadata[0], "name"); |
Hmm, why update this to "rolled back"? The metadata is the same as before.
Fixed, my bad, this was unintentionally modified.
Thanks for this contribution. It looks reasonable and the benchmark looks good. I left a few comments. I also updated the PR description to more clearly describe what the changes are, to help make it understandable to others interested in this.
d11441d
to
5991cc7
Compare
@viirya Thanks for the detailed review, I've updated the code; please take another look when you're free, thanks.
b6af58d
to
f0d35de
Compare
I'm excited to take a look ASAP, but I'll be on the road a lot this week. Please don't take my absenteeism as a lack of interest! |
This looks correct and is already faster than the baseline.
I'd love to see a follow-up that tries populating that temp vec with the actual header bytes instead of zeros -- hopefully it would give another nice speed boost on top of this one.
}); | ||
|
||
let max_id = self.fields.iter().map(|(i, _)| *i).max().unwrap_or(0); |
Why would it waste space? With high probability, the largest field id in the object will take the same number of bytes to encode as self.fields.len() - 1
(the highest field id in the dictionary), so using the latter shouldn't change anything?
parquet-variant/src/builder.rs
Outdated
self.parent_state | ||
.buffer() | ||
.inner_mut() | ||
.truncate(self.parent_offset_base); |
now that we have two offset bases, should we rename this one as parent_value_offset_base
for clarity?
fixed
assert_eq!(metadata.len(), 1); | ||
assert_eq!(&metadata[0], "name"); // not rolled back | ||
assert!(metadata.is_empty()); // rolled back |
yay! 🎉
@scovich thanks for the review. parent_offset_base has been renamed to parent_value_offset_base, and the "fill the tmp vec with the header before splice" improvement will be done in a follow-up PR.
}); | ||
|
||
let max_id = self.fields.iter().map(|(i, _)| *i).max().unwrap_or(0); |
There will be no waste here; this is where I made a mistake before.
…cation-in-object-builder
Thanks again everyone -- this change is looking really nice now.
I went through this PR again and it looks to me like all the comments have been addressed, but I am not 100% sure
@scovich has approved the PR so I assume he is happy with this approach.
@viirya are you happy with this PR now, or shall we wait for another round of review?
🤖: Benchmark completed Details
|
Ok, looks good to me. Thanks again. Let's address any additional comments as a follow-on PR |
Which issue does this PR close?
ObjectBuilder
#7899. This PR wants to avoid the extra allocation for the object builder and the later buffer copy.
Rationale for this change
Avoid extra allocation in the object builder, as the issue describes.
What changes are included in this PR?
This removes the internal buffer in ObjectBuilder. All data insertion is done directly to the parent buffer wrapped in parent_state. The corresponding new fields are added to ObjectBuilder:
object_start_offset in ObjectBuilder, which describes the start offset in the parent buffer for the current object
has_been_finished in ObjectBuilder, which describes whether the current object has been finished; it is used in the Drop function
This patch modifies the logic of the new, finish, parent_state, and drop functions according to the change. In particular, it writes data into the parent buffer directly when adding a field to the object (i.e., when insert/try_insert is called). When finalizing the object (when finish is called), since the header and field ids must be put in front of the data in the buffer, the builder shifts the written data bytes to make the necessary space for the header and field ids, then writes the header and field ids. In drop, if the builder was not finalized before being dropped, it truncates the written bytes to roll back the parent buffer state.
Are these changes tested?
Are there any user-facing changes?
No