Commit d4f1cfa

Implement Improved arrow-avro Reader Zero-Byte Record Handling (#7966)
# Which issue does this PR close?

- Part of #4886
- Follow up to #7834

# Rationale for this change

The initial Avro reader implementation contained a temporary, under-developed safeguard to prevent infinite loops when processing records that consumed zero bytes from the input buffer. When the `Decoder` reported that zero bytes were consumed, the `Reader` would advance its cursor to the end of the current data block. While this prevented an infinite loop, it had the critical side effect of silently discarding any remaining data in that block, leading to potential data loss.

This change enhances the decoding logic to handle these zero-byte values correctly, ensuring that the `Reader` makes proper progress without dropping data and without risking an infinite loop.

# What changes are included in this PR?

- **Refined Decoder Logic**: The `Decoder` has been updated to accurately track and report the number of bytes consumed for all values, including valid zero-length records like `null` or empty `bytes`. This ensures the decoder always makes forward progress.
- **Removal of Data-Skipping Safeguard**: The logic in the `Reader` that previously advanced to the end of a block on a zero-byte read has been removed. The reader now relies on the decoder to report accurate consumption and advances its cursor incrementally and safely.
- **New Integration Test**: Uses a temporary `zero_byte.avro` file created via this Python script: https://gist.github.com/jecsand838/e57647d0d12853f3cf07c350a6a40395

# Are these changes tested?

Yes, a new `test_read_zero_byte_avro_file` test was added that reads the new `zero_byte.avro` file and confirms the update.

# Are there any user-facing changes?

N/A

# Follow-Up PRs

1. PR to update `test_read_zero_byte_avro_file` once apache/arrow-testing#109 is merged in.
1 parent ed02131 commit d4f1cfa

2 files changed, +28 -8 lines changed

arrow-avro/src/reader/mod.rs

Lines changed: 28 additions & 8 deletions
@@ -157,9 +157,10 @@ impl Decoder {
         let mut total_consumed = 0usize;
         while total_consumed < data.len() && self.decoded_rows < self.batch_size {
             let consumed = self.record_decoder.decode(&data[total_consumed..], 1)?;
-            if consumed == 0 {
-                break;
-            }
+            // A successful call to record_decoder.decode means one row was decoded.
+            // If `consumed` is 0 on a non-empty buffer, it implies a valid zero-byte record.
+            // We increment `decoded_rows` to mark progress and avoid an infinite loop.
+            // We add `consumed` (which can be 0) to `total_consumed`.
             total_consumed += consumed;
             self.decoded_rows += 1;
         }
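
The effect of the hunk above can be illustrated with a toy model. This is a minimal sketch, not the arrow-avro implementation: `decode_one`, `decode_with_break`, `decode_counting_rows`, and the fixed `record_width` parameter are all hypothetical names introduced here, with a width of 0 standing in for a record whose encoding is empty.

```rust
/// Toy stand-in for the per-record decoder (hypothetical, not the arrow-avro API):
/// every record occupies exactly `record_width` bytes, and a width of 0 models
/// a record whose encoding is empty.
fn decode_one(remaining: &[u8], record_width: usize) -> usize {
    record_width.min(remaining.len())
}

/// Loop shape before the change: a zero-byte record is treated as "no progress"
/// and ends the loop. Returns (bytes consumed, rows decoded).
fn decode_with_break(data: &[u8], batch_size: usize, record_width: usize) -> (usize, usize) {
    let (mut total_consumed, mut decoded_rows) = (0usize, 0usize);
    while total_consumed < data.len() && decoded_rows < batch_size {
        let consumed = decode_one(&data[total_consumed..], record_width);
        if consumed == 0 {
            break;
        }
        total_consumed += consumed;
        decoded_rows += 1;
    }
    (total_consumed, decoded_rows)
}

/// Loop shape after the change: every successful decode counts as one row,
/// so the row counter bounds the loop even when no bytes are consumed.
/// Returns (bytes consumed, rows decoded).
fn decode_counting_rows(data: &[u8], batch_size: usize, record_width: usize) -> (usize, usize) {
    let (mut total_consumed, mut decoded_rows) = (0usize, 0usize);
    while total_consumed < data.len() && decoded_rows < batch_size {
        let consumed = decode_one(&data[total_consumed..], record_width);
        total_consumed += consumed; // may add 0
        decoded_rows += 1; // progress is measured in rows
    }
    (total_consumed, decoded_rows)
}
```

In this model, zero-width records on a non-empty buffer yield no rows with the first loop, while the second loop decodes rows until `batch_size` is reached. Termination in the second loop depends on the `decoded_rows < batch_size` condition rather than on byte consumption.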
@@ -364,11 +365,7 @@ impl<R: BufRead> Reader<R> {
             }
             // Try to decode more rows from the current block.
             let consumed = self.decoder.decode(&self.block_data[self.block_cursor..])?;
-            if consumed == 0 && self.block_cursor < self.block_data.len() {
-                self.block_cursor = self.block_data.len();
-            } else {
-                self.block_cursor += consumed;
-            }
+            self.block_cursor += consumed;
         }
         self.decoder.flush()
     }
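
To make the data-loss risk of the removed branch concrete, here is a small simulation of the two cursor policies. It is a sketch under stated assumptions: `Policy`, `walk_block`, and the list of per-call `consumed` values are hypothetical and only mimic the cursor arithmetic in the hunk above, not the real `Reader`.

```rust
/// The two cursor-update policies from the hunk above (names are hypothetical).
#[derive(Clone, Copy)]
enum Policy {
    /// Removed behavior: a zero-byte decode jumps the cursor to the end of the block.
    SkipBlockOnZero,
    /// New behavior: always add whatever the decoder reported.
    AddConsumed,
}

/// Replays a sequence of decoder results against a block of `block_len` bytes.
/// Returns (final cursor, bytes skipped without ever being decoded).
fn walk_block(block_len: usize, consumed_per_call: &[usize], policy: Policy) -> (usize, usize) {
    let (mut cursor, mut skipped) = (0usize, 0usize);
    for &consumed in consumed_per_call {
        if cursor >= block_len {
            break; // block exhausted
        }
        match policy {
            Policy::SkipBlockOnZero if consumed == 0 => {
                // Everything between the cursor and the end of the block is dropped.
                skipped += block_len - cursor;
                cursor = block_len;
            }
            _ => cursor += consumed,
        }
    }
    (cursor, skipped)
}
```

For a 7-byte block where the decoder reports 0, then 4, then 3 bytes, the removed policy skips all 7 bytes after the first call, whereas the new policy reaches the end of the block with nothing skipped.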
@@ -499,6 +496,29 @@ mod test {
         assert!(batch.column(0).as_any().is::<StringViewArray>());
     }
 
+    #[test]
+    fn test_read_zero_byte_avro_file() {
+        let batch = read_file("test/data/zero_byte.avro", 3, false);
+        let schema = batch.schema();
+        assert_eq!(schema.fields().len(), 1);
+        let field = schema.field(0);
+        assert_eq!(field.name(), "data");
+        assert_eq!(field.data_type(), &DataType::Binary);
+        assert!(field.is_nullable());
+        assert_eq!(batch.num_rows(), 3);
+        assert_eq!(batch.num_columns(), 1);
+        let binary_array = batch
+            .column(0)
+            .as_any()
+            .downcast_ref::<BinaryArray>()
+            .unwrap();
+        assert!(binary_array.is_null(0));
+        assert!(binary_array.is_valid(1));
+        assert_eq!(binary_array.value(1), b"");
+        assert!(binary_array.is_valid(2));
+        assert_eq!(binary_array.value(2), b"some bytes");
+    }
+
     #[test]
     fn test_alltypes() {
         let files = [

arrow-avro/test/data/zero_byte.avro

210 Bytes
Binary file not shown.
