fix(scan): size parquet scan splits by decoded GPU bytes, not encoded size#950
Draft
mbrobbel wants to merge 1 commit into
Draft
fix(scan): size parquet scan splits by decoded GPU bytes, not encoded size#950mbrobbel wants to merge 1 commit into
mbrobbel wants to merge 1 commit into
Conversation
… size parquet_gpu_ingestible partitioned row groups into scan splits by capping on the parquet ENCODED-uncompressed byte size (ColumnChunk total_uncompressed_size) against scan_task_batch_size. But the GPU memory a batch occupies is the DECODED size: dictionary/RLE-encoded fixed-width columns decode to far more than their encoded size, so the cap badly under-counted GPU memory and produced oversized batches. On a tight GPU budget a single oversized batch cannot fit even after spill/reschedule, livelocking the OOM-reschedule loop until the retry cap trips and the query fails (observed on TPC-H Q1 over a ~60M-row lineitem at a 5 GiB budget: 3 splits of ~1.3 GB decoded each). Estimate decoded bytes per row group instead: for each column the reader materializes, fixed-width types contribute row_count x cuDF decoded width (logical_type::fixed_width_byte_size()) plus a validity mask; VARCHAR and nested types fall back to the encoded size plus offset/validity bytes. Types are resolved via scan_plan data_column primary_idx -> returned_types for projected scans, and 1:1 with returned_types for unprojected (identity) scans; when the column count cannot be aligned to types the original encoded-size sizing is kept. The decoded estimate also flows into parquet_split_info::estimated_bytes() (via row_group_slice::reserved_uncompressed_bytes), so the reservation system sizes tasks from the same figure. This matches the documented intent of approximate_batch_size (the duckdb-native coalescer caps "decoded bytes per batch"); the parquet path was measuring the wrong quantity. Closes the SF10 TPC-H Q1 over-S3 OOM at a 5 GiB budget; the standard s3-test (63 cases) and the full s3-test-large gate pass with no regression. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collaborator
|
dont merge this, @mbrobbel . |
Member
Author
This was not ready yet (draft), but can you elaborate? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
parquet_gpu_ingestiblepartitions row groups into scan splits by capping on the parquet encoded-uncompressed byte size (ColumnChunk.total_uncompressed_size) againstscan_task_batch_size. But the GPU memory a batch occupies is the decoded size — dictionary/RLE-encoded fixed-width columns decode to far more than their encoded size. So the cap badly under-counts GPU memory and produces oversized batches; on a tight GPU budget a single oversized batch cannot fit even after spill/reschedule, livelocking the OOM-reschedule loop until the retry cap trips and the query fails.Observed on TPC-H Q1 over a ~60M-row lineitem read from S3 at a 5 GiB budget: the projected/scanned columns were ~370 MB encoded but ~3.8 GB decoded, so the 512 MB cap produced only 3 splits of ~1.3 GB decoded each → OOM.
Fix
Estimate decoded bytes per row group instead of encoded size:
row_count × logical_type::fixed_width_byte_size()+ a validity mask.scan_plandata_column::primary_idx → returned_typesfor projected scans, and 1:1 withreturned_typesfor unprojected (identity) scans; when the column count can't be aligned to types, the original encoded-size sizing is kept.The decoded estimate also flows into
parquet_split_info::estimated_bytes()(viarow_group_slice::reserved_uncompressed_bytes), so the reservation system sizes tasks from the same figure. This matches the documented intent ofapproximate_batch_size(the duckdb-native coalescer already caps "decoded bytes per batch").Verification
s3-test-largegate: 88 + 50 assertions, all pass.s3-testregression: 63 cases, all pass.Known follow-ups before ready-for-review (from adversarial review)
estimated_bytesdrive the reservation.)sirius_gpu_scan_operator::no_history_peak_memory_estimatemultiplies a fresh scan's input basis by 8 (assuming pre-decode input), but the basis is now the decoded estimate — over-reserving on the first (no-history) task. Adjust the no-history estimate to recognize decoded scan bases (or keep a separate pre-decode basis).estimated_bytes()with filter-only columns and decoded-size split caps.Related
Complements (but does not depend on) the reservation-clamp fix in #945; the two were verified together for the 5 GiB large-Q1 case.
🤖 Generated with Claude Code