Skip to content

fix(scan): size parquet scan splits by decoded GPU bytes, not encoded size#950

Draft
mbrobbel wants to merge 1 commit into
sirius-db:devfrom
mbrobbel:fix/parquet-scan-decoded-batch-size
Draft

fix(scan): size parquet scan splits by decoded GPU bytes, not encoded size#950
mbrobbel wants to merge 1 commit into
sirius-db:devfrom
mbrobbel:fix/parquet-scan-decoded-batch-size

Conversation

@mbrobbel

Copy link
Copy Markdown
Member

Problem

parquet_gpu_ingestible partitions row groups into scan splits by capping on the parquet encoded-uncompressed byte size (ColumnChunk.total_uncompressed_size) against scan_task_batch_size. But the GPU memory a batch occupies is the decoded size — dictionary/RLE-encoded fixed-width columns decode to far more than their encoded size. So the cap badly under-counts GPU memory and produces oversized batches; on a tight GPU budget a single oversized batch cannot fit even after spill/reschedule, livelocking the OOM-reschedule loop until the retry cap trips and the query fails.

Observed on TPC-H Q1 over a ~60M-row lineitem read from S3 at a 5 GiB budget: the projected/scanned columns were ~370 MB encoded but ~3.8 GB decoded, so the 512 MB cap produced only 3 splits of ~1.3 GB decoded each → OOM.

Fix

Estimate decoded bytes per row group instead of encoded size:

  • Fixed-width columns: row_count × logical_type::fixed_width_byte_size() + a validity mask.
  • VARCHAR / nested / unknown: fall back to the encoded size + offset/validity bytes.
  • Types resolved via scan_plan data_column::primary_idx → returned_types for projected scans, and 1:1 with returned_types for unprojected (identity) scans; when the column count can't be aligned to types, the original encoded-size sizing is kept.

The decoded estimate also flows into parquet_split_info::estimated_bytes() (via row_group_slice::reserved_uncompressed_bytes), so the reservation system sizes tasks from the same figure. This matches the documented intent of approximate_batch_size (the duckdb-native coalescer already caps "decoded bytes per batch").

Verification

  • TPC-H Q1 over-S3 at a 5 GiB budget: 3 → 12 splits, passes (was OOM).
  • Full s3-test-large gate: 88 + 50 assertions, all pass.
  • s3-test regression: 63 cases, all pass.

Known follow-ups before ready-for-review (from adversarial review)

  • Pure-filter columns are excluded from the per-row-group estimate but are decoded to evaluate filters. A filter-heavy / small-output projected query could under-reserve and OOM. Account for filter-only columns in the materialization estimate (or split materialized-bytes from output-bytes). (Pre-existing exclusion, but this change makes estimated_bytes drive the reservation.)
  • 8× scan heuristic double-counts: sirius_gpu_scan_operator::no_history_peak_memory_estimate multiplies a fresh scan's input basis by 8 (assuming pre-decode input), but the basis is now the decoded estimate — over-reserving on the first (no-history) task. Adjust the no-history estimate to recognize decoded scan bases (or keep a separate pre-decode basis).
  • Add focused tests for estimated_bytes() with filter-only columns and decoded-size split caps.

Related

Complements (but does not depend on) the reservation-clamp fix in #945; the two were verified together for the 5 GiB large-Q1 case.

🤖 Generated with Claude Code

… size

parquet_gpu_ingestible partitioned row groups into scan splits by capping on
the parquet ENCODED-uncompressed byte size (ColumnChunk total_uncompressed_size)
against scan_task_batch_size. But the GPU memory a batch occupies is the DECODED
size: dictionary/RLE-encoded fixed-width columns decode to far more than their
encoded size, so the cap badly under-counted GPU memory and produced oversized
batches. On a tight GPU budget a single oversized batch cannot fit even after
spill/reschedule, livelocking the OOM-reschedule loop until the retry cap trips
and the query fails (observed on TPC-H Q1 over a ~60M-row lineitem at a 5 GiB
budget: 3 splits of ~1.3 GB decoded each).

Estimate decoded bytes per row group instead: for each column the reader
materializes, fixed-width types contribute row_count x cuDF decoded width
(logical_type::fixed_width_byte_size()) plus a validity mask; VARCHAR and nested
types fall back to the encoded size plus offset/validity bytes. Types are
resolved via scan_plan data_column primary_idx -> returned_types for projected
scans, and 1:1 with returned_types for unprojected (identity) scans; when the
column count cannot be aligned to types the original encoded-size sizing is
kept. The decoded estimate also flows into parquet_split_info::estimated_bytes()
(via row_group_slice::reserved_uncompressed_bytes), so the reservation system
sizes tasks from the same figure.

This matches the documented intent of approximate_batch_size (the duckdb-native
coalescer caps "decoded bytes per batch"); the parquet path was measuring the
wrong quantity. Closes the SF10 TPC-H Q1 over-S3 OOM at a 5 GiB budget; the
standard s3-test (63 cases) and the full s3-test-large gate pass with no
regression.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@aminaramoon

Copy link
Copy Markdown
Collaborator

dont merge this, @mbrobbel .
@kckristensen see if you can take that into account while working on your coalescer

@mbrobbel

Copy link
Copy Markdown
Member Author

dont merge this, @mbrobbel

This was not ready yet (draft), but can you elaborate?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants