fix(scan): size parquet scan splits by decoded GPU bytes, not encoded size by mbrobbel · Pull Request #950 · sirius-db/sirius

mbrobbel · 2026-06-15T21:15:52Z

Problem

parquet_gpu_ingestible partitions row groups into scan splits by capping on the parquet encoded-uncompressed byte size (ColumnChunk.total_uncompressed_size) against scan_task_batch_size. But the GPU memory a batch occupies is the decoded size — dictionary/RLE-encoded fixed-width columns decode to far more than their encoded size. So the cap badly under-counts GPU memory and produces oversized batches; on a tight GPU budget a single oversized batch cannot fit even after spill/reschedule, livelocking the OOM-reschedule loop until the retry cap trips and the query fails.

Observed on TPC-H Q1 over a ~60M-row lineitem read from S3 at a 5 GiB budget: the projected/scanned columns were ~370 MB encoded but ~3.8 GB decoded, so the 512 MB cap produced only 3 splits of ~1.3 GB decoded each → OOM.

Fix

Estimate decoded bytes per row group instead of encoded size:

Fixed-width columns: row_count × logical_type::fixed_width_byte_size() + a validity mask.
VARCHAR / nested / unknown: fall back to the encoded size + offset/validity bytes.
Types resolved via scan_plan data_column::primary_idx → returned_types for projected scans, and 1:1 with returned_types for unprojected (identity) scans; when the column count can't be aligned to types, the original encoded-size sizing is kept.

The decoded estimate also flows into parquet_split_info::estimated_bytes() (via row_group_slice::reserved_uncompressed_bytes), so the reservation system sizes tasks from the same figure. This matches the documented intent of approximate_batch_size (the duckdb-native coalescer already caps "decoded bytes per batch").

Verification

TPC-H Q1 over-S3 at a 5 GiB budget: 3 → 12 splits, passes (was OOM).
Full s3-test-large gate: 88 + 50 assertions, all pass.
s3-test regression: 63 cases, all pass.

Known follow-ups before ready-for-review (from adversarial review)

Pure-filter columns are excluded from the per-row-group estimate but are decoded to evaluate filters. A filter-heavy / small-output projected query could under-reserve and OOM. Account for filter-only columns in the materialization estimate (or split materialized-bytes from output-bytes). (Pre-existing exclusion, but this change makes estimated_bytes drive the reservation.)
8× scan heuristic double-counts: sirius_gpu_scan_operator::no_history_peak_memory_estimate multiplies a fresh scan's input basis by 8 (assuming pre-decode input), but the basis is now the decoded estimate — over-reserving on the first (no-history) task. Adjust the no-history estimate to recognize decoded scan bases (or keep a separate pre-decode basis).
Add focused tests for estimated_bytes() with filter-only columns and decoded-size split caps.

… size parquet_gpu_ingestible partitioned row groups into scan splits by capping on the parquet ENCODED-uncompressed byte size (ColumnChunk total_uncompressed_size) against scan_task_batch_size. But the GPU memory a batch occupies is the DECODED size: dictionary/RLE-encoded fixed-width columns decode to far more than their encoded size, so the cap badly under-counted GPU memory and produced oversized batches. On a tight GPU budget a single oversized batch cannot fit even after spill/reschedule, livelocking the OOM-reschedule loop until the retry cap trips and the query fails (observed on TPC-H Q1 over a ~60M-row lineitem at a 5 GiB budget: 3 splits of ~1.3 GB decoded each). Estimate decoded bytes per row group instead: for each column the reader materializes, fixed-width types contribute row_count x cuDF decoded width (logical_type::fixed_width_byte_size()) plus a validity mask; VARCHAR and nested types fall back to the encoded size plus offset/validity bytes. Types are resolved via scan_plan data_column primary_idx -> returned_types for projected scans, and 1:1 with returned_types for unprojected (identity) scans; when the column count cannot be aligned to types the original encoded-size sizing is kept. The decoded estimate also flows into parquet_split_info::estimated_bytes() (via row_group_slice::reserved_uncompressed_bytes), so the reservation system sizes tasks from the same figure. This matches the documented intent of approximate_batch_size (the duckdb-native coalescer caps "decoded bytes per batch"); the parquet path was measuring the wrong quantity. Closes the SF10 TPC-H Q1 over-S3 OOM at a 5 GiB budget; the standard s3-test (63 cases) and the full s3-test-large gate pass with no regression. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

aminaramoon · 2026-06-15T23:42:43Z

dont merge this, @mbrobbel .
@kckristensen see if you can take that into account while working on your coalescer

mbrobbel · 2026-06-16T08:06:08Z

dont merge this, @mbrobbel

This was not ready yet (draft), but can you elaborate?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(scan): size parquet scan splits by decoded GPU bytes, not encoded size#950

fix(scan): size parquet scan splits by decoded GPU bytes, not encoded size#950
mbrobbel wants to merge 1 commit into
sirius-db:devfrom
mbrobbel:fix/parquet-scan-decoded-batch-size

mbrobbel commented Jun 15, 2026

Uh oh!

aminaramoon commented Jun 15, 2026

Uh oh!

mbrobbel commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

mbrobbel commented Jun 15, 2026

Problem

Fix

Verification

Known follow-ups before ready-for-review (from adversarial review)

Related

Uh oh!

aminaramoon commented Jun 15, 2026

Uh oh!

mbrobbel commented Jun 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants