EPIC: Support parallel scan in iceberg-datafusion

### What's the feature you are trying to implement?

As @colinmarc mentioned in https://apache-iceberg.slack.com/archives/C05HTENMJG4/p1753476472857519 , the iceberg-datafusion integration performs worse than pure DataFusion reading Parquet files directly. As @liurenjie1024 pointed out, this is because the iceberg-datafusion integration currently uses only one thread. This issue proposes a design to support parallel scan in the iceberg-datafusion integration. Thanks to https://github.com/colinmarc/iceberg-datafusion-benchmarks from @colinmarc, which lets us dig into the bottleneck!

## Row group based parallel scan 

This parallel scan is row group based. The basic idea is to break the files to scan into row groups, prune those groups, and pack what remains according to the parallelism set by DataFusion. The benefits of row-group-based parallelism:
1. Scans can run in parallel even when there are only a few files.
2. The read load is distributed more evenly after some row groups in a file have been pruned, e.g. https://github.com/apache/iceberg-rust/blob/d9fbc5c97e4d126a3850095beda725a8eb30229b/crates/iceberg/src/arrow/reader.rs#L241

The process can be described as follows:
- 1 is our current [plan_file](https://github.com/apache/iceberg-rust/blob/d9fbc5c97e4d126a3850095beda725a8eb30229b/crates/iceberg/src/scan/mod.rs#L334): it prunes the Iceberg metadata and finally produces the `FileScanTask`s. Each `FileScanTask` is one data file bound to the delete files related to it.
- 2 We introduce a `GroupPruner`, which attaches group info to a `FileScanTask` and prunes some groups based on the predicate. We can implement a group pruner per file format, e.g. `ParquetGroupPruner`, to process `FileScanTask`s of different formats.
- 3 A `FileScanTask` with group info can be split into multiple `FileScanTask`s, or merged with others. Based on the parallelism, we can repartition the `FileScanTask`s into multiple `PartitionTask`s, each containing several `FileScanTask`s.

Each partition task is returned as a `RecordBatchStream` when the corresponding partition is executed.
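To make step 3 more concrete, here is a minimal sketch of how the surviving row groups could be packed into partitions. It is only an illustration, not the proposed implementation: the `RowGroupTask` struct, the fields of `PartitionTask`, and the `pack_row_groups` function are all hypothetical, and the greedy "largest first, into the lightest partition" strategy is just one possible way to balance the load.

```rust
/// Hypothetical unit of work: one row group that survived pruning.
#[derive(Debug, Clone, PartialEq)]
pub struct RowGroupTask {
    pub data_file: String,
    pub row_group: usize,
    pub bytes: u64,
}

/// Hypothetical partition: the row groups a single DataFusion partition reads.
#[derive(Debug, Default)]
pub struct PartitionTask {
    pub row_groups: Vec<RowGroupTask>,
    pub total_bytes: u64,
}

/// Greedily packs row groups into `parallelism` partitions (assumed strategy):
/// sort by size descending, then always add to the currently lightest partition.
pub fn pack_row_groups(mut tasks: Vec<RowGroupTask>, parallelism: usize) -> Vec<PartitionTask> {
    // Always produce at least one partition, even if parallelism is 0.
    let n = parallelism.max(1);
    let mut partitions: Vec<PartitionTask> =
        (0..n).map(|_| PartitionTask::default()).collect();

    // Placing the largest row groups first keeps the final totals closer together.
    tasks.sort_by(|a, b| b.bytes.cmp(&a.bytes));

    for task in tasks {
        // Pick the partition with the smallest total so far.
        let target = partitions
            .iter_mut()
            .min_by_key(|p| p.total_bytes)
            .expect("there is always at least one partition");
        target.total_bytes += task.bytes;
        target.row_groups.push(task);
    }
    partitions
}
```

With this shape, a single data file that has five row groups can still be spread across two partitions, which is the "parallel scan even with few files" benefit described above.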

<img width="1315" height="340" alt="Image" src="https://github.com/user-attachments/assets/31a112cc-2305-47fa-8568-0322d03ed070" />


### Willingness to contribute

Yes
