Description
What feature are you trying to implement?
Apache DataFusion Comet is an Apache Spark accelerator with Apache Iceberg support. We would like to enhance that support by leveraging Iceberg-Rust. You can find the details of this effort in the POC PR apache/datafusion-comet#2528 and in slides presented at the 10/9/25 Iceberg-Rust community call.
The short version is that Comet will rely on Apache Iceberg's Java integration with Apache Spark for planning, and then pass the generated FileScanTasks to Iceberg-Rust via a new DataFusion IcebergScan operator in Comet. We need many new (or newly public) APIs in the ArrowReader, since we are bypassing the Table interface to avoid redundant (and possibly incorrectly partitioned) planning. I will start to accumulate those efforts here.
One benefit of this approach is that I can run the Iceberg Java tests against Iceberg Rust's reader. There are gaps in features, so I hope to rapidly iterate on improving Iceberg Rust's reader to support them. I am not using Iceberg Rust's table interface or planning, so others will need to fill the gaps there, but I think this will greatly improve and harden Iceberg Rust's reader.
- Make `ArrowReaderBuilder::new` `pub` instead of `pub(crate)` (feat(reader): Make ArrowReaderBuilder::new public #1748)
- Expose `ArrowReaderOptions` in `ArrowReaderBuilder`. This likely requires a new Iceberg-Rust Cargo feature, like in DataFusion, to enable the `encryption` feature for the Parquet crate.
- Read Parquet files without field ID metadata (migrated tables) (feat(reader): position-based column projection for Parquet files without field IDs (migrated tables) #1777)
- Read Parquet files with both equality and position deletes (fix(reader): Support both position and equality delete files on the same FileScanTask #1778)
- Filter row groups when FileScanTask includes byte ranges (fix(reader): filter row groups when FileScanTask contains byte ranges #1779)
- Equality deletes with partial schemas (fix(reader): Equality delete files with partial schemas (containing only equality columns) #1782)
- Date32 support in RecordBatchTransformer (feat(reader): Add Date32 support to RecordBatchTransformer create_column #1792)
- Date32 default value from days since epoch, not just string (feat(reader): Date32 from days since epoch for Literal::try_from_json #1803)
- Field ID conflict resolution after addFiles (feat(reader): Add PartitionSpec support to FileScanTask and RecordBatchTransformer #1821)
- Support complex types in pushdown filters
- Support binary, fixedSizeBinary, and decimal(28+) partition values
- Bugs with position delete files and row group skipping (fix(reader): fix position delete bugs with row group skipping #1806)
- Failed to deserialize JSON struct with type field (fix: StructType fails to deserialize JSON with type field #1822)
Willingness to contribute
I can contribute to this feature independently