Add delete file index to pyiceberg and support equality delete reads #2255
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Closes #1210
Summary
This work was primarily done by @rutb327 while I provided guidance!
This PR adds equality delete read support to PyIceberg by implementing the delete file indexing system that matches delete files to data files, mimicking the behavior found in Iceberg Core. With this implementation we are able to index files and now read equality deletes during table scans.
Design details
Delete File Index
The new
DeleteFileIndex
class centralizes handling of all delete file types: positional deletes, equality deletes, and deletion vectors. It organizes deletes by type (equality vs. positional), partition (usingPartitionMap
for spec-aware grouping), and path (for path-specific positional deletes). This enables efficient lookup during table scans, reducing unnecessary delete file processing.Equality Delete support
Equality delete files are loaded as PyArrow Tables with their respective equality ids for the schema and for each we are grouping tables with the same set equality id's to reduce anti join operations.
Testing
Added tests from the core iceberg DeleteFileIndex test suite and added some tests with dummy files. As well as some manual testing with a flink setup.
Are there any user-facing changes?
Yes can read tables with equality deletes