[WIP] [Feature] Vastly improved, shardless PyArrow performance on large datasets #59
Open
sami-bg wants to merge 9 commits into galilai-group:main from
What does this PR do?
TL;DR:
`train_test_split()` no longer OOMs on ImageNet-1k (or any large dataset). Instead of materializing the entire Arrow table in memory to split it, we now use an index indirection layer: a small uint64 array that maps "virtual" row positions to physical positions in the on-disk shards. Shuffling, filtering, splitting, and slicing all just create a new index array pointing at the same shard files.

This also adds a few things that were missing from the original PyArrow migration: IPC compression (zstd by default, which cuts disk usage roughly in half), parallel image encoding during cache construction via a thread pool, an `IterableDataset` wrapper with proper worker sharding and a reservoir-based row-level shuffle for training, a format pipeline (`with_format("torch")` gives you CHW float tensors; `"raw"` skips PIL decode entirely for DALI-style pipelines), and resumable downloads with optional checksum validation.

The key idea, borrowed from HuggingFace, is their `_indices` trick: every "derived" dataset (the train split after `train_test_split`, the output of `shuffle()`, a filtered subset) is a thin view object holding an index array and a pointer to the same shards on disk. Composition is transitive, so `ds.select(A).select(B)` is a single numpy gather.

Also includes 31 new tests covering the new functionality across all five areas.
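To make the `_indices` idea concrete, here is a minimal sketch of a view object over shared row storage. The class name `ArrowView` and the use of a plain numpy array in place of the on-disk Arrow shards are illustrative assumptions, not this PR's actual API; the point is that every derived dataset is just a new index array, and chained `select` calls collapse into one numpy gather.

```python
import numpy as np

class ArrowView:
    """Hypothetical sketch of the index-indirection trick: a dataset view
    holds a uint64 index array over shared physical rows (a Python list
    stands in here for the on-disk Arrow shards)."""

    def __init__(self, rows, indices=None):
        self.rows = rows  # shared storage, never copied
        self.indices = (np.arange(len(rows), dtype=np.uint64)
                        if indices is None else indices)

    def select(self, idx):
        # Composition is a single gather: new virtual -> old virtual -> physical.
        return ArrowView(self.rows, self.indices[np.asarray(idx, dtype=np.uint64)])

    def shuffle(self, seed=0):
        # Shuffling only permutes the index array; rows stay put on disk.
        return ArrowView(self.rows, np.random.default_rng(seed).permutation(self.indices))

    def train_test_split(self, test_size=0.2, seed=0):
        # Splitting never materializes rows: both splits share self.rows.
        perm = np.random.default_rng(seed).permutation(self.indices)
        cut = int(len(perm) * test_size)
        return ArrowView(self.rows, perm[cut:]), ArrowView(self.rows, perm[:cut])

    def __getitem__(self, i):
        return self.rows[self.indices[i]]

    def __len__(self):
        return len(self.indices)
```

Because each view carries only an index array, `ds.select(A).select(B)` costs one gather of `len(B)` uint64s regardless of dataset size, which is why the split no longer OOMs.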
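The worker sharding and reservoir-based row shuffle mentioned above can be sketched roughly as follows. These helper names (`worker_shard`, `reservoir_shuffle`) are hypothetical, not the PR's actual functions; they illustrate the standard techniques: strided shard assignment so workers see disjoint rows, and a fixed-size buffer shuffle so memory stays bounded while streaming.

```python
import random

def worker_shard(n_rows, worker_id, num_workers):
    # Strided assignment: shards are pairwise disjoint and together
    # cover every row exactly once.
    return range(worker_id, n_rows, num_workers)

def reservoir_shuffle(stream, buffer_size, seed=0):
    # Approximate streaming shuffle: fill a fixed-size buffer, then for
    # each incoming row emit a uniformly chosen buffered row and put the
    # new row in its slot. Memory stays O(buffer_size), not O(dataset).
    rng = random.Random(seed)
    buf = []
    for item in stream:
        if len(buf) < buffer_size:
            buf.append(item)
        else:
            j = rng.randrange(buffer_size)
            yield buf[j]
            buf[j] = item
    rng.shuffle(buf)      # drain the remainder in random order
    yield from buf
```

A worker in the `IterableDataset` wrapper would then iterate something like `reservoir_shuffle(worker_shard(len(ds), worker_id, num_workers), buffer_size=1024)`: larger buffers approach a true uniform shuffle at the cost of memory.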
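The "CHW float tensors" conversion behind `with_format("torch")` amounts to a transpose plus a dtype rescale. A minimal sketch, using numpy in place of torch (the function name is illustrative, and the [0, 1] scaling is an assumption about the pipeline's convention):

```python
import numpy as np

def to_chw_float(hwc_uint8):
    # Decoded images arrive HWC uint8; training wants CHW float.
    # Transpose axes (H, W, C) -> (C, H, W) and rescale to [0, 1].
    return np.transpose(hwc_uint8, (2, 0, 1)).astype(np.float32) / 255.0
```

The `"raw"` format simply skips this step (and the PIL decode before it), handing the encoded bytes straight to a DALI-style GPU decoder.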
Fixes # (issue)
Before submitting
Pull Request section?
documentation guidelines
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.