[WIP] [Feature] Vastly improved, shardless PyArrow performance on large datasets #59

Open
sami-bg wants to merge 9 commits into galilai-group:main from sami-bg:pyarrow-migration-shardless

sami-bg commented on Mar 20, 2026

What does this PR do?

TL;DR: train_test_split() no longer OOMs on ImageNet-1k (or any large dataset). Instead of materializing the entire Arrow table in memory to split it, we now use an index indirection layer: a uint64 array (8 bytes per row) that maps "virtual" row positions to physical positions in the on-disk shards. Shuffling, filtering, splitting, and slicing each create a new index array that points at the same shard files, so no row data is copied.
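As a rough illustration of the idea, here is a minimal sketch of an index-only split. The function name `split_indices`, its parameters, and the stand-in `physical_rows` array are hypothetical and are not taken from this PR; the only thing it assumes from the description is that a split can be represented as two uint64 index arrays over the same physical rows.

```python
import numpy as np


def split_indices(num_rows, test_fraction=0.2, seed=0):
    """Hypothetical helper: split row positions without touching row data.

    Returns (train_idx, test_idx), two uint64 arrays of physical row positions.
    """
    rng = np.random.default_rng(seed)
    # One permutation of row positions; the rows themselves are never loaded.
    perm = rng.permutation(num_rows).astype(np.uint64)
    n_test = int(round(num_rows * test_fraction))
    return perm[n_test:], perm[:n_test]


train_idx, test_idx = split_indices(10, test_fraction=0.2, seed=0)

# Stand-in for on-disk rows; a real reader would look up shard + offset instead.
physical_rows = np.arange(100, 110)
train_rows = physical_rows[train_idx.astype(np.intp)]
```

Memory cost here scales with the number of rows (8 bytes each), not with the size of the images those rows contain.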

This also adds a few things that were missing from the original PyArrow migration:

  • IPC compression (zstd by default, which cuts disk usage roughly in half).
  • Parallel image encoding during cache construction via a thread pool.
  • An IterableDataset wrapper with proper worker sharding and a reservoir-based row-level shuffle for training.
  • A format pipeline: with_format("torch") gives you CHW float tensors, and "raw" skips PIL decode entirely for DALI-style pipelines.
  • Resumable downloads with optional checksum validation.
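To make the worker sharding and row-level shuffle concrete, here is a small standard-library sketch. It uses a fixed-size shuffle buffer, which is one common way to approximate a global shuffle over a stream; whether the PR's "reservoir-based" shuffle works exactly this way is an assumption. The names `shard_for_worker` and `buffered_shuffle` are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def shard_for_worker(shard_ids: Sequence[T], worker_id: int, num_workers: int) -> List[T]:
    """Give each worker a disjoint, strided subset of the shards."""
    return list(shard_ids[worker_id::num_workers])


def buffered_shuffle(rows: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield rows in approximately random order using a bounded buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for row in rows:
        if len(buffer) < buffer_size:
            buffer.append(row)  # fill phase: nothing is emitted yet
            continue
        slot = rng.randrange(buffer_size)
        yield buffer[slot]      # emit a random buffered row
        buffer[slot] = row      # and put the incoming row in its place
    rng.shuffle(buffer)         # drain whatever is left, in random order
    yield from buffer


worker_shards = shard_for_worker(list(range(10)), worker_id=1, num_workers=4)
shuffled = list(buffered_shuffle(range(100), buffer_size=8, seed=0))
```

A larger buffer gives a shuffle closer to uniform at the cost of more rows held in memory per worker.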

The key idea, borrowed from HuggingFace, is their _indices trick: every "derived" dataset (the train split after train_test_split, the output of shuffle(), a filtered subset) is just a thin view object holding an index array and a pointer to the same shards on disk. Views compose: ds.select(A).select(B) collapses to the index array A[B], so however many operations are chained, resolving a view to physical rows is a single numpy gather.
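The composition rule can be sketched with a toy view class. `IndexView` and its methods are hypothetical names for illustration; they are not the classes introduced by this PR, and the shard lookup is omitted entirely.

```python
import numpy as np


class IndexView:
    """Hypothetical view: an index array over rows that stay on disk."""

    def __init__(self, num_physical_rows, indices=None):
        self.num_physical_rows = num_physical_rows
        if indices is None:
            # Identity mapping: virtual position i is physical row i.
            indices = np.arange(num_physical_rows, dtype=np.uint64)
        self.indices = np.asarray(indices, dtype=np.uint64)

    def __len__(self):
        return len(self.indices)

    def select(self, positions):
        # Composing two views is one gather on the parent's index array.
        positions = np.asarray(positions, dtype=np.intp)
        return IndexView(self.num_physical_rows, self.indices[positions])

    def shuffle(self, seed=0):
        rng = np.random.default_rng(seed)
        return self.select(rng.permutation(len(self)))

    def filter(self, mask):
        # mask has one boolean per row of this view.
        return self.select(np.flatnonzero(np.asarray(mask, dtype=bool)))


ds = IndexView(10)
a = ds.select([5, 6, 7, 8])
b = a.select([0, 2])  # b.indices is [5, 7]: physical rows, not positions in a
```

Because every derived view stores physical positions directly, reading a row never has to walk back through a chain of parent views.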

Also includes 31 new tests covering the new functionality across all five areas.
Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

sami-bg force-pushed the pyarrow-migration-shardless branch from ba4a6c8 to ed77b41 on March 25, 2026 03:20