[WIP] [Feature] Vastly improved, shardless PyArrow performance on large datasets #59
Open
sami-bg wants to merge 9 commits into galilai-group:main from
What does this PR do?
TL;DR:
`train_test_split()` no longer OOMs on ImageNet-1k (or any large dataset). Instead of materializing the entire Arrow table in memory to split it, we now use an index indirection layer: a small uint64 array that maps "virtual" row positions to physical positions in the on-disk shards. Shuffling, filtering, splitting, and slicing all just create a new index array pointing at the same shard files.

This also adds a few things that were missing from the original PyArrow migration: IPC compression (zstd by default, which cuts disk usage roughly in half), parallel image encoding during cache construction via a thread pool, an `IterableDataset` wrapper with proper worker sharding and a reservoir-based row-level shuffle for training, a format pipeline (`with_format("torch")` gives you CHW float tensors; `"raw"` skips PIL decode entirely for DALI-style pipelines), and resumable downloads with optional checksum validation.

The key idea, borrowed from HuggingFace, is their `_indices` trick: every "derived" dataset (the train split after `train_test_split`, the output of `shuffle()`, a filtered subset) is a thin view object holding an index array and a pointer to the same shards on disk. Composition is transitive, so `ds.select(A).select(B)` is a single numpy gather.

Also includes 31 new tests covering the new functionality across all five areas.
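To make the `_indices` idea concrete, here is a minimal sketch of a view object over shared row storage. The class name `ArrowView` and the use of a plain numpy array in place of the on-disk Arrow shards are illustrative assumptions, not this PR's actual API; the point is that every derived dataset is just a new index array, and chained `select` calls collapse into one numpy gather.

```python
import numpy as np

class ArrowView:
    """Hypothetical sketch of the index-indirection trick: a dataset view
    holds a uint64 index array over shared physical rows (a Python list
    stands in here for the on-disk Arrow shards)."""

    def __init__(self, rows, indices=None):
        self.rows = rows  # shared storage, never copied
        self.indices = (np.arange(len(rows), dtype=np.uint64)
                        if indices is None else indices)

    def select(self, idx):
        # Composition is a single gather: new virtual -> old virtual -> physical.
        return ArrowView(self.rows, self.indices[np.asarray(idx, dtype=np.uint64)])

    def shuffle(self, seed=0):
        # Shuffling only permutes the index array; rows stay put on disk.
        return ArrowView(self.rows, np.random.default_rng(seed).permutation(self.indices))

    def train_test_split(self, test_size=0.2, seed=0):
        # Splitting never materializes rows: both splits share self.rows.
        perm = np.random.default_rng(seed).permutation(self.indices)
        cut = int(len(perm) * test_size)
        return ArrowView(self.rows, perm[cut:]), ArrowView(self.rows, perm[:cut])

    def __getitem__(self, i):
        return self.rows[self.indices[i]]

    def __len__(self):
        return len(self.indices)
```

Because each view carries only an index array, `ds.select(A).select(B)` costs one gather of `len(B)` uint64s regardless of dataset size, which is why the split no longer OOMs.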
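The worker sharding and reservoir-based row shuffle mentioned above can be sketched roughly as follows. These helper names (`worker_shard`, `reservoir_shuffle`) are hypothetical, not the PR's actual functions; they illustrate the standard techniques: strided shard assignment so workers see disjoint rows, and a fixed-size buffer shuffle so memory stays bounded while streaming.

```python
import random

def worker_shard(n_rows, worker_id, num_workers):
    # Strided assignment: shards are pairwise disjoint and together
    # cover every row exactly once.
    return range(worker_id, n_rows, num_workers)

def reservoir_shuffle(stream, buffer_size, seed=0):
    # Approximate streaming shuffle: fill a fixed-size buffer, then for
    # each incoming row emit a uniformly chosen buffered row and put the
    # new row in its slot. Memory stays O(buffer_size), not O(dataset).
    rng = random.Random(seed)
    buf = []
    for item in stream:
        if len(buf) < buffer_size:
            buf.append(item)
        else:
            j = rng.randrange(buffer_size)
            yield buf[j]
            buf[j] = item
    rng.shuffle(buf)      # drain the remainder in random order
    yield from buf
```

A worker in the `IterableDataset` wrapper would then iterate something like `reservoir_shuffle(worker_shard(len(ds), worker_id, num_workers), buffer_size=1024)`: larger buffers approach a true uniform shuffle at the cost of memory.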
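The "CHW float tensors" conversion behind `with_format("torch")` amounts to a transpose plus a dtype rescale. A minimal sketch, using numpy in place of torch (the function name is illustrative, and the [0, 1] scaling is an assumption about the pipeline's convention):

```python
import numpy as np

def to_chw_float(hwc_uint8):
    # Decoded images arrive HWC uint8; training wants CHW float.
    # Transpose axes (H, W, C) -> (C, H, W) and rescale to [0, 1].
    return np.transpose(hwc_uint8, (2, 0, 1)).astype(np.float32) / 255.0
```

The `"raw"` format simply skips this step (and the PIL decode before it), handing the encoded bytes straight to a DALI-style GPU decoder.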
Fixes # (issue)
Before submitting
Pull Request section?
documentation guidelines
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.