
LitData v0.2.51


@bhimrazy released this 29 Jul 06:23 · 66 commits to main since this release · f88a139

Lightning AI ⚡ is excited to announce the release of LitData v0.2.51

Highlights

Stream Raw Datasets from Cloud Storage (Beta)

Effortlessly stream raw files (e.g., images, text) directly from S3, GCS, or Azure cloud storage without preprocessing. Perfect for workflows needing immediate access to data in its original format.

from litdata.streaming.raw_dataset import StreamingRawDataset
from torch.utils.data import DataLoader

dataset = StreamingRawDataset("s3://bucket/files/")

# Use with PyTorch DataLoader
loader = DataLoader(dataset, batch_size=32)
for batch in loader:
    # Process raw bytes
    pass
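
Each sample arrives as the file's raw bytes, so decoding is up to you. Below is a minimal sketch that treats every file as an encoded image and decodes it with torchvision; the bytes_to_image helper is illustrative, not part of the LitData API.

import torch
from torchvision.io import decode_image

def bytes_to_image(raw: bytes) -> torch.Tensor:
    # Wrap the encoded bytes (e.g., JPEG/PNG) in a uint8 tensor and decode to CxHxW.
    buffer = torch.frombuffer(bytearray(raw), dtype=torch.uint8)
    return decode_image(buffer)

for batch in loader:
    # With PyTorch's default collate_fn, bytes items are returned as a plain list.
    images = [bytes_to_image(raw) for raw in batch]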

Benchmarks
Streaming speed for raw ImageNet (1.2M images) from cloud storage:

| Storage | Images/s (No Transform) | Images/s (With Transform) |
| --- | --- | --- |
| AWS S3 | ~6,400 ± 100 | ~3,200 ± 100 |
| Google Cloud Storage | ~5,650 ± 100 | ~3,100 ± 100 |

Note: Use StreamingRawDataset to stream data directly in its original format; opt for StreamingDataset when you need maximum speed with pre-optimized data.

Resume ParallelStreamingDataset

ParallelStreamingDataset now supports a resume option: when cycling through datasets, pass resume=True to continue from where the previous epoch stopped instead of restarting at index 0, ensuring consistent sample progression across epochs.

from litdata.streaming.parallel import ParallelStreamingDataset
from torch.utils.data import DataLoader

dataset = ParallelStreamingDataset(datasets=[dataset1, dataset2], length=100, resume=True)
loader = DataLoader(dataset, batch_size=32)
for batch in loader:
    # Resumes from previous epoch's state
    pass
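
A quick sketch of what this changes across epochs (illustrative; resume matters when the dataset cycles, as in the example above):

# With resume=True, each epoch picks up where the previous one stopped
# instead of rewinding every underlying stream to index 0.
for epoch in range(3):
    for batch in loader:
        pass  # epoch 1 continues the cycle right after epoch 0's last sample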

Per-Dataset Batch Sizes in CombinedStreamingDataset

The CombinedStreamingDataset now supports per-dataset batch sizes when using batching_method="per_stream". Specify unique batch sizes for each dataset using set_batch_size() with a list of integers. The iterator respects these limits, switching datasets once the per-stream quota is met, optimizing GPU utilization for datasets with varying tensor sizes.

from litdata.streaming.combined import CombinedStreamingDataset

dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2],
    weights=[0.5, 0.5],
    batching_method="per_stream",
    seed=123
)
dataset.set_batch_size([4, 8])  # Set batch sizes: 4 for dataset1, 8 for dataset2

for sample in dataset:
    # Iterator yields samples respecting per-dataset batch size limits
    pass
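
As a rough picture of what these quotas mean (an illustration of the behaviour described above, with the stream-selection order assumed; the actual order depends on the weights and seed):

# One possible yield pattern with batch sizes [4, 8]:
#   d1 d1 d1 d1 | d2 d2 d2 d2 d2 d2 d2 d2 | d1 d1 d1 d1 | ...
# i.e. the iterator stays on one stream until its per-stream quota is filled,
# then picks the next stream according to the weights.
for i, sample in enumerate(dataset):
    if i >= 24:  # look at the first couple of cycles
        break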

Changes

Added
  • Added support for setting cache directory via LITDATA_CACHE_DIR environment variable (#639 by @deependujha)
  • Added CLI option to clear default cache (#627 by @deependujha)
  • Added resume support to ParallelStreamingDataset (#650 by @philgzl)
  • Added verbose option to optimize_fn (#654 by @deependujha)
  • Added support for multiple transform_fn in StreamingDataset (#655 by @deependujha)
  • Enabled per-dataset batch size support in CombinedStreamingDataset (#635 by @MagellaX)
  • Added support for StreamingRawDataset to stream raw datasets from cloud storage (#652 by @bhimrazy)
  • Added GCP support for directory resolution in resolve_dir (#659 by @bhimrazy)
Changed
  • Cleaned up logic in _loop by removing hacky index assignment (#640 by @deependujha)
  • Updated CODEOWNERS (#646 by @Borda)
  • Switched to astral-sh/setup-uv for Python setup and used uv pip for package installation (#656 by @bhimrazy)
  • Replaced PIL with torchvision's decode_image for more robust JPEG deserialization (#660 by @bhimrazy)
Fixed
  • Fixed performance issue with StreamingDataLoader when using ≥5 workers on Parquet data (#616 by @bhimrazy)
  • Fixed performance bottleneck in train_test_split (#647 by @lukemerrick)
  • Fixed async handling in StreamingRawDataset (#661 by @bhimrazy)

Full Changelog: v0.2.50...v0.2.51

🧑‍💻 Contributors

We thank everyone who submitted issues, features, fixes, and doc changes. It's the only way we can collectively make LitData better for everyone. Nice job!

Key Contributors

@deependujha, @Borda, @bhimrazy, @philgzl

New Contributors

Thank you ❤️ and we hope you'll keep them coming!