# LitData v0.2.51

Lightning AI ⚡ is excited to announce the release of LitData v0.2.51.
## Highlights
### Stream Raw Datasets from Cloud Storage (Beta)
Effortlessly stream raw files (e.g., images, text) directly from S3, GCS, or Azure cloud storage without preprocessing. Perfect for workflows needing immediate access to data in its original format.
```python
from litdata.streaming.raw_dataset import StreamingRawDataset
from torch.utils.data import DataLoader

dataset = StreamingRawDataset("s3://bucket/files/")

# Use with PyTorch DataLoader
loader = DataLoader(dataset, batch_size=32)
for batch in loader:
    # Process raw bytes
    pass
```

#### Benchmarks
Streaming speed for raw ImageNet (1.2M images) from cloud storage:
| Storage | Images/s (No Transform) | Images/s (With Transform) |
|---|---|---|
| AWS S3 | ~6,400 ± 100 | ~3,200 ± 100 |
| Google Cloud Storage | ~5,650 ± 100 | ~3,100 ± 100 |
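For reference, the "With Transform" numbers correspond to decoding each file's raw bytes into a tensor on the fly. Below is a minimal sketch of such a transform, assuming `StreamingRawDataset` accepts a `transform` callable that receives the raw bytes (the parameter name is an assumption; check the docs for the exact API):

```python
import torch
from torchvision.io import decode_image
from litdata.streaming.raw_dataset import StreamingRawDataset

def decode_jpeg(data: bytes) -> torch.Tensor:
    # Decode raw JPEG bytes into a CHW uint8 image tensor
    buffer = torch.frombuffer(bytearray(data), dtype=torch.uint8)
    return decode_image(buffer)

# Assumed keyword `transform`, applied per item as files stream in
dataset = StreamingRawDataset("s3://bucket/images/", transform=decode_jpeg)
```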
> **Note:** Use `StreamingRawDataset` for direct data streaming. Opt for `StreamingDataset` for maximum speed with pre-optimized data.
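For context, the pre-optimized path mentioned in the note converts the data once and then streams the optimized chunks. A minimal sketch, assuming placeholder paths and a trivial sample function (not the benchmark setup used above):

```python
import litdata as ld

def to_sample(index: int) -> dict:
    # Placeholder: load and serialize one sample of your raw data here
    return {"index": index}

if __name__ == "__main__":
    # One-time conversion into LitData's optimized chunk format
    ld.optimize(
        fn=to_sample,
        inputs=list(range(1_000)),
        output_dir="s3://bucket/optimized/",  # placeholder output location
        chunk_bytes="64MB",
    )

    # Stream the optimized dataset for maximum throughput
    dataset = ld.StreamingDataset("s3://bucket/optimized/")
    loader = ld.StreamingDataLoader(dataset, batch_size=32)
```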
### Resume ParallelStreamingDataset
`ParallelStreamingDataset` now supports a `resume` option, so cycling datasets continue from the previous epoch's state instead of restarting at index 0. Enable it with `resume=True` to keep sample progression consistent across epochs.
```python
from litdata.streaming.parallel import ParallelStreamingDataset
from torch.utils.data import DataLoader

dataset = ParallelStreamingDataset(datasets=[dataset1, dataset2], length=100, resume=True)

loader = DataLoader(dataset, batch_size=32)
for batch in loader:
    # Resumes from previous epoch's state
    pass
```

### Per-Dataset Batch Sizes in CombinedStreamingDataset
`CombinedStreamingDataset` now supports per-dataset batch sizes when using `batching_method="per_stream"`. Specify unique batch sizes for each dataset using `set_batch_size()` with a list of integers. The iterator respects these limits, switching datasets once the per-stream quota is met, optimizing GPU utilization for datasets with varying tensor sizes.
```python
from litdata.streaming.combined import CombinedStreamingDataset

dataset = CombinedStreamingDataset(
    datasets=[dataset1, dataset2],
    weights=[0.5, 0.5],
    batching_method="per_stream",
    seed=123,
)
dataset.set_batch_size([4, 8])  # Set batch sizes: 4 for dataset1, 8 for dataset2

for sample in dataset:
    # Iterator yields samples respecting per-dataset batch size limits
    pass
```

## Changes
### Added
- Added support for setting cache directory via `LITDATA_CACHE_DIR` environment variable (#639 by @deependujha); see the sketch after this list
- Added CLI option to clear default cache (#627 by @deependujha)
- Added resume support to `ParallelStreamingDataset` (#650 by @philgzl)
- Added `verbose` option to `optimize_fn` (#654 by @deependujha)
- Added support for multiple `transform_fn` in `StreamingDataset` (#655 by @deependujha)
- Enabled per-dataset batch size support in `CombinedStreamingDataset` (#635 by @MagellaX)
- Added support for `StreamingRawDataset` to stream raw datasets from cloud storage (#652 by @bhimrazy)
- Added GCP support for directory resolution in `resolve_dir` (#659 by @bhimrazy)
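As referenced above, here is a minimal sketch of the new cache-directory override; the path is a placeholder, and the assumption is that the variable must be set before LitData resolves its cache location:

```python
import os

# Set before any LitData cache is resolved (placeholder path)
os.environ["LITDATA_CACHE_DIR"] = "/tmp/litdata-cache"

from litdata import StreamingDataset

# Downloaded chunks should now be cached under /tmp/litdata-cache
dataset = StreamingDataset("s3://bucket/optimized/")
```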
### Changed
- Cleaned up logic in `_loop` by removing hacky index assignment (#640 by @deependujha)
- Updated CODEOWNERS (#646 by @Borda)
- Switched to `astral-sh/setup-uv` for Python setup and used `uv pip` for package installation (#656 by @bhimrazy)
- Replaced PIL with torchvision's `decode_image` for more robust JPEG deserialization (#660 by @bhimrazy)
### Fixed

### Chores
- Bumped `cryptography` from 42.0.8 to 45.0.4 (#644 by @dependabot[bot])
- Updated `numpy` requirement from <2.0 to <3.0 (#645 by @dependabot[bot])
- Bumped `pytest-timeout` from 2.3.1 to 2.4.0 (#643 by @dependabot[bot])
- Applied pre-commit suggestions & bumped Python to 3.9 (#653 by @pre-commit-ci[bot])
- Bumped `actions/first-interaction` from 1 to 2 in GitHub Actions updates (#657 by @dependabot[bot])
- Bumped version to 0.2.51 (#664 by @bhimrazy)
Full Changelog: v0.2.50...v0.2.51
## 🧑‍💻 Contributors
We thank everyone who submitted issues, features, fixes, and doc changes. It's the only way we can collectively make LitData better for everyone. Nice job!
### Key Contributors
@deependujha, @Borda, @bhimrazy, @philgzl
### New Contributors
- @lukemerrick made their first contribution in #647
- @MagellaX made their first contribution in #635
Thank you ❤️ and we hope you'll keep them coming!