Transfer learning, tuning batch size, torchelastic support
Overview
Highlights of this release include support for TorchElastic, which enables distributed PyTorch training jobs to be executed in a fault-tolerant and elastic manner; auto-scaling of the batch size; a new transfer learning example; and an option to provide a seed to random generators to ensure reproducibility.
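Two of these highlights, seeding and batch-size auto-scaling, can be combined in a single run. Below is a minimal sketch assuming the seeding helper is exposed as `pytorch_lightning.seed_everything` and the Trainer flag is named `auto_scale_batch_size` (both introduced in this release); `RandomDataset` and `TinyModel` are illustrative stand-ins, not part of the library.

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Illustrative dataset of random binary-classification samples."""

    def __len__(self):
        return 256

    def __getitem__(self, idx):
        return torch.randn(32), torch.randint(0, 2, (1,)).float()


class TinyModel(pl.LightningModule):
    """Illustrative LightningModule exposing `batch_size` so auto scaling can tune it."""

    def __init__(self, batch_size=16):
        super().__init__()
        self.batch_size = batch_size  # adjusted in place when batch-size scaling runs
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.binary_cross_entropy_with_logits(self(x), y)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        return DataLoader(RandomDataset(), batch_size=self.batch_size)


pl.seed_everything(42)  # seed Python, NumPy and PyTorch generators for reproducibility
trainer = pl.Trainer(max_epochs=1, auto_scale_batch_size=True)
trainer.fit(TinyModel())  # the batch size is tuned before the actual training loop
```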
Detailed changes
Added
- Added callback for logging learning rates (#1498)
- Added transfer learning example (for a binary classification task in computer vision) (#1564)
- Added type hints in `Trainer.fit()` and `Trainer.test()` to reflect that a list of dataloaders can also be passed in (#1723); see the sketch after this list
- Added auto scaling of batch size (#1638)
- The progress bar metrics now also get updated in `training_epoch_end` (#1724)
- Enable `NeptuneLogger` to work with `distributed_backend=ddp` (#1753)
- Added option to provide seed to random generators to ensure reproducibility (#1572)
- Added override for hparams in `load_from_ckpt` (#1797)
- Added support for multi-node distributed execution under `torchelastic` (#1811, #1818)
- Added using `store_true` for bool args (#1822, #1842)
- Added dummy logger for internally disabling logging for some features (#1836)
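Below is a hedged sketch of two entries from the list above: the learning-rate logging callback (#1498) and passing a list of validation dataloaders to `Trainer.fit()` (#1723). It assumes the callback is importable as `pytorch_lightning.callbacks.LearningRateLogger` (renamed `LearningRateMonitor` in later releases); `MyModel`, `train_dl`, `val_dl_a`, and `val_dl_b` are hypothetical placeholders.

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateLogger

# `MyModel` is a hypothetical LightningModule that configures an LR scheduler,
# so the callback has learning rates to log; the dataloaders are placeholders.
model = MyModel()
trainer = pl.Trainer(max_epochs=5, callbacks=[LearningRateLogger()])

# Multiple validation dataloaders can be passed as a list.
trainer.fit(model, train_dataloader=train_dl, val_dataloaders=[val_dl_a, val_dl_b])
```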
Changed
- Enable `non-blocking` for device transfers to GPU (#1843)
- Replace `meta_tags.csv` with `hparams.yaml` (#1271)
- Reduction when `batch_size < num_gpus` (#1609)
- Updated LightningTemplateModel to look more like Colab example (#1577)
- Don't convert `namedtuple` to `tuple` when transferring the batch to the target device (#1589)
- Allow passing `hparams` as a keyword argument to LightningModule when loading from checkpoint (#1639); see the sketch after this list
- Args should come after the last positional argument (#1807)
- Made DDP the default if no backend specified with multiple GPUs (#1789)
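As referenced in the list above, here is a hedged sketch of passing `hparams` as a keyword argument when loading from a checkpoint; `MyModel`, its `hparams`-accepting constructor, and the checkpoint path are illustrative assumptions.

```python
from argparse import Namespace

# Hypothetical LightningModule `MyModel` whose __init__ accepts `hparams`;
# the keyword argument overrides the hyperparameters stored with the checkpoint.
overrides = Namespace(learning_rate=1e-4, batch_size=64)
model = MyModel.load_from_checkpoint("path/to/checkpoint.ckpt", hparams=overrides)
```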
Deprecated
- Deprecated `tags_csv` in favor of `hparams_file` (#1271)
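A migration sketch under the assumption that `hparams_file` accepts the path to the new `hparams.yaml` written alongside checkpoints; `MyModel` and the paths shown are illustrative.

```python
# Before (now deprecated):
# model = MyModel.load_from_checkpoint(ckpt_path, tags_csv="meta_tags.csv")

# After: point `hparams_file` at the YAML file that replaces meta_tags.csv.
ckpt_path = "lightning_logs/version_0/checkpoints/epoch=4.ckpt"
model = MyModel.load_from_checkpoint(
    ckpt_path,
    hparams_file="lightning_logs/version_0/hparams.yaml",
)
```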
Fixed
- Fixed broken link in PR template (#1675)
- Fixed ModelCheckpoint not checking `filepath` for None (#1654)
- Trainer now calls `on_load_checkpoint()` when resuming from a checkpoint (#1666)
- Fixed sampler logic for DDP with an iterable dataset (#1734)
- Fixed `_reset_eval_dataloader()` for IterableDataset (#1560)
- Fixed Horovod distributed backend to set the `root_gpu` property (#1669)
- Fixed wandb logger `global_step` affecting other loggers (#1492)
- Fixed disabling progress bar on non-zero ranks using Horovod backend (#1709)
- Fixed bugs that prevented the LR finder from being used together with early stopping and validation dataloaders (#1676)
- Fixed a bug in Trainer that prepended the checkpoint path with `version_` when it shouldn't (#1748)
- Fixed LR key name in case of param groups in LearningRateLogger (#1719)
- Fixed saving native AMP scaler state (introduced in #1561)
- Fixed accumulation parameter and suggestion method for learning rate finder (#1801)
- Fixed num processes not being set properly and the auto sampler failing with DDP (#1819)
- Fixed bugs in semantic segmentation example (#1824)
- Fixed saving native AMP scaler state (#1561, #1777)
- Fixed native AMP + DDP (#1788)
- Fixed `hparam` logging with metrics (#1647)
Contributors
@ashwinb, @awaelchli, @Borda, @cmpute, @festeh, @jbschiratti, @justusschock, @kepler, @kumuji, @nanddalal, @nathanbreitsch, @olineumann, @pitercl, @rohitgr7, @S-aiueo32, @SkafteNicki, @tgaddair, @tullie, @tw991, @williamFalcon, @ybrovman, @yukw777
If we forgot someone due to not matching commit email with GitHub account, let us know :]