Transfer learning, tuning batch size, torchelastic support
Overview
Highlights of this release include support for TorchElastic, which enables distributed PyTorch training jobs to be executed in a fault-tolerant and elastic manner; auto-scaling of the batch size; a new transfer learning example; and an option to provide a seed to random generators to ensure reproducibility.
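Two of these highlights, seeding and batch-size auto-scaling, can be combined in a single run. Below is a minimal sketch assuming the seeding helper is exposed as `pytorch_lightning.seed_everything` and the Trainer flag is named `auto_scale_batch_size` (both introduced in this release); `RandomDataset` and `TinyModel` are illustrative stand-ins, not part of the library.

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    """Illustrative dataset of random binary-classification samples."""

    def __len__(self):
        return 256

    def __getitem__(self, idx):
        return torch.randn(32), torch.randint(0, 2, (1,)).float()


class TinyModel(pl.LightningModule):
    """Illustrative LightningModule exposing `batch_size` so auto scaling can tune it."""

    def __init__(self, batch_size=16):
        super().__init__()
        self.batch_size = batch_size  # adjusted in place when batch-size scaling runs
        self.layer = torch.nn.Linear(32, 1)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.binary_cross_entropy_with_logits(self(x), y)
        return {"loss": loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        return DataLoader(RandomDataset(), batch_size=self.batch_size)


pl.seed_everything(42)  # seed Python, NumPy and PyTorch generators for reproducibility
trainer = pl.Trainer(max_epochs=1, auto_scale_batch_size=True)
trainer.fit(TinyModel())  # the batch size is tuned before the actual training loop
```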
Detailed changes
Added
- Added callback for logging learning rates (#1498)
- Added transfer learning example (for a binary classification task in computer vision) (#1564)
- Added type hints in `Trainer.fit()` and `Trainer.test()` to reflect that a list of dataloaders can also be passed in (#1723); see the sketch after this list
- Added auto scaling of batch size (#1638)
- The progress bar metrics now also get updated in `training_epoch_end` (#1724)
- Enable `NeptuneLogger` to work with `distributed_backend=ddp` (#1753)
- Added option to provide seed to random generators to ensure reproducibility (#1572)
- Added override for hparams in `load_from_ckpt` (#1797)
- Added support for multi-node distributed execution under `torchelastic` (#1811, #1818)
- Added using `store_true` for bool args (#1822, #1842)
- Added dummy logger for internally disabling logging for some features (#1836)
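Below is a hedged sketch of two entries from the list above: the learning-rate logging callback (#1498) and passing a list of validation dataloaders to `Trainer.fit()` (#1723). It assumes the callback is importable as `pytorch_lightning.callbacks.LearningRateLogger` (renamed `LearningRateMonitor` in later releases); `MyModel`, `train_dl`, `val_dl_a`, and `val_dl_b` are hypothetical placeholders.

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateLogger

# `MyModel` is a hypothetical LightningModule that configures an LR scheduler,
# so the callback has learning rates to log; the dataloaders are placeholders.
model = MyModel()
trainer = pl.Trainer(max_epochs=5, callbacks=[LearningRateLogger()])

# Multiple validation dataloaders can be passed as a list.
trainer.fit(model, train_dataloader=train_dl, val_dataloaders=[val_dl_a, val_dl_b])
```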
Changed
- Enable `non-blocking` for device transfers to GPU (#1843)
- Replace `meta_tags.csv` with `hparams.yaml` (#1271)
- Reduction when `batch_size < num_gpus` (#1609)
- Updated LightningTemplateModel to look more like Colab example (#1577)
- Don't convert `namedtuple` to `tuple` when transferring the batch to the target device (#1589)
- Allow passing `hparams` as a keyword argument to LightningModule when loading from checkpoint (#1639); see the sketch after this list
- Args should come after the last positional argument (#1807)
- Made DDP the default if no backend specified with multiple GPUs (#1789)
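As referenced in the list above, here is a hedged sketch of passing `hparams` as a keyword argument when loading from a checkpoint; `MyModel`, its `hparams`-accepting constructor, and the checkpoint path are illustrative assumptions.

```python
from argparse import Namespace

# Hypothetical LightningModule `MyModel` whose __init__ accepts `hparams`;
# the keyword argument overrides the hyperparameters stored with the checkpoint.
overrides = Namespace(learning_rate=1e-4, batch_size=64)
model = MyModel.load_from_checkpoint("path/to/checkpoint.ckpt", hparams=overrides)
```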
Deprecated
- Deprecated `tags_csv` in favor of `hparams_file` (#1271)
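A migration sketch under the assumption that `hparams_file` accepts the path to the new `hparams.yaml` written alongside checkpoints; `MyModel` and the paths shown are illustrative.

```python
# Before (now deprecated):
# model = MyModel.load_from_checkpoint(ckpt_path, tags_csv="meta_tags.csv")

# After: point `hparams_file` at the YAML file that replaces meta_tags.csv.
ckpt_path = "lightning_logs/version_0/checkpoints/epoch=4.ckpt"
model = MyModel.load_from_checkpoint(
    ckpt_path,
    hparams_file="lightning_logs/version_0/hparams.yaml",
)
```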
Fixed
- Fixed broken link in PR template (#1675)
- Fixed ModelCheckpoint not checking `filepath` for None (#1654)
- Trainer now calls `on_load_checkpoint()` when resuming from a checkpoint (#1666)
- Fixed sampler logic for DDP with an iterable dataset (#1734)
- Fixed `_reset_eval_dataloader()` for IterableDataset (#1560)
- Fixed Horovod distributed backend to set the `root_gpu` property (#1669)
- Fixed wandb logger `global_step` affecting other loggers (#1492)
- Fixed disabling progress bar on non-zero ranks using Horovod backend (#1709)
- Fixed bugs that prevented the LR finder from being used together with early stopping and validation dataloaders (#1676)
- Fixed a bug in Trainer that prepended the checkpoint path with `version_` when it shouldn't (#1748)
- Fixed LR key name in case of param groups in LearningRateLogger (#1719)
- Fixed saving native AMP scaler state (introduced in #1561)
- Fixed accumulation parameter and suggestion method for learning rate finder (#1801)
- Fixed num processes not being set properly and the auto sampler failing with DDP (#1819)
- Fixed bugs in semantic segmentation example (#1824)
- Fixed saving native AMP scaler state (#1561, #1777)
- Fixed native AMP + DDP (#1788)
- Fixed `hparam` logging with metrics (#1647)
Contributors
@ashwinb, @awaelchli, @Borda, @cmpute, @festeh, @jbschiratti, @justusschock, @kepler, @kumuji, @nanddalal, @nathanbreitsch, @olineumann, @pitercl, @rohitgr7, @S-aiueo32, @SkafteNicki, @tgaddair, @tullie, @tw991, @williamFalcon, @ybrovman, @yukw777
If we forgot someone due to not matching commit email with GitHub account, let us know :]