TPU support & profiling
Overview
This is the first joint release between pytorch-bearer and Lightning; here we come ...
This release adds support for training models on Tensor Processing Units (TPU). Models can now be trained on GPUs and TPUs by changing a single parameter in Trainer (see docs). We are also bringing the flexibility of Bearer into Lightning by allowing arbitrary user-defined callbacks (see docs).
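A minimal sketch of both features is below. The `num_tpu_cores` argument, the `Callback` base class, and the hook names are drawn from the docs accompanying this release and should be read as assumptions rather than part of this note; `MyLightningModule` is a hypothetical model.

```python
# Minimal sketch (not the canonical API): `num_tpu_cores`, `Callback`, and the
# hook names below are assumptions based on the docs for this release.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import Callback


class MyPrintingCallback(Callback):
    """A user-defined callback: each hook fires at the corresponding Trainer event."""

    def on_init_end(self, trainer):
        print("Trainer is initialized")

    def on_train_end(self, trainer, pl_module):
        print("Training is done")


model = MyLightningModule()  # hypothetical LightningModule defined elsewhere

# Same training script on different hardware: only the Trainer flag changes.
trainer = pl.Trainer(gpus=1, callbacks=[MyPrintingCallback()])             # GPU
# trainer = pl.Trainer(num_tpu_cores=8, callbacks=[MyPrintingCallback()])  # TPU
trainer.fit(model)
```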
We are also including a profiler that allows Lightning users to identify training bottlenecks (see docs).
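A quick sketch of enabling it; the `profiler` Trainer flag and the `AdvancedProfiler` class are assumptions based on the profiling docs, not spelled out in this note.

```python
# Sketch: `profiler=True` and AdvancedProfiler are assumed from the docs.
import pytorch_lightning as pl
from pytorch_lightning.profiler import AdvancedProfiler

# Simple report: prints time spent in each training hook when fit() completes.
trainer = pl.Trainer(profiler=True)

# More detailed, per-function output (assumed to wrap cProfile per the docs).
trainer = pl.Trainer(profiler=AdvancedProfiler())
```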
This release also includes automatic sampler setup: depending on the selected backend, Lightning configures the sampler correctly, with no user input needed.
The loggers have also been extended to support passing multiple concurrent loggers to Trainer as an iterable (see docs), and support for step-based learning rate scheduling has been added.
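A sketch of both additions; the logger constructor arguments and the scheduler-dict keys below are assumptions based on the docs, not guaranteed by this note.

```python
# Sketch: constructor arguments and the scheduler-dict keys are assumptions.
import torch
import pytorch_lightning as pl
from pytorch_lightning.loggers import TensorBoardLogger

# Multiple concurrent loggers: pass any iterable of loggers to the Trainer.
loggers = [
    TensorBoardLogger(save_dir="logs/", name="run_a"),
    TensorBoardLogger(save_dir="logs/", name="run_b"),
]
trainer = pl.Trainer(logger=loggers)


# Step-based LR scheduling: return the scheduler as a dict and mark it to be
# stepped every batch ("step") instead of every epoch.
class LitModel(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000)
        return [optimizer], [{"scheduler": scheduler, "interval": "step"}]
```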
Finally, this release includes lots of bug fixes (see below).
Detail changes
Added
- Added automatic sampler setup. Depending on DDP or TPU, Lightning configures the sampler correctly (no user action needed) (#926)
- Added `reload_dataloaders_every_epoch=False` flag for trainer. Some users require reloading data every epoch (#926)
- Added `progress_bar_refresh_rate=50` flag for trainer. Controls the progress bar refresh rate in notebooks (#926)
- Updated governance docs
- Added a check to ensure that the metric used for early stopping exists before training commences (#542)
- Added `optimizer_idx` argument to `backward` hook (#733)
- Added `entity` argument to `WandbLogger` to be passed to `wandb.init` (#783)
- Added a tool for profiling training runs (#782)
- Improved flexibility for naming of TensorBoard logs: `version` can now be set to a `str` to save to that directory, and `name=''` prevents the experiment-name directory (#804)
- Added option to specify `step` key when logging metrics (#808)
- Added `train_dataloader`, `val_dataloader` and `test_dataloader` arguments to `Trainer.fit()`, for alternative data parsing (#759)
- Added Tensor Processing Unit (TPU) support (#868)
- Added semantic segmentation example (#751, #876, #881)
- Split callbacks into multiple files (#849)
- Added support for user-defined callbacks (#889, #950)
- Added support for multiple loggers to be passed to `Trainer` as an iterable (e.g. list or tuple) (#903)
- Added support for step-based learning rate scheduling (#941)
- Added support for logging hparams as `dict` (#1029)
- Checkpoint and early stopping now work without a validation step (#1041)
- Added support for graceful training cleanup after a keyboard interrupt (#856, #1019)
- Added type hints for function arguments (#912)
- Added default `argparser` for `Trainer` (#952, #1023)
- Added TPU gradient clipping (#963)
- Added max/min number of steps in `Trainer` (#728)
Changed
- Changed default TQDM to use `tqdm.auto` for prettier outputs in IPython notebooks (#752)
- Changed `pytorch_lightning.logging` to `pytorch_lightning.loggers` (#767)
- Moved the default `tqdm_dict` definition from `Trainer` to `LightningModule`, so it can be overridden by the user (#749)
- Moved functionality of `LightningModule.load_from_metrics` into `LightningModule.load_from_checkpoint` (#995)
- Changed checkpoint path parameter from `filepath` to `dirpath` (#1016)
- Froze the model's `hparams` as a `Namespace` property (#1029)
- Dropped `logging` config in package init (#1015)
- Renamed model steps (#1051):
  - `training_end` >> `training_epoch_end`
  - `validation_end` >> `validation_epoch_end`
  - `test_end` >> `test_epoch_end`
- Refactored dataloading to support infinite dataloaders (#955)
- `TensorBoardLogger` now creates a single log file (#777)
Deprecated
- Deprecated `pytorch_lightning.logging` (#767)
- Deprecated `LightningModule.load_from_metrics` in favour of `LightningModule.load_from_checkpoint` (#995, #1079)
- Deprecated the `@data_loader` decorator (#926)
- Deprecated model steps `training_end`, `validation_end` and `test_end` (#1051, #1056)
Removed
- Removed dependency on `pandas` (#736)
- Removed dependency on `torchvision` (#797)
- Removed dependency on `scikit-learn` (#801)
Fixed
- Fixed a bug where early stopping `on_end_epoch` would be called inconsistently when `check_val_every_n_epoch == 0` (#743)
- Fixed a bug where the model checkpoint didn't write to the same directory as the logger (#771)
- Fixed a bug where the `TensorBoardLogger` class would create an additional empty log file during fitting (#777)
- Fixed a bug where `global_step` was advanced incorrectly when using `accumulate_grad_batches > 1` (#832)
- Fixed a bug when calling `self.logger.experiment` with multiple loggers (#1009)
- Fixed a bug when calling `logger.append_tags` on a `NeptuneLogger` with a single tag (#1009)
- Fixed sending back data from `.spawn` by saving and loading the trained model in/out of the process (#1017)
- Fixed port collision on DDP (#1010)
- Fixed/tested pass overrides (#918)
- Fixed Comet logger to log after train (#892)
- Removed deprecated args from the learning rate step function (#890)
Contributors
@airglow, @akshaykvnit, @AljoSt, @AntixK, @awaelchli, @baeseongsu, @bobkemp, @Borda, @calclavia, @Calysto, @djbyrne, @ethanwharris, @fdelrio89, @hadim, @hanbyul-kim, @jeremyjordan, @kuynzereb, @luiscape, @MattPainter01, @neggert, @onkyo14taro, @peteriz, @shoarora, @SkafteNicki, @smallzzy, @srush, @theevann, @tullie, @williamFalcon, @xeTaiz, @xssChauhan, @yukw777
If we forgot someone due to not matching commit email with GitHub account, let us know :]