DDP and Checkpoint bug fixes

Pre-release

@williamFalcon williamFalcon released this 29 Jun 02:09
8f07b77

Overview

As we continue to strengthen the codebase with more tests, we’re finally getting rid of annoying bugs that have been around for a while, mostly around inconsistent checkpoint and early stopping behaviour (amazing work @awaelchli @jeremyjordan).

Noteworthy changes:

  • Fixed TPU flag parsing
  • Fixed the average_precision metric
  • All the checkpoint issues should be gone now (including backward support for old checkpoints)
  • DDP + loggers should be fixed

Detailed changes

Added

  • Added TorchText support for moving data to GPU (#2379)

Changed

  • Changed epoch indexing to start from 0 instead of 1 (#2289)
  • Refactored Model backward (#2276)
  • Refactored training_batch + tests to verify correctness (#2327, #2328)
  • Refactored training loop (#2336)
  • Made optimization steps for hooks (#2363)
  • Changed default apex level to 'O2' (#2362)
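The epoch indexing change above can affect user code that triggers periodic actions. A minimal sketch in plain Python, assuming a hypothetical helper `should_run` (not part of the library) that fires at the end of every N-th epoch when epochs are counted from 0:

```python
def should_run(epoch: int, every_n_epochs: int) -> bool:
    """Hypothetical helper: `epoch` is zero-based.

    Returns True on the last epoch of each block of `every_n_epochs`,
    so the "+ 1" converts the zero-based index into a count of completed epochs.
    """
    return (epoch + 1) % every_n_epochs == 0


max_epochs = 6
# With zero-based indexing the loop runs over 0, 1, ..., max_epochs - 1.
fired = [epoch for epoch in range(max_epochs) if should_run(epoch, every_n_epochs=2)]
print(fired)
```

Code written against one-based epochs would typically drop the `+ 1`, so any such check is worth reviewing after upgrading.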

Removed

  • Moved TrainsLogger to Bolts (#2384)

Fixed

  • Fixed parsing TPU arguments and TPU tests (#2094)
  • Fixed number of batches in case of multiple dataloaders and limit_{*}_batches (#1920, #2226)
  • Fixed an issue with forward hooks not being removed after model summary (#2298)
  • Fixed load_from_checkpoint() not working with absolute path on Windows (#2294)
  • Fixed an issue with how _has_len handles NotImplementedError, e.g. raised by torchtext.data.Iterator (#2293, #2307)
  • Fixed average_precision metric (#2319)
  • Fixed ROC metric for CUDA tensors (#2304)
  • Fixed lost compatibility with custom datatypes implementing .to (#2335)
  • Fixed loading model with kwargs (#2387)
  • Fixed sum(0) for trainer.num_val_batches (#2268)
  • Fixed checking if the parameters are a DictConfig Object (#2216)
  • Fixed SLURM weights saving (#2341)
  • Fixed the LR scheduler order (#2356)
  • Fixed adding tensorboard hparams logging test (#2342)
  • Fixed use model ref for tear down (#2360)
  • Fixed logger crash on DDP (#2388)
  • Fixed several issues with early stopping and checkpoint callbacks (#1504, #2391)
  • Fixed loading past checkpoints from v0.7.x (#2405)
  • Fixed loading model without arguments (#2403)

Contributors

@airium, @awaelchli, @Borda, @elias-ramzi, @jeremyjordan, @lezwon, @mateuszpieniak, @mmiakashs, @pwl, @rohitgr7, @ssakhavi, @thschaaf, @tridao, @williamFalcon

If we forgot someone due to not matching commit email with GitHub account, let us know :]