DDP and Checkpoint bug fixes
Pre-release
Overview
As we continue to strengthen the codebase with more tests, we're finally getting rid of annoying bugs that have been around for a while, mostly around inconsistent checkpoint and early stopping behaviour (amazing work @awaelchli @jeremyjordan).
Noteworthy changes:
- Fixed TPU flag parsing
- Fixed average_precision metric
- All the checkpoint issues should be gone now (including backward compatibility with old checkpoints)
- DDP + loggers should be fixed
Detailed changes
Added
- Added TorchText support for moving data to GPU (#2379)
Changed
- Changed epoch indexing to start from 0 instead of 1 (#2289)
- Refactor Model backward (#2276)
- Refactored training_batch + tests to verify correctness (#2327, #2328)
- Refactored training loop (#2336)
- Made optimization steps for hooks (#2363)
- Changed default apex level to 'O2' (#2362)
Removed
- Moved TrainsLogger to Bolts (#2384)
Fixed
- Fixed parsing TPU arguments and TPU tests (#2094)
- Fixed number of batches in case of multiple dataloaders and limit_{*}_batches (#1920, #2226)
- Fixed an issue with forward hooks not being removed after model summary (#2298)
- Fixed load_from_checkpoint() not working with absolute path on Windows (#2294)
- Fixed an issue with how _has_len handles NotImplementedError, e.g. raised by torchtext.data.Iterator (#2293, #2307)
- Fixed average_precision metric (#2319)
- Fixed ROC metric for CUDA tensors (#2304)
- Fixed lost compatibility with custom datatypes implementing .to (#2335)
- Fixed loading model with kwargs (#2387)
- Fixed sum(0) for trainer.num_val_batches (#2268)
- Fixed checking if the parameters are a DictConfig Object (#2216)
- Fixed SLURM weights saving (#2341)
- Fixed swaps LR scheduler order (#2356)
- Fixed adding tensorboard hparams logging test (#2342)
- Fixed use model ref for tear down (#2360)
- Fixed logger crash on DDP (#2388)
- Fixed several issues with early stopping and checkpoint callbacks (#1504, #2391)
- Fixed loading past checkpoints from v0.7.x (#2405)
- Fixed loading model without arguments (#2403)
Contributors
@airium, @awaelchli, @Borda, @elias-ramzi, @jeremyjordan, @lezwon, @mateuszpieniak, @mmiakashs, @pwl, @rohitgr7, @ssakhavi, @thschaaf, @tridao, @williamFalcon
If we forgot someone due to not matching commit email with GitHub account, let us know :]