TPU Pod Training, IPU Accelerator, DeepSpeed Infinity, Fully Sharded Data Parallel
Today we are excited to announce Lightning 1.4, introducing support for TPU pods, XLA profiling, IPUs, and new plugins to reach 10+ billion parameters, including DeepSpeed Infinity, Fully Sharded Data Parallel, and more!
https://devblog.pytorchlightning.ai/announcing-lightning-1-4-8cd20482aee9
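For orientation before the full changelog, here is a minimal sketch of how the headline features are selected through the `Trainer`. The `DeepSpeedPlugin(stage=3)` arguments and the `ddp_fully_sharded` plugin alias are assumptions inferred from the entries below, not an authoritative recipe.

```python
# Hedged sketch only -- flags/aliases are assumptions based on this changelog
# (accelerator/devices #7808/#8440, DeepSpeed Infinity #7234, ddp_fully_sharded #7487).
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DeepSpeedPlugin

# New accelerator selection plus the new `devices` flag
trainer = Trainer(accelerator="auto", devices=8)

# DeepSpeed ZeRO Stage 3 / Infinity -- assuming `DeepSpeedPlugin(stage=3)` is accepted here
trainer = Trainer(gpus=4, precision=16, plugins=DeepSpeedPlugin(stage=3))

# Fully Sharded Data Parallel -- assuming the `ddp_fully_sharded` plugin alias
trainer = Trainer(gpus=4, precision=16, plugins="ddp_fully_sharded")
```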
[1.4.0] - 2021-07-27
Added
- Added `extract_batch_size` utility and corresponding tests to extract batch dimension from multiple batch types (#8357)
- Added support for named parameter groups in `LearningRateMonitor` (#7987)
- Added `dataclass` support for `pytorch_lightning.utilities.apply_to_collection` (#7935)
- Added support to `LightningModule.to_torchscript` for saving to custom filesystems with `fsspec` (#7617)
- Added `KubeflowEnvironment` for use with the `PyTorchJob` operator in Kubeflow
- Added LightningCLI support for config files on object stores (#7521)
- Added `ModelPruning(prune_on_train_epoch_end=True|False)` to choose when to apply pruning (#7704)
- Added support for checkpointing based on a provided time interval during training (#7515)
- Progress tracking
- Added support for passing a `LightningDataModule` positionally as the second argument to `trainer.{validate,test,predict}` (#7431)
- Added argument `trainer.predict(ckpt_path)` (#7430)
- Added `clip_grad_by_value` support for TPUs (#7025)
- Added support for passing any class to `is_overridden` (#7918)
- Added `sub_dir` parameter to `TensorBoardLogger` (#6195)
- Added correct `dataloader_idx` to batch transfer hooks (#6241)
- Added `include_none=bool` argument to `apply_to_collection` (#7769)
- Added `apply_to_collections` to apply a function to two zipped collections (#7769)
- Added `ddp_fully_sharded` support (#7487)
- Added `should_rank_save_checkpoint` property to Training Plugins (#7684)
- Added `log_grad_norm` hook to `LightningModule` to customize the logging of gradient norms (#7873)
- Added `save_config_filename` init argument to `LightningCLI` to ease resolving name conflicts (#7741)
- Added `save_config_overwrite` init argument to `LightningCLI` to ease overwriting existing config files (#8059)
- Added reset dataloader hooks to Training Plugins and Accelerators (#7861)
- Added trainer stage hooks for Training Plugins and Accelerators (#7864)
- Added the `on_before_optimizer_step` hook (#8048)
- Added IPU Accelerator (#7867)
- Fault-tolerant training
  - Added `{,load_}state_dict` to `ResultCollection` (#7948)
  - Added `{,load_}state_dict` to `Loops` (#8197)
  - Set `Loop.restarting=False` at the end of the first iteration (#8362)
  - Save the loops state with the checkpoint (opt-in) (#8362)
  - Save a checkpoint to restore the state on exception (opt-in) (#8362)
  - Added `state_dict` and `load_state_dict` utilities for `CombinedLoader` + utilities for dataloader (#8364)
- Added `rank_zero_only` to `LightningModule.log` function (#7966)
- Added `metric_attribute` to `LightningModule.log` function (#7966)
- Added a warning if `Trainer(log_every_n_steps)` is set to a value too high for the training dataloader (#7734)
- Added LightningCLI support for argument links applied on instantiation (#7895)
- Added LightningCLI support for configurable callbacks that should always be present (#7964)
- Added DeepSpeed Infinity Support, and updated to DeepSpeed 0.4.0 (#7234)
- Added support for `torch.nn.UninitializedParameter` in `ModelSummary` (#7642)
- Added support for `LightningModule.save_hyperparameters` when `LightningModule` is a dataclass (#7992)
- Added support for overriding `optimizer_zero_grad` and `optimizer_step` when using accumulate_grad_batches (#7980)
- Added `logger` boolean flag to `save_hyperparameters` (#7960)
- Added support for calling scripts using the module syntax (`python -m package.script`) (#8073)
- Added support for optimizers and learning rate schedulers to `LightningCLI` (#8093)
- Added XLA Profiler (#8014)
- Added `PrecisionPlugin.{pre,post}_backward` (#8328)
- Added `on_load_checkpoint` and `on_save_checkpoint` hooks to the `PrecisionPlugin` base class (#7831)
- Added `max_depth` parameter in `ModelSummary` (#8062)
- Added `XLAStatsMonitor` callback (#8235)
- Added `restore` function and `restarting` attribute to base `Loop` (#8247)
- Added `FastForwardSampler` and `CaptureIterableDataset` (#8307)
- Added support for `save_hyperparameters` in `LightningDataModule` (#3792)
- Added `ModelCheckpoint(save_on_train_epoch_end)` to choose when to run the saving logic (#8389)
- Added `LSFEnvironment` for distributed training with the LSF resource manager `jsrun` (#5102)
- Added support for `accelerator='cpu'|'gpu'|'tpu'|'ipu'|'auto'` (#7808)
- Added `tpu_spawn_debug` to plugin registry (#7933)
- Enabled traditional/manual launching of DDP processes through `LOCAL_RANK` and `NODE_RANK` environment variable assignments (#7480)
- Added `quantize_on_fit_end` argument to `QuantizationAwareTraining` (#8464)
- Added experimental support for loop specialization (#8226)
- Added support for `devices` flag to Trainer (#8440) (see the sketch after this list)
- Added private `prevent_trainer_and_dataloaders_deepcopy` context manager on the `LightningModule` (#8472)
- Added support for providing callables to the Lightning CLI instead of types (#8400)
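A hedged sketch of a few of the user-facing additions above: `save_hyperparameters` in a `LightningDataModule`, the positional datamodule argument, the new `ckpt_path` argument to `trainer.predict`, and the `accelerator`/`devices` flags. `TinyModel` and `RandomDataModule` are placeholder names, not part of the release.

```python
# Hedged sketch of a few additions listed above; class names are placeholders.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset


class RandomDataModule(pl.LightningDataModule):
    def __init__(self, batch_size: int = 32):
        super().__init__()
        # New in 1.4: `save_hyperparameters` is supported in LightningDataModule (#3792)
        self.save_hyperparameters()

    def _loader(self):
        ds = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
        return DataLoader(ds, batch_size=self.hparams.batch_size)

    def train_dataloader(self):
        return self._loader()

    def val_dataloader(self):
        return self._loader()

    def predict_dataloader(self):
        return self._loader()


class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.cross_entropy(self.layer(x), y))

    def predict_step(self, batch, batch_idx, dataloader_idx=None):
        x, _ = batch
        return self.layer(x)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


trainer = pl.Trainer(
    accelerator="auto",  # 'cpu'|'gpu'|'tpu'|'ipu'|'auto' (#7808)
    devices=1,           # new `devices` flag (#8440)
    max_epochs=1,
)
model = TinyModel()
dm = RandomDataModule()

trainer.fit(model, dm)
trainer.validate(model, dm)                   # datamodule passed positionally (#7431)
trainer.predict(model, dm, ckpt_path="best")  # new `ckpt_path` argument (#7430); assumes
                                              # the default checkpoint callback saved one
```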
Changed
- Decoupled device parsing logic from Accelerator connector to Trainer (#8180)
- Changed the `Trainer`'s `checkpoint_callback` argument to allow only boolean values (#7539)
- Log epoch metrics before the `on_evaluation_end` hook (#7272)
- Explicitly disallow calling `self.log(on_epoch=False)` during epoch-only or single-call hooks (#7874)
- Changed these `Trainer` methods to be protected: `call_setup_hook`, `call_configure_sharded_model`, `pre_dispatch`, `dispatch`, `post_dispatch`, `call_teardown_hook`, `run_train`, `run_sanity_check`, `run_evaluate`, `run_evaluation`, `run_predict`, `track_output_for_epoch_end`
- Changed `metrics_to_scalars` to work with any collection or value (#7888)
- Changed `clip_grad_norm` to use `torch.nn.utils.clip_grad_norm_` (#7025)
- Validation is now always run inside the training epoch scope (#7357)
- `ModelCheckpoint` now runs at the end of the training epoch by default (#8389)
- `EarlyStopping` now runs at the end of the training epoch by default (#8286)
- Refactored Loops
  - Moved attributes `global_step`, `current_epoch`, `max/min_steps`, `max/min_epochs`, `batch_idx`, and `total_batch_idx` to TrainLoop (#7437)
  - Refactored result handling in training loop (#7506)
  - Moved attributes `hiddens` and `split_idx` to TrainLoop (#7507)
  - Refactored the logic around manual and automatic optimization inside the optimizer loop (#7526)
  - Simplified "should run validation" logic (#7682)
  - Simplified logic for updating the learning rate for schedulers (#7682)
  - Removed the `on_epoch` guard from the "should stop" validation check (#7701)
  - Refactored internal loop interface; added new classes `FitLoop`, `TrainingEpochLoop`, `TrainingBatchLoop` (#7871, #8077)
  - Removed `pytorch_lightning/trainer/training_loop.py` (#7985)
  - Refactored evaluation loop interface; added new classes `DataLoaderLoop`, `EvaluationLoop`, `EvaluationEpochLoop` (#7990, #8077)
  - Removed `pytorch_lightning/trainer/evaluation_loop.py` (#8056)
  - Restricted public access to several internal functions (#8024)
  - Refactored trainer `_run_*` functions and separate evaluation loops (#8065)
  - Refactored prediction loop interface; added new classes `PredictionLoop`, `PredictionEpochLoop` (#7700, #8077)
  - Removed `pytorch_lightning/trainer/predict_loop.py` (#8094)
  - Moved result teardown to the loops (#8245)
  - Improve `Loop` API to better handle children `state_dict` and `progress` (#8334)
- Refactored logging
  - Renamed and moved `core/step_result.py` to `trainer/connectors/logger_connector/result.py` (#7736)
  - Dramatically simplify the `LoggerConnector` (#7882)
  - `trainer.{logged,progress_bar,callback}_metrics` are now updated on-demand (#7882)
  - Completely overhaul the `Result` object in favor of `ResultMetric` (#7882)
  - Improve epoch-level reduction time and overall memory usage (#7882)
  - Allow passing `self.log(batch_size=...)` (#7891)
  - Each of the training loops now keeps its own results collection (#7891)
  - Remove `EpochResultStore` and `HookResultStore` in favor of `ResultCollection` (#7909)
  - Remove `MetricsHolder` (#7909)
- Moved `ignore_scalar_return_in_dp` warning suppression to the DataParallelPlugin class (#7421)
- Changed the behaviour when logging evaluation step metrics to no longer append `/epoch_*` to the metric name (#7351)
- Raised `ValueError` when a `None` value is `self.log`-ed (#7771)
- Changed `resolve_training_type_plugins` to allow setting `num_nodes` and `sync_batchnorm` from `Trainer` setting (#7026)
- Default `seed_everything(workers=True)` in the `LightningCLI` (#7504)
- Changed `model.state_dict()` in `CheckpointConnector` to allow `training_type_plugin` to customize the model's `state_dict()` (#7474)
- `MLflowLogger` now uses the env variable `MLFLOW_TRACKING_URI` as default tracking URI (#7457)
- Changed `Trainer` arg and functionality from `reload_dataloaders_every_epoch` to `reload_dataloaders_every_n_epochs` (#5043)
- Changed `WandbLogger(log_model={True/'all'})` to log models as artifacts (#6231)
- `MLFlowLogger` now accepts `run_name` as a constructor argument (#7622)
- Changed `teardown()` in `Accelerator` to allow `training_type_plugin` to customize `teardown` logic (#7579)
- `Trainer.fit` now raises an error when using manual optimization with unsupported features such as `gradient_clip_val` or `accumulate_grad_batches` (#7788)
- Accelerator hooks are called regardless if `LightningModule` overrides the same hooks (#7826)
- Moved profilers to their own file (#7822)
- The `on_after_backward` hook is now called on accumulating iterations. Use the `on_before_optimizer_step` hook to mimic the old behaviour (#8328) (see the sketch after this list)
- The mixed precision loss is no longer unscaled before the `on_after_backward` hook. Use the `on_before_optimizer_step` hook to mimic the old behaviour (#8328)
- The `TrainingTypePlugin.{pre,post}_backward` hooks no longer take the `optimizer, opt_idx, should_accumulate` arguments (#8328)
- The `PrecisionPlugin.backward` hook no longer returns a value (#8328)
- The `PrecisionPlugin.backward` hook no longer takes a `should_accumulate` argument (#8328)
- Added the `on_before_backward` hook (#7865)
- `LightningCLI` now aborts with a clearer message if config already exists and disables save config during `fast_dev_run` (#7963)
- Saved the `LightningCLI` config on `setup` and only on the main process (#8017)
- Dropped the `LightningCLI` `ArgumentParser` when pickling (#8017)
- Skip `broadcast` if distributed not initialized for the spawn plugins (#8017)
- `Trainer(resume_from_checkpoint=...)` now restores the model directly after `LightningModule.setup()`, which is before `LightningModule.configure_sharded_model()` (#7652)
- Moved `torch.cuda.set_device()` to enable collective calls earlier in setup (#8312)
- Used XLA utility API to move data to CPU (Single TPU core) (#8078)
- Improved error messages in `replace_sampler` when the `DataLoader` attributes are not included in the signature or the signature is missing optional arguments (#8519)
- Moved `DeviceDtypeModuleMixin` and `HyperparametersMixin` mixin to `core` (#8396)
- Return the `default_root_dir` as the `log_dir` when the logger is a `LoggerCollection` (#8187)
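The #8328 hook-timing change above is the one most likely to touch user code. Below is a hedged sketch, with a placeholder model and loss, of moving gradient inspection from `on_after_backward` to `on_before_optimizer_step` as the entry suggests.

```python
# Hedged sketch of the #8328 migration described above; GradNormModel is a placeholder.
import torch
import pytorch_lightning as pl


class GradNormModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def on_before_optimizer_step(self, optimizer, optimizer_idx):
        # Runs once per real optimizer step (skipped on accumulation-only iterations)
        # and, with mixed precision, after gradients have been unscaled -- the timing
        # `on_after_backward` had before this release.
        grads = [p.grad.detach().flatten() for p in self.parameters() if p.grad is not None]
        if grads:
            self.print("total grad norm:", torch.cat(grads).norm().item())

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```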
Deprecated
- Deprecated `LightningModule.loaded_optimizer_states_dict` (#8229)
- Standardized the dataloaders arguments of `trainer.{fit,validate,test,tune}` (#7431)
- Deprecated `DataModule` properties: `has_prepared_data`, `has_setup_fit`, `has_setup_validate`, `has_setup_test`, `has_setup_predict`, `has_teardown_fit`, `has_teardown_validate`, `has_teardown_test`, `has_teardown_predict` (#7657)
- Deprecated `TrainerModelHooksMixin` in favor of `pytorch_lightning.utilities.signature_utils` (#7422)
- Deprecated `num_nodes` and `sync_batchnorm` arguments in `DDPPlugin` and `DDPSpawnPlugin` (#7026)
- Deprecated `self.log(sync_dist_op)` in favor of `self.log(reduce_fx)` (#7891)
- Deprecated `is_overridden(model=...)` in favor of `is_overridden(instance=...)` (#7918)
- Deprecated automatically detaching returned extras with grads (#7994)
- Deprecated default value of `monitor` argument in EarlyStopping callback to enforce `monitor` as a required argument (#7907)
- Deprecated importing `rank_zero_{warn,deprecation}` directly from `pytorch_lightning.utilities.distributed` (#8085)
- Deprecated the use of `CheckpointConnector.hpc_load()` in favor of `CheckpointConnector.restore()` (#7652)
- Deprecated `ModelCheckpoint(every_n_val_epochs)` in favor of `ModelCheckpoint(every_n_epochs)` (#8383) (see the sketch after this list)
- Deprecated `DDPPlugin.task_idx` in favor of `DDPPlugin.local_rank` (#8203)
- Deprecated the `Trainer.train_loop` property in favor of `Trainer.fit_loop` (#8025)
- Deprecated the `Trainer.disable_validation` property in favor of `not Trainer.enable_validation` (#8291)
- Deprecated `mode` parameter in `ModelSummary` in favor of `max_depth` (#8062)
- Deprecated `reload_dataloaders_every_epoch` argument of `Trainer` in favor of `reload_dataloaders_every_n_epochs` (#5043)
- Deprecated `distributed_backend` argument for `Trainer` (#8575)
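For the deprecations that simply rename an argument, the replacements are already available in this release. A hedged before/after sketch with illustrative values only:

```python
# Hedged before/after sketch for a few of the renamed arguments above; values illustrative.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

# ModelCheckpoint(every_n_val_epochs=2)        -> every_n_epochs (#8383)
checkpoint = ModelCheckpoint(every_n_epochs=2)

# EarlyStopping()                              -> pass `monitor` explicitly (#7907)
early_stop = EarlyStopping(monitor="val_loss")

# Trainer(reload_dataloaders_every_epoch=True) -> reload_dataloaders_every_n_epochs (#5043)
trainer = Trainer(
    reload_dataloaders_every_n_epochs=1,
    callbacks=[checkpoint, early_stop],
)
```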
Removed
- Dropped official support/testing for PyTorch <1.6 (#8288)
- Removed `ProfilerConnector` (#7654)
- Pruned deprecated classification metrics from `pytorch_lightning.metrics.functional.classification` (#7499)
- Removed deprecated data parallel classes `LightningDataParallel` and `LightningDistributedDataParallel` from `pytorch_lightning.overrides.data_parallel` (#7510)
- Removed deprecated trainer attributes: `get_model` and `accelerator_backend` (#7502)
- Removed support for automatically monitoring the `val_loss` key with `ModelCheckpoint`. Pass your `monitor` of choice to the `ModelCheckpoint` instance instead (#8293)
- Removed support for `self.log(tbptt_reduce_fx)` and `self.log(tbptt_pad_token)`. Please open a discussion explaining your use case if you relied on these (#7644)
- Removed deprecated utils modules `model_utils`, `warning_utils`, `xla_device_utils` and partially `argparse_utils` (#7503)
- Removed `RPCPlugin` and `RPCSequentialPlugin`. If you were successfully using these plugins, please open a GitHub discussion about your use case (#8101)
- Removed deprecated trainer attributes: `on_cpu`, `on_tpu`, `use_tpu`, `on_gpu`, `use_dp`, `use_ddp`, `use_ddp2`, `use_horovod`, `use_single_gpu` (#7501)
- Removed deprecated `optimizer` argument in `LightningModule.manual_backward()`; toggling optimizers in manual optimization should be done using `LightningModule.{un}toggle_optimizer()` (#8287) (see the sketch after this list)
- Removed DeepSpeed FP16 Exception as FP32 is now supported (#8462)
- Removed environment variable `PL_EXP_VERSION` from DDP subprocesses (#7403)
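Since the `optimizer` argument of `manual_backward()` is gone, manual optimization now pairs `manual_backward` with explicit optimizer toggling. A hedged sketch with placeholder modules, losses, and batch shape; the `toggle_optimizer`/`untoggle_optimizer` signatures are assumed to match this release:

```python
# Hedged sketch of manual optimization without the removed
# `manual_backward(loss, optimizer)` signature (#8287). All modules/losses are placeholders.
import torch
import pytorch_lightning as pl


class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False
        self.gen = torch.nn.Linear(16, 16)
        self.disc = torch.nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        # `batch` assumed to be a float tensor of shape (B, 16)
        opt_g, opt_d = self.optimizers()

        # First optimizer: toggle so only its parameters require grad
        self.toggle_optimizer(opt_g, 0)
        g_loss = self.gen(batch).pow(2).mean()  # placeholder loss
        opt_g.zero_grad()
        self.manual_backward(g_loss)            # no `optimizer` argument anymore
        opt_g.step()
        self.untoggle_optimizer(0)

        # Second optimizer
        self.toggle_optimizer(opt_d, 1)
        d_loss = self.disc(batch).pow(2).mean()  # placeholder loss
        opt_d.zero_grad()
        self.manual_backward(d_loss)
        opt_d.step()
        self.untoggle_optimizer(1)

    def configure_optimizers(self):
        return (
            torch.optim.SGD(self.gen.parameters(), lr=0.1),
            torch.optim.SGD(self.disc.parameters(), lr=0.1),
        )
```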
Fixed
- Fixed the `GPUStatsMonitor` callbacks to use the correct GPU IDs if `CUDA_VISIBLE_DEVICES` is set (#8260)
- Fixed `lr_scheduler` checkpointed state by calling `update_lr_schedulers` before saving checkpoints (#7877)
- Fixed ambiguous warning when both overfit and train dataloader shuffling are enabled (#7685)
- Fixed dev debugger memory growing due to tracking events even when disabled (#7875)
- Fixed `None` loss keys getting added in `training_epoch_end` when using manual optimization and not returning a loss (#7772)
- Fixed a bug where `precision=64` with `accelerator='ddp_spawn'` would throw a pickle error (#6924)
- Do not override the existing `epoch` value in `logged_metrics` when already logged by the user (#7982)
- Support for manual optimization with DeepSpeed (#7970)
- Fixed `dataloader_idx` argument value when predicting with only one `DataLoader` (#7941)
- Fixed passing the `stage` argument of `Callback.{setup,teardown}` as a keyword (#7973)
- Fixed metrics generated during validation sanity checking so they are cleaned up at the end (#8171)
- Fixed `log_gpu_memory` metrics not being added to `logging` when nothing else is logged (#8174)
- Fixed a bug where calling `log` with a `Metric` instance would raise an error if it was a nested attribute of the model (#8181)
- Fixed a bug where using `precision=64` would cause buffers with complex dtype to be cast to real (#8208)
- Fixed `is_overridden` returning true for wrapped functions with no changes (#8296)
- Fixed a bug where `truncated_bptt_steps` would throw an AttributeError when the target RNN has multiple hidden states (#8145)
- Fixed `self.optimizers()` not returning a single optimizer if it had been wrapped (#8326)
- Fixed the `on_after_backward` hook not getting called when using manual optimization and no plugins (#8328)
- Fixed the `LightningModule.backward` hook only getting called with the `apex` plugin when using manual optimization (#8328)
- Fixed moving batch to device before sending it to the `on_*_batch_start`/`on_*_batch_end` callbacks and model hooks (#7378)
- Fixed passing a custom `DDPPlugin` when choosing `accelerator="ddp_cpu"` for the accelerator (#6208)
- Fixed missing call to `LightningModule.untoggle_optimizer` in training loop when running gradient accumulation with multiple optimizers (#8284)
- Fixed hash of `LightningEnum` to work with value instead of name (#8421)
- Fixed a bug where an extra checkpoint was saved at the end of training if the `val_check_interval` did not align with the number of training batches (#7724)
- Fixed `move_data_to_device` to return the batch if the object's `to` function didn't return `self` (#8433)
- Fixed progress bar updates for Pod Training (#8258)
- Fixed clearing dataloader references before attaching new dataloaders in consecutive `Trainer.{fit,validate,test,predict}` runs (#8442)
- Fixed memory leaks on GPU by moving `optimizer_states`, `ResultCollection.extra`, `ResultMetric` attributes, and `LoggerConnector` metrics to `cpu`. Also, delete the DDP wrapper on `teardown` (#8490)
- Fixed `SWA` callback using the `LightningModule` `prevent_trainer_and_dataloaders_deepcopy` context manager to avoid OOM (#8472)
- Fixed `ModelPruning` callback `on_save_checkpoint` to avoid making a `deepcopy` potentially leading to OOM (#8472)
- Fixed the sampler replacement logic for `DataLoader`s which do not define all `DataLoader` attributes as `__init__` parameters (#8519)
- Fixed DeepSpeed Windows support (#8488)
- Fixed DeepSpeed not properly setting the trainer `lr_schedulers` attribute (#8527)
- Fixed experiment version and log-dir divergence in DDP when using multiple `Trainer` instances in sequence (#7403)
- Enabled manual optimization for TPUs (#8458)
- Fixed `accumulate_grad_batches` not being recomputed during model reload (#5334)
- Fixed a `TypeError` when wrapping optimizers in the `HorovodPlugin` and running `Trainer.test` (#7840)
- Fixed `BackboneFinetuning` restoration (#8501)
- Fixed `lr_scheduler` with metric (e.g. `torch.optim.lr_scheduler.ReduceLROnPlateau`) when using `automatic_optimization = False` (#7643)
- Fixed `DeepSpeed` breaking with no schedulers (#8580)
Contributors
@00sapo @AffineParameter @ajtritt @akihironitta @ananthsub @aniketmaurya @aslisabanci @awaelchli @bamblebam @Borda @borisdayma @carmocca @dalek-who @DavidMChan @davors72 @dcfidalgo @ddrevicky @deepsource-autofix @djthegr8 @edenlightning @edgarriba @eladsegal @ethanwharris @eugeneh101 @fepegar @gaoteng-git @gtauzin @i-aki-y @janhenriklambrechts @jiwidi @justusschock @karthikrangasai @kaushikb11 @loic-beheshti @Lucklyric @ManuelPalermo @mauvilsa @maxoppelt @neggert @nikvaessen @nisheethlahoti @pre-commit-ci @rohitgr7 @ruotianluo @satishjasthi @SeanNaren @shirayu @shuyingsunshine21 @sid-sundrani @Sileadim @simran2905 @stancld @t-vi @tchaton @theblackfly @theodumont @tilman151 @tomy0000000 @tshu-w @vatch123 @WrRan @yifuwang
If we forgot someone, let us know :]