Releases: ghanvert/AcceleratorModule
v1.9.0
Fixed:
- 'self.metrics' being None when there are no metrics implemented, leading to errors.
- 'breakpoint()' did not work when 'dataloader_num_workers' was left at its default (num_processes) or set > 0. When debugging, 'dataloader_num_workers' is now set to 0.
- If the user returns a single tensor instead of a dictionary in 'validation_step', we now assume it is the loss value (see the sketch after this list).
- 'epoch' not correctly tracked.
- Added a RuntimeError to notify the user to align 'validation_step' with declared metrics.
- When appending metric tensors to a list, we now verify that the tensor being added has dimensions compatible with the last one appended, so we do not hit a shape error at the very end of the evaluation.
- Added support to run on CPU (this might break again, be careful).
- When calling a user-defined method of the underlying nn.Module (for example 'generate') under DDP (and possibly FSDP as well), an error was raised because DistributedDataParallel has no attribute called 'generate'. This does not happen with DeepSpeedEngine, which knows which functions are intended for inference and which for training.
- Gradient normalization (type 2) is now calculated without creating an intermediate list of tensors, avoiding some memory overhead.
- 'grad_norm' not being reported.
- When using DeepSpeed, floating-point elements in the batch required a user-handled autocast (since model inputs must be in half precision). We now handle this in the background so the user does not have to worry.
- Tensor Cores are now enabled only if the system supports them.
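To illustrate the single-tensor shortcut in 'validation_step' mentioned above, here is a minimal sketch, assuming an AcceleratorModule subclass that keeps its network in 'self.model' (the module body and attribute layout are illustrative, not prescribed by the library):

```python
import torch.nn as nn
from accmt import AcceleratorModule

class MyModule(AcceleratorModule):
    def __init__(self):
        super().__init__()
        self.model = nn.Linear(8, 1)

    def validation_step(self, key, batch):
        inputs, targets = batch
        loss = nn.functional.mse_loss(self.model(inputs), targets)
        return loss  # a bare tensor is now interpreted as {"loss": loss}
```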
What's new in Trainer:
- New argument 'batch_device_placement' to control whether batches are placed on CPU or GPU (a combined sketch of these arguments follows after this list).
- New argument 'prepare_batch' to handle specific autocast cases for the batch (DeepSpeed).
- New argument 'safe_steps' to retry running through a batch one time if it fails.
- New argument 'destroy_after_training' to destroy the process group at the end of the 'fit' function.
- New argument 'enable_prepare_logging' to enable logging during model preparation. This handles extreme scenarios like DeepSpeed, which adds many logging steps when preparing the model. It defaults to False, so users will no longer see DeepSpeed messages by default.
- New argument 'multiple_checkpoints' to enable keeping multiple checkpoints.
- New argument 'max_checkpoints' to set a maximum number of checkpoints to save in disk.
- New argument 'gradient_checkpointing' to enable gradient checkpointing if implemented in the nn.Module (as in the transformers library).
- New argument 'compile_kwargs' to add additional customization for torch.compile.
- New argument 'safe_mode'. Forward passes of the model run through the wrapper instead of skipping it (the old behavior).
- 'log_with' now receives a string value (like 'mlflow') instead of a class.
- NOTE: We're only supporting MLFlow as a tracker for now. Future updates will have all trackers implemented.
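Putting several of the new arguments together, a hedged sketch (only the argument names come from these notes; the values shown and any omitted required arguments are assumptions):

```python
from accmt import Trainer

trainer = Trainer(
    batch_device_placement=True,       # place batches on GPU (False = keep on CPU)
    prepare_batch=True,                # handle DeepSpeed autocast cases for the batch
    safe_steps=True,                   # retry a failing batch one time
    destroy_after_training=True,       # destroy the process group after fit()
    enable_prepare_logging=False,      # hide e.g. DeepSpeed preparation messages
    multiple_checkpoints=True,
    max_checkpoints=3,                 # keep at most 3 checkpoints on disk
    gradient_checkpointing=True,       # if the nn.Module implements it
    compile_kwargs={"mode": "max-autotune"},
    log_with="mlflow",                 # tracker is now a string, not a class
    # other required Trainer arguments omitted for brevity
)
```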
What's new in AcceleratorModule:
- New 'log' function that logs a dictionary of key-value pairs to the tracker, taking into consideration the 'log_every' argument in Trainer. Use 'log_' to bypass this behavior (see the sketch after this list).
- New 'freeze' helper function to freeze a module (requires_grad=False).
- New 'unfreeze' helper function to unfreeze a module (requires_grad=True).
- New 'pad' helper function to pad tensors.
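A combined sketch of these helpers. Only the helper names come from these notes; 'training_step' (assumed here as the training counterpart of 'validation_step') and the module layout are illustrative assumptions:

```python
import torch.nn as nn
from accmt import AcceleratorModule

class MyModule(AcceleratorModule):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 1))

    def training_step(self, batch):
        inputs, targets = batch
        self.freeze(self.model[0])                  # requires_grad=False
        loss = nn.functional.mse_loss(self.model(inputs), targets)
        self.log({"train/loss": loss.item()})       # honors Trainer's 'log_every'
        self.log_({"train/raw_loss": loss.item()})  # logged on every call
        self.unfreeze(self.model[0])                # requires_grad=True
        return loss
```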
What's new in metrics:
- Tensors are converted from half precision to float32 precision.
- New 'MetricParallel' module. This is the same implementation as 'Metric', but it uses all processes to execute evaluation, which means tensors are not gathered to the main process. When computing the final value, a communication across all processes averages the metric values. This is useful when calculating metrics in a single process is too slow and the metric itself can run in parallel (see the sketch below).
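A hedged sketch of subclassing 'MetricParallel', reusing the 'compute(predictions, references)' interface mentioned in the v1.7.4.3 notes below (the import path and the return format are assumptions):

```python
import torch
from accmt import MetricParallel  # import path is an assumption

class Accuracy(MetricParallel):
    def compute(self, predictions: torch.Tensor, references: torch.Tensor):
        # Runs on every process over its local shard; per the notes above,
        # the per-process values are then averaged across all processes.
        return {"accuracy": (predictions == references).float().mean().item()}
```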
What's new in Monitor:
- Added an extra argument 'checkpoint' to log to the tracker whenever a checkpoint is made (see the example below).
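For example (the import path and any other Monitor arguments are assumptions):

```python
from accmt import Monitor  # import path is an assumption

monitor = Monitor(checkpoint=True)  # log to the tracker whenever a checkpoint is made
```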
[NEW] Hyper Parameter Search
This is a new implementation for running hyperparameter search using Optuna as the backend. Everything is handled by HyperParameterSearch, which can be imported directly from 'accmt' (see the sketch below).
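A minimal sketch, assuming an Optuna-style objective. Only the class name and the 'accmt' import come from these notes; the constructor usage is an assumption:

```python
from accmt import HyperParameterSearch

def objective(trial):
    # Optuna-style objective: suggest hyperparameters, train, return a score.
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    # ... build a Trainer with this learning rate, fit it, evaluate ...
    return lr  # placeholder: return your validation metric here

search = HyperParameterSearch(objective)  # constructor signature is an assumption
```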
[BETA] Asynchronous Evaluation
Sometimes evaluation and model saving can be a big bottleneck. If you assign, say, 10% of your resources to evaluation and 90% to training, you could see 30%+ speedups, since the training processes no longer wait for evaluation to finish; instead, they dispatch evaluations to an evaluation group that waits for requests from the training group. One way to do this would be to write models to disk and have another process watch for them to appear. We take a different approach: model parameters are moved directly to the evaluation group without writing to disk. This works because there is a CPU tunnel between the training and evaluation groups through which they can send data (SharedMemory, or SHM).
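A conceptual sketch of the SHM idea using only the Python standard library (this is not accmt's actual tunnel implementation):

```python
import numpy as np
from multiprocessing import shared_memory

# Training side: copy parameters into a named shared-memory block.
params = np.random.rand(1024).astype(np.float32)
shm = shared_memory.SharedMemory(create=True, size=params.nbytes, name="model_params")
np.ndarray(params.shape, dtype=params.dtype, buffer=shm.buf)[:] = params

# Evaluation side: attach to the same block, no disk I/O involved.
shm_eval = shared_memory.SharedMemory(name="model_params")
received = np.ndarray(params.shape, dtype=np.float32, buffer=shm_eval.buf).copy()

shm_eval.close()
shm.close()
shm.unlink()
```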
Why is it a big deal? Nothing in your code changes. You only change your launch command:
Before:
```
accmt launch -n=0,1,2,3 ...
```
After:
```
accmt alaunch -n=0,1,2 -e=3 ...
```
Here, 'alaunch' (or 'async-launch') handles everything for you; you only specify the GPUs for training with '-n' (or '-N') and the GPUs for evaluation with '-e'.
WARNING: This is a BETA feature and is not meant to be used yet.
When v2.0?
There are some tests I need to implement to keep everything solid and avoid suddenly breaking something. There are also many bugs and errors to handle, as well as all the trackers to implement.
v2.0 is meant to be the first stable version and official release that engineers and researchers can use for their work.
I am also planning to implement multi-node launches. Since we use Accelerate as the backend, this library is quite scalable.
v1.8.0
This is an official full rewrite of the library, getting ready for the 2.0 update. It includes many bug fixes along with new features. Some of your old code might need to change just a bit, and previous training runs cannot run with this new update 😢 (unless you make some updates to your checkpoint folder, which might not be efficient).
REPLACEMENTS AND REMOVALS:
- The 'model_saving' parameter in the Trainer class no longer exists. It was replaced by the 'trainer.register_model_saving' function.
- The 'model_saving_below' and 'model_saving_above' parameters in Trainer were removed. They now live in 'register_model_saving'.
- The 'optim' parameter no longer exists in the HyperParameters object (or dictionary). It was replaced by 'optimizer'.
- The 'collate_fn' parameter in Trainer no longer exists. The available options are 'collate_fn_train' and 'collate_fn_val'.
- The 'checkpoint' parameter in Trainer no longer exists. It was replaced by a boolean parameter 'enable_checkpointing' (defaults to True).
- The declaration of 'validation_step' has changed from 'def validation_step(self, batch)' to 'def validation_step(self, key, batch)' (see the sketch after this list).
- The internal 'status_dict' no longer exists. It was replaced by 'state', a class containing the previous and new parameters that track training state.
- 'report_train_loss_per_epoch' was removed and its functionality is handled by 'log_every'. If this parameter is set to a value less than 0 (e.g. -1), train loss is reported at the end of the epoch.
- The 'handlers' parameter in Trainer was removed since it was producing a lot of errors and crashes.
- The 'shuffle_validation' parameter in Trainer was removed since it does not make sense to shuffle a validation dataset.
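A hedged sketch of the renamed pieces above ('register_model_saving' arguments and the module body are illustrative assumptions):

```python
import torch.nn as nn
from accmt import AcceleratorModule, Trainer

class MyModule(AcceleratorModule):
    def __init__(self):
        super().__init__()
        self.model = nn.Linear(8, 1)

    # New signature: 'key' identifies which evaluation dataset the batch is from.
    def validation_step(self, key, batch):
        inputs, targets = batch
        return {"loss": nn.functional.mse_loss(self.model(inputs), targets)}

trainer = Trainer(enable_checkpointing=True)      # replaces the old 'checkpoint'
trainer.register_model_saving("best_valid_loss")  # replaces the old 'model_saving'
```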
NEW FEATURES:
- Multiple evaluations supported! You can now pass a list or dictionary of evaluation datasets. Each dataset has a corresponding key that can be accessed in the 'validation_step' function.
- Model saving with additional syntax to save best models based on metrics of specific datasets and on best values of different metrics.
- New parameter 'compile_kwargs' in Trainer. This is a dictionary of additional kwargs for torch.compile.
- A new, better-looking progress bar! This is the same library (tqdm), but with colors and a smaller size. It also removes some visual glitches where the training and validation progress bars overlapped.
- The 'loop' function in Trainer can be modified by inheritance! If you want more customization in your training loop, you can create another trainer class inheriting from Trainer and override the loop.
- Better code, better throughput! Since this is an almost complete rewrite of the library, every decision in the code was made to optimize throughput and to make the code more readable for better maintainability.
- The 'callback' parameter in Trainer can now contain multiple callbacks!
- New 'disable_model_saving' parameter in Trainer.
- New 'safe_mode' parameter in Trainer. Running in safe mode (the default) means forward passes go through the corresponding wrapper (DDP, FSDP or DeepSpeedEngine). If 'safe_mode=False', the wrapper is skipped and the model is used directly. This slightly improves throughput, although gradients are not guaranteed to synchronize correctly across all devices.
- With the new addition of multiple evaluations, the 'metrics' parameter in Trainer can be a dictionary, where keys are dataset keys and values are the metrics to apply for that particular evaluation dataset. In short, you can now have different metrics per dataset (see the sketch after this list).
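A hedged sketch of per-dataset metrics (the metric names and the exact value types accepted are assumptions):

```python
from accmt import Trainer

# Keys name each evaluation dataset; the same keys arrive as the 'key'
# argument of 'validation_step' and select which metrics are applied.
trainer = Trainer(
    metrics={
        "wikitext": ["perplexity"],       # metrics only for the 'wikitext' set
        "books": ["perplexity", "bleu"],  # a different set for 'books'
    },
    # other Trainer arguments omitted for brevity
)
```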
BUG FIXES:
- The 'clip_grad' parameter in Trainer was not working with DeepSpeed, because the configuration file sets an automatic default value of 1.0. We changed this behavior to always specify gradient clipping through the 'clip_grad' parameter. The default gradient clipping value, independent of the strategy applied, is now always 0.
- 'grad_accumulation_steps' was not being correctly handled and led to incorrect results.
- 'patience' was not being correctly handled. If set higher than 0, model saving would eventually stop even when results were better than previous ones, because patience was decremented every time the model was due to be saved (after evaluation).
- The first evaluation with 'eval_when_start=True' no longer saves the model or a checkpoint, because there is no point in saving the model when there is no progress at all.
- The last evaluation with 'eval_when_finish=True' no longer runs twice when another evaluation was already done at the last step of an epoch or at the end of an epoch.
- Resuming from a checkpoint always set global seeds based on the epoch (0, 1, 2, etc.), regardless of whether the user had already set a seed. To mitigate this, setting a seed with the 'set_seed' function now stores a global seed that is reused at each new epoch, so seeds are set to GLOBAL_SEED + EPOCH (see the sketch after this list).
- Resuming from a checkpoint produced wrong logging results, for both the train loss report and the current step number (at least on the first log after resuming).
v1.7.7
v1.7.5
v1.7.4.3
What's Changed
- Modified metric system by @ghanvert in #8: Supporting many arguments for 'compute' function (not just 'predictions' and 'references').
- Changed metric step numbers by @ghanvert in #10: Metric step numbers now starting at 0 instead of 1.
- Added a new argument in Trainer, 'report_train_loss_per_epoch', to report train loss at the end of every epoch. For now, you might also want to set 'log_every' to None so that loss is not reported per step.
- Some minor optimizations related to tracking tensors.
v1.7.4.2 Starting point, preparing everything towards 2.0!
I haven't released tags for a while. From now on, this will be the common practice.
I'm preparing everything for version 2.0, which is going to be the first major release containing a stable version and more features. Feel free to contribute to this project, you can suggest new features or add new ones if you want.
Some of the features that version 2.0 will have are:
- Support for all trackers (for now, we only have stable results with MLFlow and TensorBoard).
- HyperParameter Search integration.
- Multi-model support. For now, we're only supporting 'model' and 'teacher'.
- Better ways to run logic code depending on the state of the internal training loop.
- Multiple evaluations in 'Trainer'.
- Parallel evaluation out of the box (instead of 'Trainer', we'll have something like 'Evaluator').
- Remove need to return 'loss' in 'validation_step' function in 'Trainer'.
- Better ways to create/define metrics.
- And some more!
v1.1.2
ACCMT v1.1.2 changelog:
- Fixed 'function' type error.
- 'num_warmup_steps' can be a ratio (a float value between 0 and 1) representing the warmup ratio, from which warmup steps are calculated automatically (see the sketch after this list).
- 'warmup_ratio' added for scheduler configuration if wanted.
- Added 'dataloader_pin_memory' option for Trainer arguments.
- Added 'dataloader_num_workers' option for Trainer arguments.
- Made 'status_dict' optional to pass as an argument in step functions.
- Added 'evaluations_done' key to 'status_dict', in case ACCMT was updated from a version < 1.1.0 to a higher one.
- 'allow_tf32' can now be correctly imported.
- Added clean documentation.
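For instance (where exactly this value is configured is an assumption):

```python
scheduler_config = {
    # A float in (0, 1) is read as a warmup ratio: here, 10% of total training
    # steps are used for warmup; an integer would be an absolute step count.
    "num_warmup_steps": 0.1,
}
```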
v1.1.1
ACCMT v1.1.1
- 'num_warmup_steps' can be a ratio (a float value between 0 and 1) representing the warmup ratio, from which warmup steps are calculated automatically.
- 'warmup_ratio' added for scheduler configuration if wanted.
- Added 'dataloader_pin_memory' option for Trainer arguments.
- Added 'dataloader_num_workers' option for Trainer arguments.
- Made 'status_dict' optional to pass as an argument in step functions.
- Added 'evaluations_done' key to 'status_dict', in case ACCMT was updated from a version < 1.1.0 to a higher one.
- 'allow_tf32' can now be correctly imported.
- Added clean documentation.
v1.1.0
Bug fix:
- When resuming from checkpoint, one batch was being repeated.
Added new features:
- Support for automatic FSDP integration and adaptation.
- 'status_dict' integration for correct model saving and checkpointing. Can be accessed from step functions.
- 'checkpoint_every' argument can now replace 'enable_checkpointing' and 'checkpoint_strat'.
- Checkpointing can also be done every N evaluations (see the sketch after this list).
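A hedged sketch (the accepted value format for 'checkpoint_every' is an assumption):

```python
from accmt import Trainer

# 'checkpoint_every' replaces 'enable_checkpointing' and 'checkpoint_strat';
# the string format below (every 2 evaluations) is a guessed example.
trainer = Trainer(checkpoint_every="2eval")
```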
Install via pip:
```
pip install -U accmt
```