Releases: ghanvert/AcceleratorModule
v1.9.0
Fixed:
- 'self.metrics' being None when there are no metrics implemented, leading to errors.
- 'breakpoint()' did not work when 'dataloader_num_workers' was left at its default (num_processes) or set > 0. When debugging, 'dataloader_num_workers' is now set to 0.
- If the user returns a single tensor instead of a dictionary in 'validation_step', we now assume it is the loss value (see the sketch after this list).
- 'epoch' not correctly tracked.
- Added a RuntimeError to notify the user to align 'validation_step' with declared metrics.
- When appending metric tensors to a list, we now verify that the tensor being added has dimensions compatible with the last one appended, so we do not hit a shape error at the very end of the evaluation.
- Added support to run on CPU (this might break again, be careful).
- When calling a user-defined method of the underlying nn.Module (for example 'generate') under DDP (and possibly FSDP as well), an error was raised because DistributedDataParallel has no attribute called 'generate'. This does not happen with DeepSpeedEngine, which knows which functions are intended for inference and which for training.
- Gradient normalization (type 2) is now calculated without creating an intermediate list of tensors, avoiding some memory overhead.
- 'grad_norm' not being reported.
- When using DeepSpeed, floating-point elements in the batch required a user-handled autocast (since model inputs must be in half precision). We now handle this in the background so the user does not have to worry.
- Tensor Cores are now enabled only if the system supports them.
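To illustrate the single-tensor shortcut in 'validation_step' mentioned above, here is a minimal sketch, assuming an AcceleratorModule subclass that keeps its network in 'self.model' (the module body and attribute layout are illustrative, not prescribed by the library):

```python
import torch.nn as nn
from accmt import AcceleratorModule

class MyModule(AcceleratorModule):
    def __init__(self):
        super().__init__()
        self.model = nn.Linear(8, 1)

    def validation_step(self, key, batch):
        inputs, targets = batch
        loss = nn.functional.mse_loss(self.model(inputs), targets)
        return loss  # a bare tensor is now interpreted as {"loss": loss}
```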
What's new in Trainer:
- New argument 'batch_device_placement' to control whether batches are placed on CPU or GPU (a combined sketch of these arguments follows after this list).
- New argument 'prepare_batch' to handle specific autocast cases for the batch (DeepSpeed).
- New argument 'safe_steps' to retry running through a batch one time if it fails.
- New argument 'destroy_after_training' to destroy the process group at the end of the 'fit' function.
- New argument 'enable_prepare_logging' to enable logging during model preparation. This handles extreme scenarios like DeepSpeed, which adds many logging steps when preparing the model. It defaults to False, so users will no longer see DeepSpeed messages by default.
- New argument 'multiple_checkpoints' to enable keeping multiple checkpoints.
- New argument 'max_checkpoints' to set a maximum number of checkpoints to save in disk.
- New argument 'gradient_checkpointing' to enable gradient checkpointing if implemented in the nn.Module (as in the transformers library).
- New argument 'compile_kwargs' to add additional customization for torch.compile.
- New argument 'safe_mode'. Forward passes of the model run through the wrapper instead of skipping it (the old behavior).
- 'log_with' now receives a string value (like 'mlflow') instead of a class.
- NOTE: We're only supporting MLFlow as a tracker for now. Future updates will have all trackers implemented.
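Putting several of the new arguments together, a hedged sketch (only the argument names come from these notes; the values shown and any omitted required arguments are assumptions):

```python
from accmt import Trainer

trainer = Trainer(
    batch_device_placement=True,       # place batches on GPU (False = keep on CPU)
    prepare_batch=True,                # handle DeepSpeed autocast cases for the batch
    safe_steps=True,                   # retry a failing batch one time
    destroy_after_training=True,       # destroy the process group after fit()
    enable_prepare_logging=False,      # hide e.g. DeepSpeed preparation messages
    multiple_checkpoints=True,
    max_checkpoints=3,                 # keep at most 3 checkpoints on disk
    gradient_checkpointing=True,       # if the nn.Module implements it
    compile_kwargs={"mode": "max-autotune"},
    log_with="mlflow",                 # tracker is now a string, not a class
    # other required Trainer arguments omitted for brevity
)
```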
What's new in AcceleratorModule:
- New 'log' function that logs a dictionary of key-value pairs to the tracker, taking into consideration the 'log_every' argument in Trainer. Use 'log_' to bypass this behavior (see the sketch after this list).
- New 'freeze' helper function to freeze a module (requires_grad=False).
- New 'unfreeze' helper function to unfreeze a module (requires_grad=True).
- New 'pad' helper function to pad tensors.
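A combined sketch of these helpers. Only the helper names come from these notes; 'training_step' (assumed here as the training counterpart of 'validation_step') and the module layout are illustrative assumptions:

```python
import torch.nn as nn
from accmt import AcceleratorModule

class MyModule(AcceleratorModule):
    def __init__(self):
        super().__init__()
        self.model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 1))

    def training_step(self, batch):
        inputs, targets = batch
        self.freeze(self.model[0])                  # requires_grad=False
        loss = nn.functional.mse_loss(self.model(inputs), targets)
        self.log({"train/loss": loss.item()})       # honors Trainer's 'log_every'
        self.log_({"train/raw_loss": loss.item()})  # logged on every call
        self.unfreeze(self.model[0])                # requires_grad=True
        return loss
```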
What's new in metrics:
- Tensors are converted from half precision to float32 precision.
- New 'MetricParallel' module. This is the same implementation as 'Metric', but it uses all processes to execute evaluation, which means tensors are not gathered to the main process. When computing the final value, a communication across all processes averages the metric values. This is useful when calculating metrics in a single process is too slow and the metric itself can run in parallel (see the sketch below).
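A hedged sketch of subclassing 'MetricParallel', reusing the 'compute(predictions, references)' interface mentioned in the v1.7.4.3 notes below (the import path and the return format are assumptions):

```python
import torch
from accmt import MetricParallel  # import path is an assumption

class Accuracy(MetricParallel):
    def compute(self, predictions: torch.Tensor, references: torch.Tensor):
        # Runs on every process over its local shard; per the notes above,
        # the per-process values are then averaged across all processes.
        return {"accuracy": (predictions == references).float().mean().item()}
```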
What's new in Monitor:
- Added an extra argument 'checkpoint' to log to the tracker whenever a checkpoint is made (see the example below).
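For example (the import path and any other Monitor arguments are assumptions):

```python
from accmt import Monitor  # import path is an assumption

monitor = Monitor(checkpoint=True)  # log to the tracker whenever a checkpoint is made
```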
[NEW] Hyper Parameter Search
This is a new implementation for running hyperparameter search using Optuna as the backend. Everything is handled by HyperParameterSearch, which can be imported directly from 'accmt' (see the sketch below).
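A minimal sketch, assuming an Optuna-style objective. Only the class name and the 'accmt' import come from these notes; the constructor usage is an assumption:

```python
from accmt import HyperParameterSearch

def objective(trial):
    # Optuna-style objective: suggest hyperparameters, train, return a score.
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    # ... build a Trainer with this learning rate, fit it, evaluate ...
    return lr  # placeholder: return your validation metric here

search = HyperParameterSearch(objective)  # constructor signature is an assumption
```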
[BETA] Asynchronous Evaluation
Sometimes evaluation and model saving can be a big bottleneck. If you assign, say, 10% of your resources to evaluation and 90% to training, you could see 30%+ speedups, since the training processes no longer wait for evaluation to finish; instead, they dispatch evaluations to an evaluation group that waits for requests from the training group. One way to do this would be to write models to disk and have another process watch for them to appear. We take a different approach: model parameters are moved directly to the evaluation group without writing to disk. This works because there is a CPU tunnel between the training and evaluation groups through which they can send data (SharedMemory, or SHM).
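A conceptual sketch of the SHM idea using only the Python standard library (this is not accmt's actual tunnel implementation):

```python
import numpy as np
from multiprocessing import shared_memory

# Training side: copy parameters into a named shared-memory block.
params = np.random.rand(1024).astype(np.float32)
shm = shared_memory.SharedMemory(create=True, size=params.nbytes, name="model_params")
np.ndarray(params.shape, dtype=params.dtype, buffer=shm.buf)[:] = params

# Evaluation side: attach to the same block, no disk I/O involved.
shm_eval = shared_memory.SharedMemory(name="model_params")
received = np.ndarray(params.shape, dtype=np.float32, buffer=shm_eval.buf).copy()

shm_eval.close()
shm.close()
shm.unlink()
```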
Why is it a big deal? Nothing in your code changes. You only change your launch command:
Before:
```
accmt launch -n=0,1,2,3 ...
```
After:
```
accmt alaunch -n=0,1,2 -e=3 ...
```
Here, 'alaunch' (or 'async-launch') handles everything for you; you only specify the GPUs for training with '-n' (or '-N') and the GPUs for evaluation with '-e'.
WARNING: This is a BETA feature and is not meant to be used yet.
When v2.0?
There are some tests I need to implement to keep everything solid and avoid suddenly breaking something. There are also many bugs and errors to handle, as well as all the trackers to implement.
v2.0 is meant to be the first stable version and official release that engineers and researchers can use for their work.
I am also planning to implement multi-node launches. Since we use Accelerate as the backend, this library is quite scalable.
v1.8.0
This is an official full rewrite of the library, getting ready for the 2.0 update. It includes many bug fixes along with new features. Some of your old code might need to change just a bit, and previous training runs cannot run with this new update 😢 (unless you make some updates to your checkpoint folder, which might not be efficient).
REPLACEMENTS AND REMOVALS:
- The 'model_saving' parameter in the Trainer class no longer exists. It was replaced by the 'trainer.register_model_saving' function.
- The 'model_saving_below' and 'model_saving_above' parameters in Trainer were removed. They now live in 'register_model_saving'.
- The 'optim' parameter no longer exists in the HyperParameters object (or dictionary). It was replaced by 'optimizer'.
- The 'collate_fn' parameter in Trainer no longer exists. The available options are 'collate_fn_train' and 'collate_fn_val'.
- The 'checkpoint' parameter in Trainer no longer exists. It was replaced by a boolean parameter 'enable_checkpointing' (defaults to True).
- The declaration of 'validation_step' has changed from 'def validation_step(self, batch)' to 'def validation_step(self, key, batch)' (see the sketch after this list).
- The internal 'status_dict' no longer exists. It was replaced by 'state', a class containing the previous and new parameters that track training state.
- 'report_train_loss_per_epoch' was removed and its functionality is handled by 'log_every'. If this parameter is set to a value less than 0 (e.g. -1), train loss is reported at the end of the epoch.
- The 'handlers' parameter in Trainer was removed since it was producing a lot of errors and crashes.
- The 'shuffle_validation' parameter in Trainer was removed since it does not make sense to shuffle a validation dataset.
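A hedged sketch of the renamed pieces above ('register_model_saving' arguments and the module body are illustrative assumptions):

```python
import torch.nn as nn
from accmt import AcceleratorModule, Trainer

class MyModule(AcceleratorModule):
    def __init__(self):
        super().__init__()
        self.model = nn.Linear(8, 1)

    # New signature: 'key' identifies which evaluation dataset the batch is from.
    def validation_step(self, key, batch):
        inputs, targets = batch
        return {"loss": nn.functional.mse_loss(self.model(inputs), targets)}

trainer = Trainer(enable_checkpointing=True)      # replaces the old 'checkpoint'
trainer.register_model_saving("best_valid_loss")  # replaces the old 'model_saving'
```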
NEW FEATURES:
- Multiple evaluations supported! You can now pass a list or dictionary of evaluation datasets. Each dataset has a corresponding key that can be accessed in the 'validation_step' function.
- Model saving with additional syntax to save best models based on metrics of specific datasets and on best values of different metrics.
- New parameter 'compile_kwargs' in Trainer. This is a dictionary of additional kwargs for torch.compile.
- A new, better-looking progress bar! This is the same library (tqdm), but with colors and a smaller size. It also removes some visual glitches where the training and validation progress bars overlapped.
- The 'loop' function in Trainer can be modified by inheritance! If you want more customization in your training loop, you can create another trainer class inheriting from Trainer and override the loop.
- Better code, better throughput! Since this is an almost complete rewrite of the library, every decision in the code was made to optimize throughput and to make the code more readable for better maintainability.
- The 'callback' parameter in Trainer can now contain multiple callbacks!
- New 'disable_model_saving' parameter in Trainer.
- New 'safe_mode' parameter in Trainer. Running in safe mode (the default) means forward passes go through the corresponding wrapper (DDP, FSDP or DeepSpeedEngine). If 'safe_mode=False', the wrapper is skipped and the model is used directly. This slightly improves throughput, although gradients are not guaranteed to synchronize correctly across all devices.
- With the new addition of multiple evaluations, the 'metrics' parameter in Trainer can be a dictionary, where keys are dataset keys and values are the metrics to apply for that particular evaluation dataset. In short, you can now have different metrics per dataset (see the sketch after this list).
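A hedged sketch of per-dataset metrics (the metric names and the exact value types accepted are assumptions):

```python
from accmt import Trainer

# Keys name each evaluation dataset; the same keys arrive as the 'key'
# argument of 'validation_step' and select which metrics are applied.
trainer = Trainer(
    metrics={
        "wikitext": ["perplexity"],       # metrics only for the 'wikitext' set
        "books": ["perplexity", "bleu"],  # a different set for 'books'
    },
    # other Trainer arguments omitted for brevity
)
```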
BUG FIXES:
- The 'clip_grad' parameter in Trainer was not working with DeepSpeed, because the configuration file sets an automatic default value of 1.0. We changed this behavior to always specify gradient clipping through the 'clip_grad' parameter. The default gradient clipping value, independent of the strategy applied, is now always 0.
- 'grad_accumulation_steps' was not being correctly handled and led to incorrect results.
- 'patience' was not being correctly handled. If set higher than 0, model saving would eventually stop even when results were better than previous ones, because patience was decremented every time the model was due to be saved (after evaluation).
- The first evaluation with 'eval_when_start=True' no longer saves the model or a checkpoint, because there is no point in saving the model when there is no progress at all.
- The last evaluation with 'eval_when_finish=True' no longer runs twice when another evaluation was already done at the last step of an epoch or at the end of an epoch.
- Resuming from a checkpoint always set global seeds based on the epoch (0, 1, 2, etc.), regardless of whether the user had already set a seed. To mitigate this, setting a seed with the 'set_seed' function now stores a global seed that is reused at each new epoch, so seeds are set to GLOBAL_SEED + EPOCH (see the sketch after this list).
- Resuming from a checkpoint produced wrong logging results, for both the train loss report and the current step number (at least on the first log after resuming).
v1.7.7
v1.7.5
v1.7.4.3
What's Changed
- Modified metric system by @ghanvert in #8: Supporting many arguments for 'compute' function (not just 'predictions' and 'references').
- Changed metric step numbers by @ghanvert in #10: Metric step numbers now starting at 0 instead of 1.
- Added a new argument in Trainer, 'report_train_loss_per_epoch', to report train loss at the end of every epoch. For now, you might also want to set 'log_every' to None so that loss is not reported per step.
- Some minor optimizations related to tracking tensors.
v1.7.4.2 Starting point, preparing everything towards 2.0!
I haven't released tags for a while. From now on, this will be the common practice.
I'm preparing everything for version 2.0, which is going to be the first major release containing a stable version and more features. Feel free to contribute to this project, you can suggest new features or add new ones if you want.
Some of the features that version 2.0 will have are:
- Support for all trackers (for now, we only have stable results with MLFlow and TensorBoard).
- HyperParameter Search integration.
- Multi-model support. For now, we're only supporting 'model' and 'teacher'.
- Better ways to run logic code depending on the state of the internal training loop.
- Multiple evaluations in 'Trainer'.
- Parallel evaluation out of the box (instead of 'Trainer', we'll have something like 'Evaluator').
- Remove need to return 'loss' in 'validation_step' function in 'Trainer'.
- Better ways to create/define metrics.
- And some more!
v1.1.2
ACCMT v1.1.2 changelog:
- Fixed 'function' type error.
- 'num_warmup_steps' can be a ratio (a float value between 0 and 1) representing the warmup ratio, from which warmup steps are calculated automatically (see the sketch after this list).
- 'warmup_ratio' added for scheduler configuration if wanted.
- Added 'dataloader_pin_memory' option for Trainer arguments.
- Added 'dataloader_num_workers' option for Trainer arguments.
- Made 'status_dict' optional to pass as an argument in step functions.
- Added 'evaluations_done' key to 'status_dict', in case ACCMT was updated from a version < 1.1.0 to a higher one.
- 'allow_tf32' can now be correctly imported.
- Added clean documentation.
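For instance (where exactly this value is configured is an assumption):

```python
scheduler_config = {
    # A float in (0, 1) is read as a warmup ratio: here, 10% of total training
    # steps are used for warmup; an integer would be an absolute step count.
    "num_warmup_steps": 0.1,
}
```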
v1.1.1
ACCMT v1.1.1
- 'num_warmup_steps' can be a ratio (a float value between 0 and 1) representing the warmup ratio, from which warmup steps are calculated automatically.
- 'warmup_ratio' added for scheduler configuration if wanted.
- Added 'dataloader_pin_memory' option for Trainer arguments.
- Added 'dataloader_num_workers' option for Trainer arguments.
- Made 'status_dict' optional to pass as an argument in step functions.
- Added 'evaluations_done' key to 'status_dict', in case ACCMT was updated from a version < 1.1.0 to a higher one.
- 'allow_tf32' can now be correctly imported.
- Added clean documentation.
v1.1.0
Bug fix:
- When resuming from checkpoint, one batch was being repeated.
Added new features:
- Support for automatic FSDP integration and adaptation.
- 'status_dict' integration for correct model saving and checkpointing. Can be accessed from step functions.
- 'checkpoint_every' argument can now replace 'enable_checkpointing' and 'checkpoint_strat'.
- Checkpointing can also be done every N evaluations (see the sketch after this list).
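A hedged sketch (the accepted value format for 'checkpoint_every' is an assumption):

```python
from accmt import Trainer

# 'checkpoint_every' replaces 'enable_checkpointing' and 'checkpoint_strat';
# the string format below (every 2 evaluations) is a guessed example.
trainer = Trainer(checkpoint_every="2eval")
```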
Install via pip:
```
pip install -U accmt
```