
v1.9.0

@ghanvert released this 12 May 05:51

Fixed:

  • 'self.metrics' being None when there are no metrics implemented, leading to errors.
  • 'breakpoint()' not working when 'dataloader_num_workers' is the default (num_processes) or > 0. When debugging, 'dataloader_num_workers' is now forced to 0.
  • If the user returns a single tensor instead of a dictionary in 'validation_step', it is now assumed to be the loss value (see the sketch after this list).
  • 'epoch' not correctly tracked.
  • Added a RuntimeError to notify the user to align 'validation_step' with declared metrics.
  • When appending metric tensors into a list, we now verify that the last tensor added has dimensions compatible with the current one, so evaluation no longer fails at the very end.
  • Added support to run on CPU (this might break again, be careful).
  • When calling a user-defined nn.Module method such as 'generate' under DDP (and possibly FSDP as well), an error was raised because DistributedDataParallel has no attribute called 'generate'. This does not happen with DeepSpeedEngine, since it knows which functions are intended for inference and which for training.
  • Gradient normalization (type 2) is now calculated without creating an intermediate list of tensors that could add memory overhead.
  • 'grad_norm' not being reported.
  • When using DeepSpeed, floating-point elements in the batch required an autocast handled by the user (since inputs to the model need to be in half precision). This is now handled in the background so the user does not have to worry about it.
  • Tensor Cores are now enabled only if the system supports them.
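
Below is a minimal sketch of the 'validation_step' contract after these fixes. Only 'validation_step' and the dictionary/single-tensor behavior come from these notes; the import path, constructor shape, and metric names are illustrative assumptions.

import torch.nn.functional as F
from accmt import AcceleratorModule  # import path assumed

class MyModule(AcceleratorModule):
    def __init__(self, model):
        self.model = model  # constructor shape is an assumption

    def validation_step(self, batch):
        x, y = batch
        logits = self.model(x)
        loss = F.cross_entropy(logits, y)

        # Option A: return a dictionary whose keys align with the declared
        # metrics (a RuntimeError is raised if they do not align).
        # return {"loss": loss, "accuracy": (logits.argmax(-1) == y).float().mean()}

        # Option B: return a single tensor, which is now assumed to be the loss.
        return loss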

What's new in Trainer:

  • New argument 'batch_device_placement' to control whether the batch is placed on CPU or GPU (see the sketch after this list).
  • New argument 'prepare_batch' to handle specific autocast cases for the batch (DeepSpeed).
  • New argument 'safe_steps' to retry a batch one time if it fails.
  • New argument 'destroy_after_training' to destroy the process group at the end of the 'fit' function.
  • New argument 'enable_prepare_logging' to enable logging during model preparation. It handles extreme scenarios like DeepSpeed, which adds many logging steps when preparing the model. It defaults to False, so users will no longer see DeepSpeed messages by default.
  • New argument 'multiple_checkpoints' to enable saving multiple checkpoints.
  • New argument 'max_checkpoints' to set the maximum number of checkpoints to save on disk.
  • New argument 'gradient_checkpointing' to enable gradient checkpointing if implemented in the nn.Module (as in the transformers library).
  • New argument 'compile_kwargs' to add additional customization for torch.compile.
  • New argument 'safe_mode'. Forward passes of the model will run through the wrapper instead of skipping it (the old behavior).
  • 'log_with' now receives a string value (like 'mlflow') instead of a class.
  • NOTE: Only MLflow is supported as a tracker for now. Future updates will implement all trackers.
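
Below is a minimal sketch of how the new Trainer arguments might be passed together. Only the argument names come from this release; the values, the import path, and the rest of the Trainer configuration are illustrative assumptions.

from accmt import Trainer  # import path assumed

trainer = Trainer(
    # ... your existing Trainer configuration ...
    batch_device_placement=True,      # place the batch on GPU (False keeps it on CPU)
    prepare_batch=True,               # autocast floating-point batch elements (DeepSpeed)
    safe_steps=True,                  # retry a failing batch one time
    destroy_after_training=True,      # destroy the process group at the end of 'fit'
    enable_prepare_logging=False,     # default: hide DeepSpeed preparation messages
    multiple_checkpoints=True,        # keep more than one checkpoint
    max_checkpoints=3,                # cap the number of checkpoints saved on disk
    gradient_checkpointing=True,      # only if the nn.Module implements it (e.g. transformers)
    compile_kwargs={"mode": "reduce-overhead"},  # extra options for torch.compile
    safe_mode=True,                   # forward passes run through the wrapper
    log_with="mlflow",                # tracker name as a string (MLflow only for now)
)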

What's new in AcceleratorModule:

  • New 'log' function that logs a dictionary of key-value pairs to the tracker, taking into account the 'log_every' argument in Trainer. Use 'log_' to bypass that check (see the sketch after this list).
  • New 'freeze' helper function to freeze a module (requires_grad=False).
  • New 'unfreeze' helper function to unfreeze a module (requires_grad=True).
  • New 'pad' helper function to pad tensors.
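
Below is a minimal sketch of how these helpers might be used inside an AcceleratorModule. Only the names 'log', 'log_', 'freeze', 'unfreeze', and 'pad' come from these notes; their exact signatures, the import path, the 'training_step' method, and the module layout are assumptions.

from accmt import AcceleratorModule  # import path assumed

class MyModule(AcceleratorModule):
    def __init__(self, model):
        self.model = model
        self.freeze(self.model.encoder)   # sets requires_grad=False on the encoder
        self.unfreeze(self.model.head)    # sets requires_grad=True on the head

    def training_step(self, batch):
        input_ids, labels = batch
        loss = self.model(input_ids, labels=labels).loss
        # Logs to the tracker, respecting Trainer's 'log_every' argument;
        # 'log_' would log on every call regardless of 'log_every'.
        self.log({"train_loss": loss.item()})
        return loss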

What's new in metrics:

  • Tensors are converted from half precision to float32 precision.
  • New 'MetricParallel' module. This has the same implementation as 'Metric', but it uses all processes to run evaluation, which means tensors are not gathered to the main process. When computing the final value, a communication across all processes averages the metric values (the general idea is sketched below). This is useful when calculating metrics in a single process is too slow and the computation can run in parallel.
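
The final reduction can be pictured like this. This shows only the general idea of averaging metric values across processes, not accmt's actual MetricParallel code.

import torch
import torch.distributed as dist

def average_across_processes(local_value: torch.Tensor) -> torch.Tensor:
    # Each process computes its own metric value on its shard of the data;
    # an all-reduce then averages those values instead of gathering every
    # metric tensor to the main process.
    value = local_value.clone()
    dist.all_reduce(value, op=dist.ReduceOp.SUM)
    value /= dist.get_world_size()
    return value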

What's new in Monitor:

  • Added an extra argument 'checkpoint' to log to the tracker whenever a checkpoint is made.

[NEW] Hyper Parameter Search

This is a new implementation to run hyperparameter search using Optuna as the backend. Everything is handled by HyperParameterSearch, which can be imported directly from 'accmt'.
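
For context, this is what a minimal search looks like in plain Optuna, the backend used here. How HyperParameterSearch wraps this is not described in these notes, so the code below is not accmt code, and 'run_training' is a hypothetical helper.

import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])
    # Train with the suggested values and return the metric to optimize.
    return run_training(lr=lr, batch_size=batch_size)  # hypothetical helper

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)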

[BETA] Asynchronous Evaluation

Sometimes evaluation and model saving can be a big bottleneck. If you assign at least 10% of your resources to evaluation and 90% to training, you could see 30%+ speedups, since the training process no longer waits for evaluation to finish; instead, it dispatches evaluations to an evaluation group that waits for the training group to send a request. One way to do this would be to write models to disk and have another process waiting for models to appear. Instead, we take a different approach: we move the model parameters directly to the evaluation group without writing to disk. This works because there is a CPU tunnel between the training and evaluation groups through which data can be sent (SharedMemory, or SHM).
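
The CPU tunnel can be pictured with Python's standard shared memory. This is only the general concept, not accmt's actual implementation; the buffer name and sizes are made up.

import numpy as np
from multiprocessing import shared_memory

# Training side: copy parameters into a named CPU buffer instead of writing to disk.
params = np.random.rand(1024).astype(np.float32)  # stand-in for model weights
shm = shared_memory.SharedMemory(create=True, size=params.nbytes, name="weights")
np.ndarray(params.shape, dtype=params.dtype, buffer=shm.buf)[:] = params

# Evaluation side (a different process): attach to the same buffer by name and
# read the weights directly, with no disk I/O involved.
shm_eval = shared_memory.SharedMemory(name="weights")
weights = np.ndarray(params.shape, dtype=params.dtype, buffer=shm_eval.buf).copy()

shm_eval.close()
shm.close()
shm.unlink()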

Why is this a big deal? Nothing in your code changes. You only change your launch command:
Before

accmt launch -n=0,1,2,3 ...

After

accmt alaunch -n=0,1,2 -e=3 ...

Here, 'alaunch' (or 'async-launch') handles everything for you; you only need to specify the GPUs for training with '-n' (or '-N') and the GPUs for evaluation with '-e'.

WARNING: This is a BETA feature and is not meant to be used as of yet.

When v2.0?

There are some tests I need to implement to keep everything stable and avoid suddenly breaking something. There are also many bugs and errors to handle, as well as all the trackers to implement.

v2.0 is meant to be the first stable version and official release that engineers or researchers could use for their work.

I am also planning to implement multi-node launches. Since we use Accelerate as the backend, this library is quite scalable.