During 2025, I plan to work on the fine-tuning of Anemoi models. This issue should serve as a roadmap and info-dump, as well as a discussion ground for different users.
Current State
The current state of fine-tuning in Anemoi revolves around re-training the entire model.
While this makes sense in pre-training the dynamics with single-step rollout and then finalizing the training with multi-step rollout, it is disadvantageous when we want to fine-tune on data with different distributions. This is the naive fine-tuning approach:
- Train model $M$ with weights $W$ on dataset (or task) $A$
- Train model $M'$, modifying the weights $W$ to $W'$ with a lower learning rate (and other tricks to make it work)
The main disadvantage is that the extensive pre-training often captures nuances that are "forgotten" when the whole model is retrained. This has been observed in practice and tends to lead to finicky training set-ups.
Implementation details
Warm-starting and forking of a training run are currently controlled by multiple config entries in training:
```yaml
# resume or fork a training from a checkpoint last.ckpt or specified in hardware.files.warm_start
run_id: null
fork_run_id: null
transfer_learning: False # activate to perform transfer learning
load_weights_only: False # only load model weights, do not restore optimiser states etc.
```
This has implications for traceability and the automation of training pipelines, but it also has drawbacks: forking a run has to be specified by a run_id that is expected to exist on the same system, which discourages collaboration and the sharing of checkpoints. These were design choices made early in the project, before we anticipated the breadth of adoption.
Proposed Solution
To address these limitations, I propose implementing a comprehensive fine-tuning capability in Anemoi that includes:
1. Enhanced Model Freezing
Building on PR #61, we need to extend the submodule freezing functionality to enable more granular control over which parts of the model are trainable during fine-tuning (see the sketch after the list below). This will allow:
- Freezing arbitrary submodules at multiple levels of the model hierarchy
- Partial freezing of specific parameter groups
- Possibly even differential learning rates for different model components
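As a rough illustration of the direction (not the final Anemoi API), here is a minimal sketch of pattern-based submodule freezing in plain PyTorch; the function name and the example module paths are assumptions.

```python
import re

import torch.nn as nn


def freeze_by_patterns(model: nn.Module, patterns: list[str]) -> None:
    """Freeze every parameter whose qualified name matches one of the regex patterns."""
    compiled = [re.compile(p) for p in patterns]
    for name, param in model.named_parameters():
        if any(p.match(name) for p in compiled):
            param.requires_grad = False


# Example (hypothetical module paths): freeze the first two encoder blocks and the
# whole processor, leaving the decoder and everything else trainable.
# freeze_by_patterns(model, [r"encoder\.block\.[01]\.", r"processor\."])
```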
2. Integration with PEFT Library
Rather than implementing Parameter-Efficient Fine-Tuning (PEFT) methods from scratch, I propose integrating with Hugging Face's PEFT library (https://huggingface.co/docs/peft/). This will provide access to multiple state-of-the-art fine-tuning methods including:
- LoRA (Low-Rank Adaptation): A technique that significantly reduces the number of trainable parameters by injecting trainable low-rank matrices into each layer of the model.
LoRA parameterizes the weight updates as $\Delta W = BA$, where $B \in \mathbb{R}^{d\times r}$ and $A \in \mathbb{R}^{r\times k}$ are low-rank matrices with $r \ll \min(d,k)$. Instead of fine-tuning all parameters of $W \in \mathbb{R}^{d\times k}$, we only need to train $r(d+k)$ parameters, resulting in significant memory savings while maintaining model performance. For a given matrix $W$, the output is computed as $h = Wx + BAx$, where only $B$ and $A$ are trained while $W$ remains frozen.
- QLoRA: Quantized LoRA for even more memory-efficient fine-tuning
- Prefix Tuning: Optimizes a small continuous task-specific vector (prefix) while keeping the model frozen
- Prompt Tuning: Fine-tunes continuous prompts prepended to inputs
- AdaLoRA: Adaptive budget allocation across weight matrices based on importance
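To give a sense of scale: with $d = k = 1024$ and $r = 8$, LoRA trains $r(d+k) = 16{,}384$ parameters instead of the $dk \approx 1.05$M parameters of the full matrix, i.e. roughly 1.6%. Below is a minimal, hedged sketch of injecting LoRA adapters with the PEFT library into a generic PyTorch module; the toy model and the target module names are placeholders, not the Anemoi graph model.

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

# Toy stand-in for the model to adapt (placeholder, not Anemoi's architecture).
base_model = nn.Sequential(
    nn.Linear(256, 512),
    nn.GELU(),
    nn.Linear(512, 256),
)

lora_config = LoraConfig(
    r=8,                        # low-rank dimension
    lora_alpha=16,              # scaling factor
    lora_dropout=0.1,
    target_modules=["0", "2"],  # qualified names of the Linear layers to adapt
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only the injected low-rank A/B matrices are trainable
```

Because the base weights stay frozen, the resulting adapters are small enough to be shared independently of the full checkpoint.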
3. Decoupling of Checkpoint loading and IDs
Checkpoints should be loadable independently of run IDs and of the system they were produced on, or at least a side-loading mechanism should exist.
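A minimal sketch of what side-loading could look like, assuming a plain path argument and a checkpoint that stores weights under a state_dict key (both assumptions, not the current API):

```python
import torch
import torch.nn as nn


def side_load_weights(model: nn.Module, checkpoint_path: str):
    """Load model weights from an arbitrary local path, ignoring run IDs and optimiser state."""
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    # Accept both {"state_dict": ...} checkpoints and bare state dicts (assumed layout).
    state_dict = checkpoint.get("state_dict", checkpoint)
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    return missing, unexpected
```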
4. Enhanced Configuration System
Expand the configuration system to support:
```yaml
training:
  fine_tuning:
    enabled: True
    strategy: "lora"  # Options: "full", "freeze", "lora", "qlora", "prompt", "prefix", etc.
    checkpoint:
      source: "s3://anemoi-models/global-10day/v1.2.3/checkpoint.pt"  # Remote sources supported
      local_cache: "~/.anemoi/cache/"
    peft:
      rank: 8  # For LoRA-based methods
      alpha: 16  # Scaling factor
      dropout: 0.1
    freeze:
      modules: ["encoder.block.0", "encoder.block.1"]  # Explicit module paths
      patterns: ["encoder.block.[2-11]", "processor.*"]  # Regex patterns supported
    optimizer:
      differential_lr: True
      lr_groups:
        - modules: ["decoder.*"]
          lr: 1e-4
        - modules: ["peft_layers.*"]
          lr: 5e-4
```

It would also be possible to implement an optional fine-tuning config. This will become more clear during implementation, I believe.
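As a hedged illustration of how the differential_lr / lr_groups entries could translate into standard PyTorch optimizer parameter groups (the config access and helper name are assumptions):

```python
import re

import torch
import torch.nn as nn


def build_param_groups(model: nn.Module, lr_groups: list[dict], default_lr: float = 1e-4):
    """Assign each trainable parameter to the first lr_group whose pattern matches its name."""
    groups = [{"params": [], "lr": group["lr"]} for group in lr_groups]
    default_group = {"params": [], "lr": default_lr}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        for group, cfg in zip(groups, lr_groups):
            if any(re.match(pattern, name) for pattern in cfg["modules"]):
                group["params"].append(param)
                break
        else:
            default_group["params"].append(param)
    return groups + [default_group]


# Hypothetical usage mirroring the config above:
# optimizer = torch.optim.AdamW(
#     build_param_groups(model, [{"modules": ["decoder.*"], "lr": 1e-4},
#                                {"modules": ["peft_layers.*"], "lr": 5e-4}])
# )
```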
5. Integration with Training Pipelines
Collaborate with colleagues working on training pipeline automation to ensure:
- Fine-tuning configurations can be versioned and tracked
- Automated experimentation can be conducted across fine-tuning hyperparameters
- Results from fine-tuning can be systematically compared and evaluated
6. Maintain full traceability of training settings
When implementing new configs and training pipelines, the different training stages should be reflected in the model's provenance metadata, so that each model remains fully traceable.
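For illustration, stage-level provenance could be as simple as appending one record per training stage to the checkpoint metadata; the keys and helper below are hypothetical, not an existing Anemoi structure.

```python
import datetime
from typing import Optional


def record_training_stage(metadata: dict, strategy: str, parent_checkpoint: Optional[str]) -> dict:
    """Append one provenance entry per training stage to the checkpoint metadata."""
    stages = metadata.setdefault("training_stages", [])
    stages.append(
        {
            "stage": len(stages) + 1,
            "strategy": strategy,                  # e.g. "pretrain", "rollout", "lora"
            "parent_checkpoint": parent_checkpoint,
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }
    )
    return metadata
```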
Implementation Plan
Phase 1: Foundation
- Enhance the existing model freezing functionality (building on PR feat: Model Freezing #61)
- Create initial integration with PEFT library for LoRA
Phase 2: Core Functionality
- Implement enhanced configuration system
- Complete PEFT integration with all major methods
- Develop comprehensive testing suite for fine-tuning capabilities
- Create documentation and examples
Phase 3: Advanced Features and Integration
- Integrate with training pipeline automation
- Optimize performance for large-scale fine-tuning
- Create benchmarking tools for fine-tuning approaches
Success Criteria
The fine-tuning capability will be considered successful when:
- Users can fine-tune Anemoi models with a single configuration change
- Memory usage during fine-tuning is reduced by at least 70% compared to full fine-tuning
- Fine-tuned models maintain or improve performance metrics compared to current approaches
- Checkpoint sharing and collaboration becomes seamless across different systems
- Documentation and examples make the system approachable for new users
Alternatives Considered
Custom Implementation of PEFT Methods
While implementing our own versions of PEFT methods would give us maximum control, it would require significant development and maintenance effort. The Hugging Face PEFT library is well-maintained, extensively tested, and continuously updated with new methods, making it a more sustainable choice.
Adapter-Based Approaches
Traditional adapter approaches insert new modules between existing layers. While effective, this can change the model architecture significantly. LoRA and similar methods preserve the original architecture while fine-tuning, which aligns better with our goals. (Although technically PEFT also uses adapters...)
Full Model Distillation
Knowledge distillation could be used to transfer knowledge from the pre-trained model to a task-specific model. However, this approach requires training a new model from scratch for each task, which is computationally expensive and doesn't leverage the efficiency gains of modern fine-tuning techniques.
Additional Context
This work on fine-tuning capabilities aligns with broader industry trends toward more efficient adaptation of large models. By implementing these capabilities in Anemoi, we'll enable users to:
- Adapt global models to regional domains with minimal computational resources
- Fine-tune on specific weather phenomena without degrading general performance
- Build ensembles of specialized models derived from a common base
- Collaborate more effectively by sharing and building upon each other's work