fix(models): processor chunking #629
Conversation
@japols could be worth updating the docs to reflect this change - https://github.com/ecmwf/anemoi-core/blob/main/models/docs/introduction/overview.rst - I guess the section regarding the chunking of the Processors?
jakob-schloer left a comment:
Nice work! The processor is much cleaner to read now.
Could you maybe update the documentation to briefly explain what chunks are in the processor and how they relate to layers?
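(For readers skimming the thread, a rough sketch of that relationship, with illustrative numbers and plain Python rather than the library's actual code: a chunk is simply a consecutive group of processor layers that is checkpointed together, so `num_chunks` trades memory for recomputation without changing the layers themselves.)

```python
# Illustrative only (not the anemoi-models API): with 16 layers and
# num_chunks = 4, the forward pass checkpoints 4 segments of 4 consecutive layers.
num_layers, num_chunks = 16, 4
chunk_size = -(-num_layers // num_chunks)  # ceiling division
segments = [list(range(start, min(start + chunk_size, num_layers)))
            for start in range(0, num_layers, chunk_size)]
print(segments)
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
```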
jakob-schloer left a comment:
Thanks for updating! Looks good to me now.
anaprietonem left a comment:
LGTM. I think extending the comment in the transfer learning function to explain why the sync is needed would be good. I triggered the integration tests; if those pass, I think this is good to be merged: https://github.com/ecmwf/anemoi-core/actions/workflows/integration-tests-hpc.yml
Description
Remove the separate ProcessorChunk class and flatten all layers directly into the BaseProcessor.
Chunking is now handled dynamically at runtime by grouping layers into checkpointed segments.
What problem does this change solve?
Previously, the Processor class held a list of ProcessorChunks, each with its own ModuleList of layers, meaning that checkpointed layer groupings were tied to the chunking configuration saved in the model checkpoint.
When resuming training with a different num_chunks, the restored module structure no longer matched the saved one, causing checkpoint mismatches. Now the Processor class holds a single flat list of all layers (Blocks), and chunking is handled dynamically.
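As a minimal sketch of the idea (hypothetical class and parameter names in plain PyTorch, not the actual anemoi-models code): the processor owns one flat `nn.ModuleList`, and `num_chunks` only affects how the forward pass groups layers into checkpointed segments, so the `state_dict` keys never depend on the chunking configuration.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class FlatProcessor(nn.Module):
    """Sketch: flat layer list, chunking decided at runtime."""

    def __init__(self, num_layers: int, hidden_dim: int, num_chunks: int = 1) -> None:
        super().__init__()
        # One flat ModuleList: parameter keys are proc.0.*, proc.1.*, ...
        # regardless of num_chunks, so checkpoints resume cleanly.
        self.proc = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers))
        self.num_chunks = num_chunks

    def _run_segment(self, x: torch.Tensor, start: int, end: int) -> torch.Tensor:
        for layer in self.proc[start:end]:
            x = layer(x)
        return x

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Group layers into num_chunks checkpointed segments at runtime.
        chunk_size = -(-len(self.proc) // self.num_chunks)  # ceiling division
        for start in range(0, len(self.proc), chunk_size):
            end = min(start + chunk_size, len(self.proc))
            x = checkpoint(self._run_segment, x, start, end, use_reentrant=False)
        return x
```

Under this layout, a run saved with one `num_chunks` value can be resumed with another, since the parameter names are identical either way.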
What issue or task does this change relate to?
Additional notes
Tested with all models, i.e. GT, Transformer, GNN, PointWiseMLP
As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines, please refer to https://anemoi.readthedocs.io/en/latest/
By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.
📚 Documentation preview 📚: https://anemoi-training--629.org.readthedocs.build/en/629/
📚 Documentation preview 📚: https://anemoi-graphs--629.org.readthedocs.build/en/629/
📚 Documentation preview 📚: https://anemoi-models--629.org.readthedocs.build/en/629/