Conversation
@japols japols commented Oct 23, 2025

Description

Remove the separate ProcessorChunk class and flatten all layers directly into the BaseProcessor.
Chunking is now handled dynamically at runtime by grouping layers into checkpointed segments.

What problem does this change solve?

Previously, the Processor class held a list of ProcessorChunks, each with its own ModuleList of layers, so the checkpointed layer groupings were tied to the chunking configuration saved in the model checkpoint.
When resuming training with a different num_chunks, the restored module structure no longer matched the saved one, causing checkpoint mismatches. Now the Processor class holds a single flat list of all layers (Blocks), and chunking is handled dynamically.
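The runtime grouping can be illustrated with a minimal sketch (plain Python, no torch; the helper name `chunk_layers` is hypothetical, not the actual implementation). The point is that the checkpoint only ever stores the flat layer list, while the contiguous groups are recomputed from num_chunks on every run:

```python
def chunk_layers(layers, num_chunks):
    """Split a flat list of layers into num_chunks contiguous groups.

    Only the flat list is part of the saved module structure; the
    grouping is recomputed at runtime, so num_chunks can change
    between runs without invalidating the checkpoint.
    """
    if num_chunks < 1:
        raise ValueError("num_chunks must be >= 1")
    n = len(layers)
    num_chunks = min(num_chunks, n)  # never more chunks than layers
    base, rem = divmod(n, num_chunks)
    groups, start = [], 0
    for i in range(num_chunks):
        # spread the remainder over the first `rem` groups
        size = base + (1 if i < rem else 0)
        groups.append(layers[start:start + size])
        start += size
    return groups
```

At forward time each such group would presumably be run through an activation-checkpointing wrapper such as torch.utils.checkpoint (or executed directly when checkpointing is disabled), so only the chunk boundaries, never the module tree, depend on num_chunks.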

What issue or task does this change relate to?

Additional notes

Tested with all models, i.e. GT, Transformer, GNN, PointWiseMLP

As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/

By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.


📚 Documentation preview 📚: https://anemoi-training--629.org.readthedocs.build/en/629/


📚 Documentation preview 📚: https://anemoi-graphs--629.org.readthedocs.build/en/629/


📚 Documentation preview 📚: https://anemoi-models--629.org.readthedocs.build/en/629/

@japols japols self-assigned this Oct 23, 2025
@japols japols added bug Something isn't working breaking change ATS Approval Needed Approval needed by ATS labels Oct 23, 2025
@github-project-automation github-project-automation bot moved this to To be triaged in Anemoi-dev Oct 23, 2025
@japols japols changed the title from fix/refactor(models)!: processor chunking to fix(models)!: processor chunking Oct 23, 2025
@anaprietonem anaprietonem added ATS Approved Approved by ATS and removed ATS Approval Needed Approval needed by ATS labels Oct 29, 2025
@japols japols requested a review from anaprietonem November 5, 2025 14:06
@japols japols changed the title from fix(models)!: processor chunking to fix(models): processor chunking Nov 6, 2025
@anaprietonem
Contributor

@japols it could be worth updating the docs to reflect this change - https://github.com/ecmwf/anemoi-core/blob/main/models/docs/introduction/overview.rst - I guess the section regarding the chunking of the Processors?


@jakob-schloer jakob-schloer left a comment


Nice work! The processor is much cleaner to read now.

Could you maybe update the documentation to briefly explain what chunks are in the processor and how they relate to layers?

@github-project-automation github-project-automation bot moved this from To be triaged to Under Review in Anemoi-dev Nov 14, 2025
jakob-schloer
jakob-schloer previously approved these changes Nov 17, 2025

@jakob-schloer jakob-schloer left a comment


Thanks for updating! Looks good to me now.

@anaprietonem anaprietonem self-requested a review November 18, 2025 15:05
anaprietonem
anaprietonem previously approved these changes Nov 18, 2025

@anaprietonem anaprietonem left a comment


LGTM. I think extending the comment in the transfer learning function to explain the need for applying the sync would be good. I triggered the integration tests; if those pass, I think it's good to be merged: https://github.com/ecmwf/anemoi-core/actions/workflows/integration-tests-hpc.yml

@anaprietonem anaprietonem merged commit 06e5533 into main Nov 19, 2025
14 checks passed
@anaprietonem anaprietonem deleted the fix/processor-chunking branch November 19, 2025 12:00
@github-project-automation github-project-automation bot moved this from Under Review to Done in Anemoi-dev Nov 19, 2025
@DeployDuck DeployDuck mentioned this pull request Nov 19, 2025
OpheliaMiralles pushed a commit that referenced this pull request Nov 19, 2025
Co-authored-by: Simon Lang <[email protected]>
Co-authored-by: gabrieloks <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Ana Prieto Nemesio <[email protected]>
Co-authored-by: Jakob Schloer <[email protected]>

Labels

ATS Approved Approved by ATS bug Something isn't working models training

Projects

Status: Done

9 participants