fix(models): processor chunking #629
Conversation
@japols could be worth updating the docs to reflect this change - https://github.com/ecmwf/anemoi-core/blob/main/models/docs/introduction/overview.rst - I guess the section regarding the chunking of the Processors?
jakob-schloer left a comment:
Nice work! The processor is much cleaner to read now.
Could you maybe update the documentation to briefly explain what chunks are in the processor and how they relate to layers?
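(For readers skimming the thread, a rough sketch of that relationship, with illustrative numbers and plain Python rather than the library's actual code: a chunk is simply a consecutive group of processor layers that is checkpointed together, so `num_chunks` trades memory for recomputation without changing the layers themselves.)

```python
# Illustrative only (not the anemoi-models API): with 16 layers and
# num_chunks = 4, the forward pass checkpoints 4 segments of 4 consecutive layers.
num_layers, num_chunks = 16, 4
chunk_size = -(-num_layers // num_chunks)  # ceiling division
segments = [list(range(start, min(start + chunk_size, num_layers)))
            for start in range(0, num_layers, chunk_size)]
print(segments)
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
```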
jakob-schloer left a comment:
Thanks for updating! Looks good to me now.
anaprietonem left a comment:
LGTM. I think extending the comment in the transfer learning function to explain why the sync is needed would be good. I triggered the integration tests; if those pass, I think this is good to be merged: https://github.com/ecmwf/anemoi-core/actions/workflows/integration-tests-hpc.yml
Description
Remove the separate ProcessorChunk class and flatten all layers directly into the BaseProcessor.
Chunking is now handled dynamically at runtime by grouping layers into checkpointed segments.
What problem does this change solve?
Previously, the Processor class held a list of ProcessorChunks, each with its own ModuleList of layers, meaning that checkpointed layer groupings were tied to the chunking configuration saved in the model checkpoint.
When resuming training with a different num_chunks, the restored module structure no longer matched the saved one, causing checkpoint mismatches. Now the Processor class holds a single flat list of all layers (Blocks), and chunking is handled dynamically.
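As a minimal sketch of the idea (hypothetical class and parameter names in plain PyTorch, not the actual anemoi-models code): the processor owns one flat `nn.ModuleList`, and `num_chunks` only affects how the forward pass groups layers into checkpointed segments, so the `state_dict` keys never depend on the chunking configuration.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class FlatProcessor(nn.Module):
    """Sketch: flat layer list, chunking decided at runtime."""

    def __init__(self, num_layers: int, hidden_dim: int, num_chunks: int = 1) -> None:
        super().__init__()
        # One flat ModuleList: parameter keys are proc.0.*, proc.1.*, ...
        # regardless of num_chunks, so checkpoints resume cleanly.
        self.proc = nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers))
        self.num_chunks = num_chunks

    def _run_segment(self, x: torch.Tensor, start: int, end: int) -> torch.Tensor:
        for layer in self.proc[start:end]:
            x = layer(x)
        return x

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Group layers into num_chunks checkpointed segments at runtime.
        chunk_size = -(-len(self.proc) // self.num_chunks)  # ceiling division
        for start in range(0, len(self.proc), chunk_size):
            end = min(start + chunk_size, len(self.proc))
            x = checkpoint(self._run_segment, x, start, end, use_reentrant=False)
        return x
```

Under this layout, a run saved with one `num_chunks` value can be resumed with another, since the parameter names are identical either way.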
What issue or task does this change relate to?
Additional notes
Tested with all models, i.e. GT, Transformer, GNN, PointWiseMLP
As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines, please refer to https://anemoi.readthedocs.io/en/latest/
By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.
📚 Documentation preview 📚: https://anemoi-training--629.org.readthedocs.build/en/629/
📚 Documentation preview 📚: https://anemoi-graphs--629.org.readthedocs.build/en/629/
📚 Documentation preview 📚: https://anemoi-models--629.org.readthedocs.build/en/629/