fix(training)!: Refactor configuration by introducing system schema with hardware, paths, and files subschemas #598
for more information, see https://pre-commit.ci
Pull Request Overview
This PR refactors the configuration structure by introducing a new top-level system schema that groups hardware, paths, and files subschemas. Previously, paths and files were defined under hardware, which was confusing since they describe filesystem layout rather than compute resources. The new structure separates hardware configuration from I/O configuration, making the hierarchy more intuitive.
- Moved `paths` and `files` from `hardware` to new `system` schema
- Created separate `hardware` subschema specifically for compute resources
- Updated all references throughout codebase from `config.hardware.*` to `config.system.*`
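As a rough illustration of the move described in this overview, the sketch below rewrites a legacy config dictionary so that everything formerly under `hardware` sits under a new top-level `system` key. It is not code from the PR: the helper name `migrate_hardware_to_system` and the field names (`num_gpus_per_node`, `output`, `dataset`) are hypothetical, and the sketch assumes the layout given in this overview (`system.paths`, `system.files`, `system.hardware`), which may differ from the naming that was finally merged.

```python
from copy import deepcopy
from typing import Any


def migrate_hardware_to_system(legacy: dict[str, Any]) -> dict[str, Any]:
    """Move a legacy ``hardware`` block under a new top-level ``system`` key.

    Assumed layout (illustrative only): ``hardware.paths`` and ``hardware.files``
    become ``system.paths`` and ``system.files``; whatever else was under
    ``hardware`` is treated as a compute setting and lands in ``system.hardware``.
    """
    config = deepcopy(legacy)  # leave the caller's dictionary untouched
    hardware = config.pop("hardware", {})
    config["system"] = {
        "paths": hardware.pop("paths", {}),
        "files": hardware.pop("files", {}),
        "hardware": hardware,  # what remains describes compute resources
    }
    return config


# All keys below are hypothetical and only serve the example.
legacy_config = {
    "hardware": {
        "num_gpus_per_node": 4,
        "paths": {"output": "/scratch/run"},
        "files": {"dataset": "dataset.zarr"},
    },
    "training": {"max_epochs": 10},
}

new_config = migrate_hardware_to_system(legacy_config)
print(new_config["system"]["hardware"])
print(new_config["system"]["paths"])
```

Code that previously read `config["hardware"]["paths"]` would, under this assumed layout, read `config["system"]["paths"]` instead.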
Reviewed Changes
Copilot reviewed 48 out of 48 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| training/src/anemoi/training/schemas/system.py | Introduces new SystemSchema with hardware, files, and paths subschemas |
| training/src/anemoi/training/schemas/base_schema.py | Updates schema to use SystemSchema instead of HardwareSchema |
| training/src/anemoi/training/train/train.py | Updates all config references to use new system schema structure |
| training/src/anemoi/training/config/hardware/* | Restructures hardware config files to separate compute from I/O settings |
| tests and config files | Updates all configuration references to use new system schema structure |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
I like this, but not sure if 'storage' captures what used to be described under 'paths'. Maybe 'directories' or... ?
dietervdb-meteo left a comment
As previously, LGTM
## Description

This PR removes unused code:

- ensdatamodule.py reintroduced in #598
- some references introduced in #651 that are not used inside the datamodule
- others: `n_samples_per_epoch_total` and `n_samples_per_epoch_per_worker`.

***As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/***

By opening this pull request, I affirm that all authors agree to the [Contributor License Agreement.](https://github.com/ecmwf/codex/blob/main/Legal/contributor_license_agreement.md)
Are changes from this PR reflected in the user documentation or is anybody working on it? I am still seeing the old "hardware" section referenced in the doc.

Very good point... I overlooked that when reviewing, sorry. So used to go and look for config changes in the default configs inside the code rather than the docs...

@frazane, can you make an issue?

I have already opened an issue (#703), and started to document some of these
🤖 Automated Release PR

This PR was created by `release-please` to prepare the next release. Once merged:

1. A new version tag will be created
2. A GitHub release will be published
3. The changelog will be updated

Changes to be included in the next release:

---

<details><summary>training: 0.8.0</summary>

## [0.8.0](training-0.7.0...training-0.8.0) (2025-12-05)

### ⚠ BREAKING CHANGES

* **training:** Refactor configuration by introducing system schema with hardware, paths, and files subschemas ([#598](#598))
* cond layer norm ([#658](#658))

### Features

* Activate minmium plotting for integration tests ([#669](#669)) ([84e5882](84e5882))
* Compile transformer gnn ([#181](#181)) ([24d162c](24d162c))
* **models:** Add configurable residual connections in enc-proc-dec ([#670](#670)) ([aeaf00b](aeaf00b))
* **models:** Triton GraphTransformer ([#631](#631)) ([b40b6c6](b40b6c6))
* Time_interpolator_callbacks ([#677](#677)) ([c2b8179](c2b8179))
* **training:** Performance docs ([#696](#696)) ([9574ff1](9574ff1))
* **training:** Refactor optimizer creation to support custom and torch optimizers ([#588](#588)) ([cd777fb](cd777fb))

### Bug Fixes

* Add package config path to Hydra search path in plugin ([#656](#656)) ([ca6f732](ca6f732))
* Cond layer norm ([#658](#658)) ([7315e3a](7315e3a))
* **logger:** Bugs in AzureMLFlowLogger from [#646](#646) ([#685](#685)) ([14c0235](14c0235))
* **models:** Processor chunking ([#629](#629)) ([06e5533](06e5533))
* Pass weights_only for pytorch lightning >= 2.6.0 ([#713](#713)) ([7446942](7446942))
* Ptl 2.6.0 explicitly pass weights_only=False ([#710](#710)) ([e18824c](e18824c))
* RolloutEval sharding ([#714](#714)) ([0fbc071](0fbc071))
* Slurm system config ([#702](#702)) ([cce8763](cce8763))
* Target docs ([#704](#704)) ([200101e](200101e))
* **training,tasks:** Abstract RolloutForecasting task ([#682](#682)) ([f14fc32](f14fc32))
* **training:** CombinedLoss schema validation ([#719](#719)) ([dba4268](dba4268))
* **training:** Refactor configuration by introducing system schema with hardware, paths, and files subschemas ([#598](#598)) ([da02fe7](da02fe7))
* **training:** Remove unused code ([#706](#706)) ([f49813a](f49813a))

</details>

<details><summary>graphs: 0.8.0</summary>

## [0.8.0](graphs-0.7.2...graphs-0.8.0) (2025-12-05)

### ⚠ BREAKING CHANGES

* **edges:** Edge feature revision #643 ([#727](#727))

### Features

* **edges:** Edge feature revision ([#643](#643)) ([720f4d8](720f4d8))
* **edges:** Edge feature revision [#643](#643) ([#727](#727)) ([d1372cf](d1372cf))
* **graphs:** Support for multi-scale connections with HEALPix hidden grid ([#691](#691)) ([1450787](1450787))

### Bug Fixes

* Revert "feat(edges): Edge feature revision" ([#726](#726)) ([db1f940](db1f940))
* Sparse export ([#686](#686)) ([969b787](969b787))
* Target docs ([#704](#704)) ([200101e](200101e))

</details>

<details><summary>models: 0.11.0</summary>

## [0.11.0](models-0.10.0...models-0.11.0) (2025-12-05)

### ⚠ BREAKING CHANGES

* **training:** Refactor configuration by introducing system schema with hardware, paths, and files subschemas ([#598](#598))
* cond layer norm ([#658](#658))

### Features

* Compile transformer gnn ([#181](#181)) ([24d162c](24d162c))
* **models:** Add configurable residual connections in enc-proc-dec ([#670](#670)) ([aeaf00b](aeaf00b))
* **models:** Multibackend all_to_all wrapper ([#95](#95)) ([6819be1](6819be1))
* **models:** Triton GraphTransformer ([#631](#631)) ([b40b6c6](b40b6c6))

### Bug Fixes

* Compile pickle error ([#708](#708)) ([f4fc4ab](f4fc4ab))
* Cond layer norm ([#658](#658)) ([7315e3a](7315e3a))
* **models:** Processor chunking ([#629](#629)) ([06e5533](06e5533))
* Predict_step shard shapes ([#692](#692)) ([be9ff8b](be9ff8b))
* Remove import of anemoi training in compile ([#705](#705)) ([f7d5ae4](f7d5ae4))
* Small pytorch boxcox inefficiency ([#683](#683)) ([66b40e0](66b40e0))
* **training:** Refactor configuration by introducing system schema with hardware, paths, and files subschemas ([#598](#598)) ([da02fe7](da02fe7))

</details>

---

> [!IMPORTANT]
> Please do not change the PR title, manifest file, or any other automatically generated content in this PR unless you understand the implications. Changes here can break the release process.
>
> ⚠️ Merging this PR will:
> - Create a new release
> - Trigger deployment pipelines
> - Update package versions

**Before merging:**

- Ensure all tests pass
- Review the changelog carefully
- Get required approvals

[Release-please documentation](https://github.com/googleapis/release-please)
…ith hardware, paths, and files subschemas (#598)

# Description

This PR reorganizes the configuration structure by introducing a new top-level schema called system, which groups the subschemas hardware, storage, and files (see [issue #513](#513)).

Previously, paths and files were defined under hardware, which was confusing since they describe filesystem layout rather than compute resources. It also wasn't clear whether paths referred to directories or files. paths has now been renamed to storage, clarifying its role as the definition of the directory structure for inputs, outputs, and logs.

```
system/
├── hardware.yaml
├── input.yaml
└── output.yaml
```

The PR also isolates the concatenation logic for paths in the Pydantic schema, so full paths for outputs, logs, etc. no longer need to be written out in each field. Writing them out was brittle and used to happen both in code throughout the framework and inside the nested configuration files. The concatenation now happens in only one place.

----
📚 Documentation preview 📚: https://anemoi-training--598.org.readthedocs.build/en/598/

----
📚 Documentation preview 📚: https://anemoi-graphs--598.org.readthedocs.build/en/598/

----
📚 Documentation preview 📚: https://anemoi-models--598.org.readthedocs.build/en/598/

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: anaprietonem <ana.prietonemesio@ecmwf.int>
Co-authored-by: Ana Prieto Nemesio <91897203+anaprietonem@users.noreply.github.com>
Co-authored-by: Dieter Van den Bleeken <dieter.vandenbleeken@meteo.be>
Co-authored-by: Mario Santa Cruz <48736305+JPXKQX@users.noreply.github.com>
Description
This PR reorganizes the configuration structure by introducing a new top-level schema called system, which groups the subschemas hardware, storage, and files (see issue #513).
Previously, paths and files were defined under hardware, which was confusing since they describe filesystem layout rather than compute resources.
It also wasn't clear whether paths referred to directories or files. paths has now been renamed to storage, clarifying its role as the definition of the directory structure for inputs, outputs, and logs.
The PR also isolates the concatenation logic for paths in the Pydantic schema, so full paths for outputs, logs, etc. no longer need to be written out in each field. Writing them out was brittle and used to happen both in code throughout the framework and inside the nested configuration files. The concatenation now happens in only one place.
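To make the idea of resolving paths in one place concrete, here is a minimal sketch. The PR does this inside its Pydantic schema; the sketch swaps in a standard-library `dataclass` with `__post_init__` so it runs without extra dependencies. The class name `OutputPaths` and the sub-directory names (`checkpoint`, `plots`, `logs`) are hypothetical and not taken from the PR.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class OutputPaths:
    """Hypothetical output layout: the user sets `root`, the rest is derived."""

    root: Path
    # Relative sub-directories; these names are illustrative assumptions.
    checkpoints: Path = Path("checkpoint")
    plots: Path = Path("plots")
    logs: Path = Path("logs")

    def __post_init__(self) -> None:
        # The single place where the root is joined onto each sub-directory.
        self.root = Path(self.root)
        for name in ("checkpoints", "plots", "logs"):
            sub = Path(getattr(self, name))
            # pathlib keeps an absolute `sub` unchanged, so explicit overrides survive.
            setattr(self, name, self.root / sub)


paths = OutputPaths(root="/scratch/experiment")
print(paths.checkpoints)
print(paths.logs)
```

With this shape, a configuration file only needs to state the root and, optionally, relative sub-directory names; consumers read fully resolved paths and never concatenate strings themselves.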
📚 Documentation preview 📚: https://anemoi-training--598.org.readthedocs.build/en/598/
📚 Documentation preview 📚: https://anemoi-graphs--598.org.readthedocs.build/en/598/
📚 Documentation preview 📚: https://anemoi-models--598.org.readthedocs.build/en/598/