
fix(training)!: Refactor configuration by introducing system schema with hardware, paths, and files subschemas#598

Merged
anaprietonem merged 103 commits into main from hardware_config
Nov 26, 2025

Conversation

@matschreiner
Contributor

@matschreiner matschreiner commented Oct 10, 2025

Description

This PR reorganizes the configuration structure by introducing a new top-level schema called system, which groups the subschemas hardware, storage, and files (see issue #513).

Previously, paths and files were defined under hardware, which was confusing since they describe filesystem layout rather than compute resources.
It also wasn’t clear whether paths referred to directories or files. paths has now been renamed to storage, clarifying its role as the definition of the directory structure for inputs, outputs, and logs.

system/
├── hardware.yaml
├── input.yaml
└── output.yaml

The PR also isolates the path-concatenation logic in the pydantic schema, so we no longer need to write out the full path for every output, log, etc. in each field. Concatenating paths by hand is very brittle, and it used to happen both in code throughout the framework and inside the nested configuration files. It now happens in only one place.
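As a rough illustration of the "join paths in one place" idea, here is a minimal sketch. It is not the schema from this PR: it uses a stdlib dataclass as a stand-in for the pydantic model, and the class name `OutputPaths` and the fields `root`, `checkpoints`, `plots`, and `logs` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class OutputPaths:
    """Hypothetical output layout: one root plus relative subdirectories."""

    root: Path
    checkpoints: Path = Path("checkpoints")
    plots: Path = Path("plots")
    logs: Path = Path("logs")

    def __post_init__(self) -> None:
        # The only place where sub-paths are joined onto the root.
        self.root = Path(self.root)
        for name in ("checkpoints", "plots", "logs"):
            sub = Path(getattr(self, name))
            # Joining with an absolute path keeps the absolute path,
            # so a user can still override a single entry explicitly.
            setattr(self, name, self.root / sub)
```

With something like this, config files only need to state the root and any deviations from the defaults, and the rest of the code reads already-resolved paths instead of concatenating them itself.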


📚 Documentation preview 📚: https://anemoi-training--598.org.readthedocs.build/en/598/


📚 Documentation preview 📚: https://anemoi-graphs--598.org.readthedocs.build/en/598/


📚 Documentation preview 📚: https://anemoi-models--598.org.readthedocs.build/en/598/

@matschreiner matschreiner changed the title System config fix(training): Refactor configuration introduce system config with hardware, paths, and files subschemas Oct 10, 2025
@matschreiner matschreiner changed the title fix(training): Refactor configuration introduce system config with hardware, paths, and files subschemas fix(training): Refactor configuration by introducing system schema with hardware, paths, and files subschemas Oct 10, 2025
@matschreiner matschreiner requested a review from Copilot October 10, 2025 19:50
Contributor

Copilot AI left a comment


Pull Request Overview

This PR refactors the configuration structure by introducing a new top-level system schema that groups hardware, paths, and files subschemas. Previously, paths and files were defined under hardware, which was confusing since they describe filesystem layout rather than compute resources. The new structure separates hardware configuration from I/O configuration, making the hierarchy more intuitive.

  • Moved paths and files from hardware to new system schema
  • Created separate hardware subschema specifically for compute resources
  • Updated all references throughout codebase from config.hardware.* to config.system.*
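For users with existing config dictionaries, the move described in the bullets above could be sketched roughly as follows. This is an assumption-laden example, not code from the PR: the helper `migrate_hardware_to_system` is hypothetical, and it assumes the old layout kept `paths` and `files` directly under `hardware` and that the new layout is `system.hardware`, `system.paths`, and `system.files`, as the summary states. The final key names in the merged schema may differ.

```python
from copy import deepcopy
from typing import Any


def migrate_hardware_to_system(config: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `config` with the old `hardware` block moved under `system`.

    Assumed old layout: hardware: {paths: ..., files: ..., <compute keys>}
    Assumed new layout: system: {hardware: {<compute keys>}, paths: ..., files: ...}
    """
    new = deepcopy(config)  # never mutate the caller's config
    old_hardware = new.pop("hardware", None)
    if old_hardware is None:
        return new  # nothing to migrate

    system = new.setdefault("system", {})
    # I/O-related blocks move out of `hardware` and sit next to it.
    for key in ("paths", "files"):
        if key in old_hardware:
            system[key] = old_hardware.pop(key)
    # Whatever is left describes compute resources.
    system["hardware"] = old_hardware
    return new
```

Code that previously read `config.hardware.paths` would then read `config.system.paths` under these assumptions.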

Reviewed Changes

Copilot reviewed 48 out of 48 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| training/src/anemoi/training/schemas/system.py | Introduces new SystemSchema with hardware, files, and paths subschemas |
| training/src/anemoi/training/schemas/base_schema.py | Updates schema to use SystemSchema instead of HardwareSchema |
| training/src/anemoi/training/train/train.py | Updates all config references to use new system schema structure |
| training/src/anemoi/training/config/hardware/* | Restructures hardware config files to separate compute from I/O settings |
| tests and config files | Updates all configuration references to use new system schema structure |


matschreiner and others added 2 commits October 10, 2025 21:51
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@mchantry mchantry added the ATS Approval Needed Approval needed by ATS label Oct 13, 2025
@anaprietonem anaprietonem changed the title fix(training): Refactor configuration by introducing system schema with hardware, paths, and files subschemas fix(training)!: Refactor configuration by introducing system schema with hardware, paths, and files subschemas Oct 15, 2025
@dietervdb-meteo
Contributor

I like this, but not sure if 'storage' captures what used to be described under 'paths'. Maybe 'directories' or... ?

Contributor

@dietervdb-meteo dietervdb-meteo left a comment


As previously, LGTM

@anaprietonem anaprietonem merged commit da02fe7 into main Nov 26, 2025
19 checks passed
@github-project-automation github-project-automation bot moved this from Under Review to Done in Anemoi-dev Nov 26, 2025
@anaprietonem anaprietonem deleted the hardware_config branch November 26, 2025 12:13
@DeployDuck DeployDuck mentioned this pull request Nov 26, 2025
JPXKQX added a commit that referenced this pull request Dec 2, 2025
## Description
<!-- What issue or task does this change relate to? -->

This PR removes unused code:

- ensdatamodule.py reintroduced in #598 
- some references introduced in #651 that are not used inside the
datamodule
- others: `n_samples_per_epoch_total` and
`n_samples_per_epoch_per_worker`.


***As a contributor to the Anemoi framework, please ensure that your
changes include unit tests, updates to any affected dependencies and
documentation, and have been tested in a parallel setting (i.e., with
multiple GPUs). As a reviewer, you are also responsible for verifying
these aspects and requesting changes if they are not adequately
addressed. For guidelines about those please refer to
https://anemoi.readthedocs.io/en/latest/***

By opening this pull request, I affirm that all authors agree to the
[Contributor License
Agreement.](https://github.com/ecmwf/codex/blob/main/Legal/contributor_license_agreement.md)
@frazane
Contributor

frazane commented Dec 4, 2025

Are changes from this PR reflected in the user documentation or is anybody working on it? I am still seeing the old "hardware" section referenced in the doc.

@dietervdb-meteo
Contributor

Very good point... I overlooked that when reviewing, sorry. So used to go and look for config changes in the default configs inside the code rather than the docs...

@dietervdb-meteo
Contributor

@frazane , can you make an issue?

@JPXKQX
Member

JPXKQX commented Dec 4, 2025

I have already opened an issue (#703) and started to document some of these.

JPXKQX pushed a commit that referenced this pull request Dec 5, 2025
🤖 Automated Release PR

This PR was created by `release-please` to prepare the next release.
Once merged:

1. A new version tag will be created
2. A GitHub release will be published
3. The changelog will be updated

Changes to be included in the next release:
---


<details><summary>training: 0.8.0</summary>

##
[0.8.0](training-0.7.0...training-0.8.0)
(2025-12-05)


### ⚠ BREAKING CHANGES

* **training:** Refactor configuration by introducing system schema with
hardware, paths, and files subschemas
([#598](#598))
* cond layer norm
([#658](#658))

### Features

* Activate minmium plotting for integration tests
([#669](#669))
([84e5882](84e5882))
* Compile transformer gnn
([#181](#181))
([24d162c](24d162c))
* **models:** Add configurable residual connections in enc-proc-dec
([#670](#670))
([aeaf00b](aeaf00b))
* **models:** Triton GraphTransformer
([#631](#631))
([b40b6c6](b40b6c6))
* Time_interpolator_callbacks
([#677](#677))
([c2b8179](c2b8179))
* **training:** Performance docs
([#696](#696))
([9574ff1](9574ff1))
* **training:** Refactor optimizer creation to support custom and torch
optimizers ([#588](#588))
([cd777fb](cd777fb))


### Bug Fixes

* Add package config path to Hydra search path in plugin
([#656](#656))
([ca6f732](ca6f732))
* Cond layer norm
([#658](#658))
([7315e3a](7315e3a))
* **logger:** Bugs in AzureMLFlowLogger from
[#646](#646)
([#685](#685))
([14c0235](14c0235))
* **models:** Processor chunking
([#629](#629))
([06e5533](06e5533))
* Pass weights_only for pytorch lightning >= 2.6.0
([#713](#713))
([7446942](7446942))
* Ptl 2.6.0 explicitly pass weights_only=False
([#710](#710))
([e18824c](e18824c))
* RolloutEval sharding
([#714](#714))
([0fbc071](0fbc071))
* Slurm system config
([#702](#702))
([cce8763](cce8763))
* Target docs ([#704](#704))
([200101e](200101e))
* **training,tasks:** Abstract RolloutForecasting task
([#682](#682))
([f14fc32](f14fc32))
* **training:** CombinedLoss schema validation
([#719](#719))
([dba4268](dba4268))
* **training:** Refactor configuration by introducing system schema with
hardware, paths, and files subschemas
([#598](#598))
([da02fe7](da02fe7))
* **training:** Remove unused code
([#706](#706))
([f49813a](f49813a))
</details>

<details><summary>graphs: 0.8.0</summary>

##
[0.8.0](graphs-0.7.2...graphs-0.8.0)
(2025-12-05)


### ⚠ BREAKING CHANGES

* **edges:** Edge feature revision #643
([#727](#727))

### Features

* **edges:** Edge feature revision
([#643](#643))
([720f4d8](720f4d8))
* **edges:** Edge feature revision
[#643](#643)
([#727](#727))
([d1372cf](d1372cf))
* **graphs:** Support for multi-scale connections with HEALPix hidden
grid ([#691](#691))
([1450787](1450787))


### Bug Fixes

* Revert "feat(edges): Edge feature revision"
([#726](#726))
([db1f940](db1f940))
* Sparse export
([#686](#686))
([969b787](969b787))
* Target docs ([#704](#704))
([200101e](200101e))
</details>

<details><summary>models: 0.11.0</summary>

##
[0.11.0](models-0.10.0...models-0.11.0)
(2025-12-05)


### ⚠ BREAKING CHANGES

* **training:** Refactor configuration by introducing system schema with
hardware, paths, and files subschemas
([#598](#598))
* cond layer norm
([#658](#658))

### Features

* Compile transformer gnn
([#181](#181))
([24d162c](24d162c))
* **models:** Add configurable residual connections in enc-proc-dec
([#670](#670))
([aeaf00b](aeaf00b))
* **models:** Multibackend all_to_all wrapper
([#95](#95))
([6819be1](6819be1))
* **models:** Triton GraphTransformer
([#631](#631))
([b40b6c6](b40b6c6))


### Bug Fixes

* Compile pickle error
([#708](#708))
([f4fc4ab](f4fc4ab))
* Cond layer norm
([#658](#658))
([7315e3a](7315e3a))
* **models:** Processor chunking
([#629](#629))
([06e5533](06e5533))
* Predict_step shard shapes
([#692](#692))
([be9ff8b](be9ff8b))
* Remove import of anemoi training in compile
([#705](#705))
([f7d5ae4](f7d5ae4))
* Small pytorch boxcox inefficiency
([#683](#683))
([66b40e0](66b40e0))
* **training:** Refactor configuration by introducing system schema with
hardware, paths, and files subschemas
([#598](#598))
([da02fe7](da02fe7))
</details>

---
> [!IMPORTANT]
> Please do not change the PR title, manifest file, or any other
automatically generated content in this PR unless you understand the
implications. Changes here can break the release process.
> 
> ⚠️ Merging this PR will:
> - Create a new release
> - Trigger deployment pipelines
> - Update package versions

 **Before merging:**
 - Ensure all tests pass
 - Review the changelog carefully
 - Get required approvals

[Release-please
documentation](https://github.com/googleapis/release-please)
mpvginde pushed a commit that referenced this pull request Dec 8, 2025
…ith hardware, paths, and files subschemas (#598)

# Description
This PR reorganizes the configuration structure by introducing a new
top-level schema called system, which groups the subschemas hardware,
storage, and files (see [issue
#513](#513)).

Previously, paths and files were defined under hardware, which was
confusing since they describe filesystem layout rather than compute
resources.
It also wasn’t clear whether paths refer to directories or files. It has
now been renamed to storage, clarifying its role as the definition of
directory structure for inputs, outputs, and logs.

```
system/
├── hardware.yaml
├── input.yaml
└── output.yaml
```

The PR also isolates the concatenation logic for paths in the pydantic
schema so we don't need to write out the full paths for all
outputs/logs/etc in each field. This is very brittle and used to happen
in both code throughout the framework and inside the nested
configuration files. This is now isolated to happen in only one place.

<!-- readthedocs-preview anemoi-training start -->
----
📚 Documentation preview 📚:
https://anemoi-training--598.org.readthedocs.build/en/598/

<!-- readthedocs-preview anemoi-training end -->

<!-- readthedocs-preview anemoi-graphs start -->
----
📚 Documentation preview 📚:
https://anemoi-graphs--598.org.readthedocs.build/en/598/

<!-- readthedocs-preview anemoi-graphs end -->

<!-- readthedocs-preview anemoi-models start -->
----
📚 Documentation preview 📚:
https://anemoi-models--598.org.readthedocs.build/en/598/

<!-- readthedocs-preview anemoi-models end -->

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: anaprietonem <ana.prietonemesio@ecmwf.int>
Co-authored-by: Ana Prieto Nemesio <91897203+anaprietonem@users.noreply.github.com>
Co-authored-by: Dieter Van den Bleeken <dieter.vandenbleeken@meteo.be>
Co-authored-by: Mario Santa Cruz <48736305+JPXKQX@users.noreply.github.com>
mpvginde pushed a commit that referenced this pull request Dec 8, 2025
## Description
<!-- What issue or task does this change relate to? -->

This PR removes unused code:

- ensdatamodule.py reintroduced in #598 
- some references introduced in #651 that are not used inside the
datamodule
- others: `n_samples_per_epoch_total` and
`n_samples_per_epoch_per_worker`.


***As a contributor to the Anemoi framework, please ensure that your
changes include unit tests, updates to any affected dependencies and
documentation, and have been tested in a parallel setting (i.e., with
multiple GPUs). As a reviewer, you are also responsible for verifying
these aspects and requesting changes if they are not adequately
addressed. For guidelines about those please refer to
https://anemoi.readthedocs.io/en/latest/***

By opening this pull request, I affirm that all authors agree to the
[Contributor License
Agreement.](https://github.com/ecmwf/codex/blob/main/Legal/contributor_license_agreement.md)
