Collaborator

@b8raoult b8raoult commented Oct 7, 2025

Description

This PR allows tracking the origins of variables. Previously, only the MARS request was clearly stored. Origins are now recorded at two stages:

  • When the dataset is created: which source was used (mars, netcdf, etc.) and which filters were applied (rename, regrid, clip, ...).
  • When the dataset is used in training: which operations were performed (combining datasets, subsetting, cutout, ...).

The outcome is available by calling ds.metadata().

Example:

from anemoi.datasets import open_dataset

ds = open_dataset(
    [
        {
            "dataset": "aifs-od-an-oper-0001-mars-n320-2016-2023-6h-v6",
            "end": 2023,
            "frequency": "6h",
        },
        {
            "dataset": "aifs-od-an-oper-0001-mars-n320-2016-2023-6h-v2-precipitations",
            "end": 2023,
            "frequency": "6h",
            "rename": {"tp_0h_6h": "tp"},
            "select": ["tp_0h_6h"],
        },
    ],
    end=2023,
)

For tp, the recorded origin is:

{
    "type": "pipe",
    "when": "dataset-usage",
    "steps": [
        {
            "steps": [
                {
                    "config": {
                        "accumulation_period": 6,
                        "grid": "n320",
                        "levtype": "sfc",
                        "param": ["tp", "cp", "sf"],
                    },
                    "name": "accumulations",
                    "type": "source",
                    "when": "dataset-create",
                },
                {
                    "config": {"param": "{param}_{startStep}h_{endStep}h"},
                    "name": "rename",
                    "type": "filter",
                    "when": "dataset-create",
                },
            ],
            "type": "pipe",
            "when": "dataset-create",
        },
        {
            "name": "rename",
            "config": {"rename": {"tp_0h_6h": "tp"}},
            "when": "dataset-usage",
            "type": "filter",
        },
    ],
}
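
One way to read this structure is as a tree: "pipe" nodes hold an ordered list of "steps", while "source" and "filter" nodes are leaves. The sketch below flattens that tree into the ordered sequence of operations. It assumes only the keys visible in the example above, and `flatten_origin` is a hypothetical helper, not part of anemoi-datasets.

```python
# Hypothetical helper, not part of anemoi-datasets. It assumes the origin
# layout shown in the example: "pipe" nodes carry an ordered "steps" list,
# while "source" and "filter" nodes are leaves with "name" and "when" keys.


def flatten_origin(node):
    """Return (when, type, name) tuples in the order the steps were applied."""
    if node["type"] == "pipe":
        flat = []
        for step in node["steps"]:
            flat.extend(flatten_origin(step))
        return flat
    return [(node["when"], node["type"], node["name"])]


# Abridged copy of the origin shown for tp ("config" entries omitted).
tp_origin = {
    "type": "pipe",
    "when": "dataset-usage",
    "steps": [
        {
            "type": "pipe",
            "when": "dataset-create",
            "steps": [
                {"name": "accumulations", "type": "source", "when": "dataset-create"},
                {"name": "rename", "type": "filter", "when": "dataset-create"},
            ],
        },
        {"name": "rename", "type": "filter", "when": "dataset-usage"},
    ],
}

for when, kind, name in flatten_origin(tp_origin):
    print(f"{when:15} {kind:7} {name}")
```

Read top to bottom, the flattened list gives the history of tp: an accumulations source and a rename filter at dataset creation, then a second rename at dataset usage.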

What problem does this change solve?

What issue or task does this change relate to?

Additional notes

As a contributor to the Anemoi framework, please ensure that your changes include unit tests, updates to any affected dependencies and documentation, and have been tested in a parallel setting (i.e., with multiple GPUs). As a reviewer, you are also responsible for verifying these aspects and requesting changes if they are not adequately addressed. For guidelines about those please refer to https://anemoi.readthedocs.io/en/latest/

By opening this pull request, I affirm that all authors agree to the Contributor License Agreement.


📚 Documentation preview 📚: https://anemoi-datasets--437.org.readthedocs.build/en/437/

@b8raoult b8raoult requested a review from a team as a code owner October 7, 2025 17:56
@github-project-automation github-project-automation bot moved this to To be triaged in Anemoi-dev Oct 7, 2025
@github-actions github-actions bot added the documentation, dependencies, and tests labels Oct 7, 2025
@HCookie HCookie marked this pull request as draft October 7, 2025 18:26
@b8raoult b8raoult marked this pull request as ready for review October 7, 2025 20:18
@b8raoult b8raoult changed the title from "Feat/origin" to "feat: track origins of variable from dataset creation to inference" Oct 7, 2025
@github-actions github-actions bot added the bug and enhancement labels Oct 8, 2025
Contributor

@aaron-hopkinson aaron-hopkinson left a comment

There are quite a few unrelated changes in this PR. It would be good to isolate this to just the changes required for origin tracking in order to make it easier to review.

Some of the new classes, particularly the Origin-related and Projection-related ones, could do with unit tests to make clear how they're supposed to work. At the moment, the code is a bit hard to follow without the additional context. Informative type hints might help too.

"anemoi-utils[provenance]>=0.4.32",
"cfunits",
"glom",
"jsonschema",
Contributor

Why is this being removed?


return GroupOfDates(sorted(set(group_of_dates) & set(filtering_dates)), group_of_dates.provider)

def origin(self, data: Any, action: Any, action_arguments: Any) -> Any:
Contributor

If we're going to have type hints, can they be more specific? e.g. action must be an object supporting the origin method.
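
One way to express "an object supporting the origin method" is a structural type via `typing.Protocol`. The sketch below is only an illustration: `SupportsOrigin`, `RenameAction`, `describe`, and the zero-argument `origin()` signature are assumptions, not names or signatures taken from the PR.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class SupportsOrigin(Protocol):
    """Anything exposing an ``origin`` method (hypothetical name and signature)."""

    def origin(self) -> Any: ...


class RenameAction:
    """Toy stand-in for an action; not a class from the PR."""

    def origin(self) -> dict:
        return {"type": "filter", "name": "rename"}


def describe(action: SupportsOrigin) -> Any:
    # A static type checker flags callers that pass an object without ``origin``.
    return action.origin()
```

With such a protocol, the `action: Any` parameter could be narrowed without forcing actions to share a base class.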

previous = fs.metadata("anemoi_origin", default=None)
fall_through = fs.metadata("anemoi_fall_through", default=False)
if fall_through:
# The field has pass unchanges in a filter
Contributor

Suggested change
- # The field has pass unchanges in a filter
+ # The field has passed unchanged through a filter

return _data_request(self.datasource)

@property
def origins(self) -> dict[str, Any]:
Contributor

Suggested change
- def origins(self) -> dict[str, Any]:
+ def origins(self) -> dict[str, list]:

Contributor

(Unless self._origins eventually becomes a different object)

@property
def origins(self) -> dict[str, Any]:
"""Returns a dictionary with the parameters needed to retrieve the data."""
return {"version": 1, "origins": self._origins}
Contributor

Do we anticipate that this will need to be a versioned object of some kind? Do we really need it, or will it just make things more complicated?
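
For context on the trade-off: a version field mainly pays off if a consumer actually checks it. A minimal sketch of such a check, assuming the payload shape returned by the property above; `read_origins` and `SUPPORTED_VERSIONS` are hypothetical names, not part of the PR.

```python
# Hypothetical reader for the {"version": 1, "origins": ...} payload shown above.
SUPPORTED_VERSIONS = {1}


def read_origins(payload):
    """Return the "origins" entry, refusing versions this reader does not know."""
    version = payload.get("version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"Unsupported origins version: {version!r}")
    return payload["origins"]
```

If no reader is expected to branch on the version, returning `self._origins` directly would be the simpler option.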

def from_slices(cls, slices):
return Projection(slices)

@classmethod
Contributor

Both this and the method above don't use cls. Would they be better as staticmethods, or do you envisage these depending on the class in the future?
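
To illustrate the difference, here is a small sketch with a toy `Projection` (only the single-argument constructor is assumed from the snippet above; `SubProjection` and `from_slices_static` are hypothetical). A classmethod that builds `cls(...)` is subclass-aware, whereas one that hard-codes `Projection(...)` behaves like a staticmethod.

```python
class Projection:
    """Toy stand-in for the PR's Projection class; only the constructor is assumed."""

    def __init__(self, slices):
        self.slices = slices

    @classmethod
    def from_slices(cls, slices):
        # Using cls (rather than the hard-coded Projection) lets subclasses
        # get instances of their own type from the same factory.
        return cls(slices)

    @staticmethod
    def from_slices_static(slices):
        # Equivalent to the code under review: always builds the base class.
        return Projection(slices)


class SubProjection(Projection):
    pass


print(type(SubProjection.from_slices([slice(0, 2)])).__name__)
print(type(SubProjection.from_slices_static([slice(0, 2)])).__name__)
```

If subclassing is not planned, staticmethods state the intent more directly; if it is, returning `cls(slices)` keeps the classmethods meaningful.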

-------
Iterator[datetime.datetime]
An iterator of datetime objects.
Parameters
Contributor

Changes seem unrelated to the origins work – move to a separate PR?

Returns:
Union[str, List[str]]: The shortened list of dates.
Parameters
Contributor

As above - unrelated to this work

# TODO: Use the one from anemoi.utils.grids instead
# from anemoi.utils.grids import ...
from scipy.spatial import cKDTree
from scipy.spatial import KDTree
Contributor

Unrelated to origins tracking

@zarr_tests
@not_ready
def test_class_missing_dates_fill():
pass
Contributor

Lots of empty tests here - maybe better to create an issue for them rather than having them in here where they'll be forgotten about

@github-project-automation github-project-automation bot moved this from To be triaged to Under Review in Anemoi-dev Oct 8, 2025

Labels

ATS approval needed, bug, dependencies, documentation, enhancement, tests

Projects

Status: Under Review

3 participants