Changes from all commits
56 commits
47f650b
feat: missing features for observations
floriankrb May 28, 2025
eeeedcd
change window on the fly
floriankrb Jun 2, 2025
ecba1db
implement netcdf backend
floriankrb Jun 4, 2025
8ccd237
up
floriankrb Jun 4, 2025
e2b54ad
up
floriankrb Jun 6, 2025
fcd46a3
up
floriankrb Jun 10, 2025
a02deb3
fix
floriankrb Jun 10, 2025
48df56e
creating obs dataset. Draft
floriankrb Jun 10, 2025
4823099
creating obs dataset. Draft
floriankrb Jun 10, 2025
89fe751
feat(obs): add set_group and rename
JPXKQX Jun 12, 2025
4e6d877
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 12, 2025
2bd2567
fieldsrecords and set_group. and some refactor. may not be backward c…
floriankrb Jun 11, 2025
ab56402
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 12, 2025
111bcc8
adding odb mars draft example
pinnstorm Jun 17, 2025
5738660
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 17, 2025
3680c6b
adding odb mars draft example
pinnstorm Jun 17, 2025
227f815
adding odb mars draft example
pinnstorm Jun 17, 2025
bc70153
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 17, 2025
a195b03
adding needed metadata file
pinnstorm Jun 17, 2025
9bd5ca5
mering branch
pinnstorm Jun 17, 2025
1f9f9aa
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 17, 2025
ebfe5af
Merge branch 'main' into feature/missing-features-for-observations
floriankrb Jun 17, 2025
a3e72f7
qa
floriankrb Jun 17, 2025
2b01564
updating mars example
pinnstorm Jun 18, 2025
9bb6c66
merging in remote
pinnstorm Jun 18, 2025
8d6ccd6
updating mars odb example
pinnstorm Jun 19, 2025
3c38f5f
Merge branch 'main' into feature/missing-features-for-observations
floriankrb Jun 23, 2025
6b2adb8
more metadata
floriankrb Jun 23, 2025
dcc1802
fix timedelta type
floriankrb Jun 24, 2025
826dff8
more logs
floriankrb Jun 24, 2025
d4be463
typo
floriankrb Jun 24, 2025
552c797
update 2025.06.24
floriankrb Jun 24, 2025
0053aec
up
floriankrb Jun 30, 2025
7c2d4fb
padding="missing" or "raise"
floriankrb Jul 1, 2025
031e9f2
DOP dataset: first draft, missing len, sample factory index issue
mishooax Jul 2, 2025
5bebd27
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 2, 2025
fbb4121
remove codes that does not belong here
floriankrb Jul 2, 2025
e6ecbc0
up
floriankrb Jul 2, 2025
ad61f65
bring lats and lons
JPXKQX Jul 3, 2025
ea2f7d9
update timedelta to 0s for field records
JPXKQX Jul 3, 2025
c10ea85
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 3, 2025
e565866
add metadata
JPXKQX Jul 3, 2025
6faeb15
Merge branch 'feature/missing-features-for-observations' of https://g…
JPXKQX Jul 3, 2025
348d43a
Merge branch 'main' into feature/missing-features-for-observations
floriankrb Jul 7, 2025
98dae84
revert inspect
floriankrb Jul 7, 2025
a9816ca
fix
floriankrb Jul 16, 2025
20fc185
fix padding
floriankrb Jul 17, 2025
558f1a6
fix: _select with set_group
JPXKQX Aug 5, 2025
aa8e16d
Merge branch 'main' into feature/missing-features-for-observations
floriankrb Aug 20, 2025
f2615d6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Aug 20, 2025
1d94d5a
up
floriankrb Sep 5, 2025
6941fdf
add type hint
JPXKQX Sep 5, 2025
9ef1fc3
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 5, 2025
4f5acbb
adding bufr examples to obs-data building
pinnstorm Sep 9, 2025
23e8a5f
merging in remote
pinnstorm Sep 9, 2025
8b6b765
added some comments
floriankrb Sep 26, 2025
43 changes: 43 additions & 0 deletions src/anemoi/datasets/create/sources/observations.py
@@ -0,0 +1,43 @@
# (C) Copyright 2025 Anemoi contributors.
#
# This software is licensed under the terms of the Apache Licence Version 2.0
# which can be obtained at http://www.apache.org/licenses/LICENSE-2.0.
#
# In applying this licence, ECMWF does not waive the privileges and immunities
# granted to it by virtue of its status as an intergovernmental organisation
# nor does it submit to any jurisdiction.


import pandas as pd


def check_dataframe(df):
"""Check the DataFrame for consistency."""
if df.empty:
pass
Contributor comment: if df.empty I guess you don't want to do the checks below? If so, should this be return?

if "times" not in df.columns:
raise ValueError("The DataFrame must contain a 'times' column.")
if not pd.api.types.is_datetime64_any_dtype(df["times"]):
raise TypeError("The 'times' column must be of datetime type.")
if "latitudes" not in df.columns or "longitudes" not in df.columns:
raise ValueError("The DataFrame must contain 'latitudes' and 'longitudes' columns.")


class ObservationsSource:
Contributor comment: Is the plan for ObservationsSource and ObservationsFilter to implement similar interfaces as the field equivalents, e.g. taking in datetimes (or windows in this case?) and returning e.g. dataframes?

def __call__(self, window):
raise NotImplementedError("This method should be implemented by subclasses")
Contributor comment: Make an @abstractmethod?


def _check(self, df):
check_dataframe(df)
return df


class ObservationsFilter:
def __call__(self, df):
"""Filter the data based on the given window."""
check_dataframe(df)
Contributor comment: call self._check(df)?

return df

def _check(self, df):
check_dataframe(df)
return df
Contributor comment: Here and above, should we return df? (I don't mind, just wonder if it's surprising.)

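The reviewer suggestions above (early return on an empty frame, making `__call__` an `@abstractmethod`) could be sketched as follows. This is a hypothetical revision of the code in the diff, not the PR's actual implementation:

```python
from abc import ABC, abstractmethod

import pandas as pd


def check_dataframe(df: pd.DataFrame) -> None:
    """Validate the observation DataFrame; an empty frame passes as-is."""
    if df.empty:
        return  # nothing to validate, per the reviewer's suggestion
    if "times" not in df.columns:
        raise ValueError("The DataFrame must contain a 'times' column.")
    if not pd.api.types.is_datetime64_any_dtype(df["times"]):
        raise TypeError("The 'times' column must be of datetime type.")
    if "latitudes" not in df.columns or "longitudes" not in df.columns:
        raise ValueError("The DataFrame must contain 'latitudes' and 'longitudes' columns.")


class ObservationsSource(ABC):
    @abstractmethod
    def __call__(self, window) -> pd.DataFrame:
        """Return the observations falling inside the given time window."""

    def _check(self, df: pd.DataFrame) -> pd.DataFrame:
        check_dataframe(df)
        return df
```

With `ABC` in the bases, instantiating `ObservationsSource` directly raises a `TypeError`, which replaces the `NotImplementedError` convention in the diff.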
17 changes: 13 additions & 4 deletions src/anemoi/datasets/data/dataset.py
@@ -136,7 +136,8 @@ def _subset(self, **kwargs: Any) -> "Dataset":
if not kwargs:
return self.mutate()

name = kwargs.pop("name", None)
name = kwargs.pop("set_group", None) # TODO(Florian)
name = kwargs.pop("name", name)
result = self.__subset(**kwargs)
result._name = name

@@ -177,13 +178,18 @@ def __subset(self, **kwargs: Any) -> "Dataset":
padding = kwargs.pop("padding", None)
Contributor comment: What is a padding? (type str? just one of "empty", "raise", "missing"?). Is it worth a type hint/docstring if it can only accept these literals?


if padding:
if padding != "empty":
raise ValueError(f"Only 'empty' padding is supported, got {padding=}")
from .padded import Padded

frequency = kwargs.pop("frequency", self.frequency)
return (
Padded(self, start, end, frequency, dict(start=start, end=end, frequency=frequency))
Padded(
self,
start=start,
end=end,
frequency=frequency,
padding=padding,
reason=dict(start=start, end=end, frequency=frequency, padding=padding),
Contributor comment: What's the need for the reason? Could we derive this from the other arguments? Do we envisage a use-case where it can't be derived from the arguments?

)
._subset(**kwargs)
.mutate()
)
@@ -404,6 +410,9 @@ def _select_to_columns(self, vars: str | list[str] | tuple[str] | set) -> list[i
if not isinstance(vars, (list, tuple)):
vars = [vars]

for v in vars:
if v not in self.name_to_index:
raise ValueError(f"select: unknown variable: {v}, available: {list(self.name_to_index)}")
return [self.name_to_index[v] for v in vars]

def _drop_to_columns(self, vars: str | Sequence[str]) -> list[int]:
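The reviewer above asks whether `padding` deserves a type hint given it only accepts three literals. One way to express that, sketched here as a hypothetical helper rather than the PR's code:

```python
from typing import Literal, Optional

# Hypothetical narrowing of the `padding` keyword discussed in the diff:
# only these three modes are meaningful, so a Literal alias documents them.
Padding = Literal["empty", "raise", "missing"]

_ALLOWED = ("empty", "raise", "missing")


def validate_padding(padding: Optional[str]) -> Optional[str]:
    """Reject anything outside the accepted padding modes; None means no padding."""
    if padding is not None and padding not in _ALLOWED:
        raise ValueError(f"Unsupported padding {padding!r}, expected one of {_ALLOWED}")
    return padding
```

The `Literal` alias gives static checkers and readers the full set of accepted values, which also answers the docstring question without prose.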
1 change: 1 addition & 0 deletions src/anemoi/datasets/data/debug.py
@@ -67,6 +67,7 @@ def __init__(self, dataset: "Dataset", kids: list[Any], **kwargs: Any) -> None:
Additional keyword arguments.
"""
self.dataset = dataset
assert isinstance(kids, list), "Kids must be a list"
Contributor comment: Are other iterables acceptable?

self.kids = kids
self.kwargs = kwargs

43 changes: 29 additions & 14 deletions src/anemoi/datasets/data/misc.py
@@ -349,19 +349,7 @@ def _open(a: str | PurePath | dict[str, Any] | list[Any] | tuple[Any, ...]) -> "
"""
from .dataset import Dataset
from .stores import Zarr
from .stores import zarr_lookup

if isinstance(a, str) and len(a.split(".")) in [2, 3]:

metadata_path = os.path.join(a, "metadata.json")
if os.path.exists(metadata_path):
metadata = load_any_dict_format(metadata_path)
if "backend" not in metadata:
raise ValueError(f"Metadata for {a} does not contain 'backend' key")

from anemoi.datasets.data.records import open_records_dataset

return open_records_dataset(a, backend=metadata["backend"])
from .stores import dataset_lookup

if isinstance(a, Dataset):
return a.mutate()
@@ -370,7 +358,22 @@
return Zarr(a).mutate()

if isinstance(a, str):
return Zarr(zarr_lookup(a)).mutate()
path = dataset_lookup(a)

if path and path.endswith(".zarr") or path.endswith(".zip"):
return Zarr(path).mutate()

if path and path.endswith(".vz"):
metadata_path = os.path.join(path, "metadata.json")
Contributor comment: I feel like all the checking here is the responsibility of open_records_dataset rather than being in this function

Contributor comment: Some of this code is actually in that function - we should remove the duplication

if os.path.exists(metadata_path):
if "backend" not in load_any_dict_format(metadata_path):
raise ValueError(f"Metadata for {path} does not contain 'backend' key")

from anemoi.datasets.data.records import open_records_dataset

return open_records_dataset(path)

raise ValueError(f"Unsupported dataset path: {path}. ")

if isinstance(a, PurePath):
return _open(str(a)).mutate()
@@ -587,6 +590,18 @@ def _open_dataset(*args: Any, **kwargs: Any) -> "Dataset":

assert len(sets) > 0, (args, kwargs)

if "set_group" in kwargs:
Contributor comment: What is a "set_group"? Why does this result in returning FieldsRecords?

from anemoi.datasets.data.records import FieldsRecords

set_group = kwargs.pop("set_group")
assert len(sets) == 1, "set_group can only be used with a single dataset"
dataset = sets[0]

from anemoi.datasets.data.dataset import Dataset

if isinstance(dataset, Dataset): # Fields dataset
return FieldsRecords(dataset, **kwargs, name=set_group).mutate()

if len(sets) > 1:
dataset, kwargs = _concat_or_join(sets, kwargs)
return dataset._subset(**kwargs)
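A detail worth noting in the `_open` dispatch above: `path and path.endswith(".zarr") or path.endswith(".zip")` parses as `(path and ...) or path.endswith(".zip")`, so a falsy `path` still reaches the right-hand call. A minimal sketch of the suffix dispatch with the checks grouped (a hypothetical helper, not the PR's code):

```python
def classify_dataset_path(path: str) -> str:
    """Classify a dataset path by suffix, mirroring the dispatch in `_open`.

    Grouping the zarr/zip suffixes in one `endswith` call avoids the
    operator-precedence pitfall of `path and a or b`.
    """
    if not path:
        raise ValueError("Empty dataset path")
    if path.endswith((".zarr", ".zip")):
        return "zarr"
    if path.endswith(".vz"):
        return "records"
    raise ValueError(f"Unsupported dataset path: {path}")
```

`str.endswith` accepts a tuple of suffixes, so a single call covers both zarr-backed forms.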
6 changes: 5 additions & 1 deletion src/anemoi/datasets/data/observations/__init__.py
@@ -67,7 +67,10 @@ def __len__(self):
return len(self.dates)

def tree(self):
return Node(self)
return Node(
self,
[],
)

def __getitem__(self, i):
if isinstance(i, int):
@@ -232,6 +235,7 @@ def get_aux(self, i):
assert latitudes.shape == longitudes.shape, f"Expected {latitudes.shape}, got {longitudes.shape}"
assert timedeltas.shape == latitudes.shape, f"Expected {timedeltas.shape}, got {latitudes.shape}"

assert timedeltas.dtype == "timedelta64[s]", f"Expected timedelta64[s], got {timedeltas.dtype}"
return latitudes, longitudes, timedeltas

def getitem(self, i):
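The dtype assertion added in `get_aux` above pins timedeltas to seconds resolution. A quick illustration of that invariant, and of the zero-offset construction used in the padded.py diff further down (illustrative only, not part of the PR):

```python
import numpy as np

# Timedeltas must arrive in seconds resolution ("timedelta64[s]");
# "timedelta64[ns]" or plain integers would trip the assertion.
timedeltas = np.array([0, 3600], dtype="timedelta64[s]")
assert timedeltas.dtype == np.dtype("timedelta64[s]")

# Zero offsets matching an empty lat/lon array, as built for padded records:
lons = np.array([], dtype=np.float32)
zeros = np.ones_like(lons, dtype="timedelta64[s]") * 0
```

`np.ones_like(..., dtype=...)` keeps the shape of `lons` while switching the dtype, so the three aux arrays stay shape-aligned.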
26 changes: 22 additions & 4 deletions src/anemoi/datasets/data/padded.py
@@ -17,6 +17,7 @@
from anemoi.utils.dates import frequency_to_timedelta
from numpy.typing import NDArray

from anemoi.datasets.data import MissingDateError
from anemoi.datasets.data.dataset import Dataset
from anemoi.datasets.data.dataset import FullIndex
from anemoi.datasets.data.dataset import Shape
@@ -36,7 +37,15 @@ class Padded(Forwards):
_after: int = 0
_inside: int = 0

def __init__(self, dataset: Dataset, start: str, end: str, frequency: str, reason: dict[str, Any]) -> None:
def __init__(
self,
dataset: Dataset,
start: str,
end: str,
frequency: str,
reason: Dict[str, Any],
padding: str,
Contributor comment: NB: Padding added but not in docstring (could contain literal values if they're what we are expecting)

) -> None:
"""Create a padded subset of a dataset.

Attributes:
Expand All @@ -46,6 +55,7 @@ def __init__(self, dataset: Dataset, start: str, end: str, frequency: str, reaso
frequency (str): The frequency of the subset.
reason (Dict[str, Any]): The reason for the padding.
"""
self.padding = padding

self.reason = {k: v for k, v in reason.items() if v is not None}

@@ -164,12 +174,20 @@ def _get_tuple(self, n: TupleIndex) -> NDArray[Any]:
return [self[i] for i in n]

def empty_item(self):
return self.dataset.empty_item()
if self.padding == "empty":
return self.dataset.empty_item()
elif self.padding == "raise":
raise ValueError("Padding is set to 'raise', cannot return an empty item.")
elif self.padding == "missing":
raise MissingDateError("Padding is set to 'missing'")
assert False, self.padding
Contributor comment: Might be better to have a proper error message


def get_aux(self, i: FullIndex) -> NDArray[np.timedelta64]:
if self._i_out_of_range(i):
arr = np.array([], dtype=np.float32)
aux = arr, arr, arr
lats = np.array([], dtype=np.float32)
lons = lats
Contributor comment: Should this be a copy?

timedeltas = np.ones_like(lons, dtype="timedelta64[s]") * 0
aux = lats, lons, timedeltas
else:
aux = self.dataset.get_aux(i - self._before)

Expand Down
Loading
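The reviewer above suggests a proper error message instead of the bare `assert False` in `empty_item`. A sketch of that dispatch as a free function (the `MissingDateError` stand-in and `empty_item_for_padding` name are hypothetical, for illustration only):

```python
class MissingDateError(Exception):
    """Stand-in for anemoi.datasets.data.MissingDateError."""


def empty_item_for_padding(padding: str, empty_factory):
    """Dispatch on the padding mode, with an explicit error for unknown modes."""
    if padding == "empty":
        return empty_factory()  # delegate to the dataset's empty item
    if padding == "raise":
        raise ValueError("Padding is set to 'raise', cannot return an empty item.")
    if padding == "missing":
        raise MissingDateError("Padding is set to 'missing'")
    raise ValueError(
        f"Unknown padding mode {padding!r}; expected 'empty', 'raise' or 'missing'"
    )
```

Unlike `assert False`, the final `ValueError` survives `python -O` (asserts are stripped under optimization) and names the accepted modes in its message.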