feat: Make anemoi-datasets agnostic to Zarr version (Optional support Zarr3)#220

Merged
floriankrb merged 8 commits into main from feature/zarr3
Mar 31, 2026

Conversation

@floriankrb
Member

@floriankrb floriankrb commented Feb 26, 2025

Description

Anemoi-datasets should be agnostic to the version of zarr. It should run with zarr3 installed or with zarr2 installed.

We should also take into account that zarr2 code cannot read datasets created by zarr3. So, ideally, we should
1 - update anemoi-datasets to work with zarr2 and zarr3
2 - keep using zarr2 to build datasets (dependency of anemoi-datasets[create] on zarr<=2), but allow zarr2 or zarr3 when reading (dependency of anemoi-datasets on zarr)
3 - once users have updated their environments (6 months?), start building zarr3 datasets

This PR addresses the first point, making anemoi-datasets detect the version of zarr and adapt to it.
We still pin the version to zarr 2 in the pyproject.toml because of performance issues with zarr version 3.
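
As a rough illustration of the detection step (not the code in this PR), a version check could be sketched as below. The helper names `major_version` and `installed_zarr_major` are hypothetical. Only `importlib.metadata` from the standard library is used, so the snippet also runs when zarr is not installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '2.18.3'."""
    return int(version_string.split(".")[0])


def installed_zarr_major() -> Optional[int]:
    """Major version of the installed zarr package, or None if it is absent."""
    try:
        return major_version(version("zarr"))
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    major = installed_zarr_major()
    if major is None:
        print("zarr is not installed")
    else:
        # Code that differs between zarr2 and zarr3 would branch on this value.
        print(f"zarr major version: {major}")
```

Branching on a single integer keeps the version-specific code paths easy to find and to delete once only one major version is supported.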


📚 Documentation preview 📚: https://anemoi-datasets--220.org.readthedocs.build/en/220/

@github-actions github-actions bot added the tests and enhancement (New feature or request) labels Feb 26, 2025
@codecov-commenter

codecov-commenter commented Feb 27, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.85%. Comparing base (80de4c6) to head (1459e83).
Report is 62 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #220      +/-   ##
==========================================
+ Coverage   72.96%   73.85%   +0.88%     
==========================================
  Files          10       10              
  Lines         825      872      +47     
==========================================
+ Hits          602      644      +42     
- Misses        223      228       +5     

@github-actions github-actions bot added the dependencies (Pull requests that update a dependency file) label Feb 27, 2025
@floriankrb floriankrb marked this pull request as ready for review March 12, 2025 15:59
@floriankrb floriankrb marked this pull request as draft March 28, 2025 09:17
@anaprietonem
Contributor

anaprietonem commented Jun 3, 2025

@floriankrb what was the conclusion in terms of adding/updating to zarr 3?

@floriankrb
Member Author

@anaprietonem I updated the description of the PR

@frazane
Contributor

frazane commented Jun 25, 2025

Just adding a small comment here. You've probably considered this already but here we go.

The transition to Zarr v3 specification comes with a nice opportunity: we could use the sharding feature! It would allow us to have chunked variables (avoiding having to load them all in memory before taking a subset) while still retaining very high read speeds. Basically we would make each timestamp a shard, and then chunk variables inside the shard. The downside is slower write speed, but it’s not very important.
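
To make the proposed layout concrete, here is a small sketch of the shard and chunk shapes it implies. It assumes a 4-D array ordered as (dates, variables, ensembles, grid points) and an illustrative chunk size; both are assumptions made for the example, and the function names are hypothetical. No zarr call is made: this is only the index arithmetic.

```python
import math


def shard_layout(shape, variables_per_chunk):
    """Return (shard_shape, chunk_shape, chunks_per_shard) for one-date shards.

    shape is assumed to be (dates, variables, ensembles, grid_points).
    """
    _, variables, ensembles, grid = shape
    # One shard holds everything for a single date.
    shard_shape = (1, variables, ensembles, grid)
    # Inside a shard, variables are split into smaller chunks.
    per_chunk = min(variables_per_chunk, variables)
    chunk_shape = (1, per_chunk, ensembles, grid)
    chunks_per_shard = math.ceil(variables / per_chunk)
    return shard_shape, chunk_shape, chunks_per_shard


def chunks_for_variables(selected, variables_per_chunk):
    """Inner chunk indices (within one shard) needed to read the selected variables."""
    return sorted({v // variables_per_chunk for v in selected})


if __name__ == "__main__":
    print(shard_layout((1000, 100, 1, 40320), variables_per_chunk=10))
    print(chunks_for_variables([0, 3, 57], variables_per_chunk=10))
```

With this layout, reading a subset of variables for one date touches only the inner chunks that contain them, rather than the whole shard.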

More info:

@github-actions github-actions bot added the documentation (Improvements or additions to documentation) label Jun 26, 2025
@b8raoult
Collaborator

b8raoult commented Jun 26, 2025

All tests are now passing with zarr2 and zarr3. At the moment of writing, zarr3 still does not support datetime64 (zarr-developers/zarr-python#2616)

Tests are much slower with zarr3; this will require more investigation. For example, test_slice_4 takes 98 s with zarr2 (about 1.6 minutes) and 612 s with zarr3 (about 10 minutes).

The profiler shows for zarr2:

   773147    0.339    0.000   95.489    0.000 .../anemoi-datasets/src/anemoi/datasets/data/stores.py:148(__getitem__)
   773165    1.553    0.000   95.152    0.000 .../python3.12/site-packages/zarr/core.py:656(__getitem__)

and for zarr3:

   773147    0.449    0.000  615.551    0.001 .../anemoi-datasets/src/anemoi/datasets/data/stores.py:148(__getitem__)
   773165    2.088    0.000  615.118    0.001 .../python3.12/site-packages/zarr/core/array.py:2294(__getitem__)

The test uses an in-memory zarr store and calls __getitem__ about 770k times. With a time per call of roughly 0.001 s for zarr3, 0.001 x 770000 = 770 s, which is in the ballpark of the 615 s measured.
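
The same estimate can be redone with the unrounded cumulative times from the two profiler listings above. This is only arithmetic on the numbers quoted in this comment; the variable names are illustrative.

```python
# Number of stores.py __getitem__ calls reported by the profiler (same for both runs).
CALLS = 773147

# Cumulative seconds spent in stores.py __getitem__, taken from the listings above.
CUMULATIVE_SECONDS = {"zarr2": 95.489, "zarr3": 615.551}


def per_call_seconds(total_seconds: float, calls: int) -> float:
    """Average wall time of one call."""
    return total_seconds / calls


per_call = {name: per_call_seconds(t, CALLS) for name, t in CUMULATIVE_SECONDS.items()}
slowdown = per_call["zarr3"] / per_call["zarr2"]

if __name__ == "__main__":
    for name, seconds in per_call.items():
        print(f"{name}: {seconds * 1e6:.0f} microseconds per call")
    print(f"zarr3 / zarr2 slowdown: {slowdown:.1f}x")
```

Because the call count is identical in both runs, the slowdown is entirely a per-call cost, which is consistent with the overhead sitting in the indexing path rather than in the test itself.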

@tjhunter tjhunter left a comment

@floriankrb @b8raoult this is fantastic, thank you for the hard work! We will try it this week or next and get back to you.

@mchantry mchantry moved this from Now In Progress to On Pause in Anemoi-dev Sep 17, 2025
@anaprietonem anaprietonem changed the title from "feat: zarr3" to "feat: Make anemoi-datasets agnostic to Zarr version" Feb 5, 2026
@anaprietonem anaprietonem changed the title from "feat: Make anemoi-datasets agnostic to Zarr version" to "feat: Make anemoi-datasets agnostic to Zarr version (Optional support Zarr3)" Feb 5, 2026
@TomNicholas

TomNicholas commented Feb 5, 2026

Do you guys need any help with this from the Zarr side? We have a lot of people wanting to use Anemoi with Zarr v3 (including the ones in #290).

> We should also take into account that zarr2 code cannot read datasets created by zarr3.
> So, ideally, we should
> 1 - update anemoi-datasets to work with zarr2 and zarr3

I'm not sure I understand why this is necessary, as zarr-python v3 can read and write zarr format version 2 and 3 data if you want. (We do this in xarray for example - you can use xarray with zarr-python v3 to read zarr format 2 data.)

> At the moment of writing, zarr3 still does not support datetime64

It does now!

> Tests are much slower with zarr3, this will require more investigation.

This was investigated in zarr-developers/zarr-python#3524 (comment) and seemed to satisfy you @b8raoult - is it still a blocker?

Also bear in mind that zarr-python 3 has been out for quite a while now - long enough in fact that xarray is seriously considering dropping support for zarr-python v2 (in accordance with SPEC0). Don't get left behind! 🙂

@anaprietonem
Contributor

anaprietonem commented Feb 6, 2026

Hey @TomNicholas, thanks for the interest! Indeed, we’re aware of the growing interest in Zarr v3, and that’s what this PR is aiming to address.

Where we are at the moment is that some changes are needed to enable Zarr v3 support in anemoi-datasets, those included in this PR, but also things like dropping support for Python 3.10. We already have many datasets built with Zarr v2, and it’s important for us to benchmark runtime performance for training workloads. We’ve done some analysis here: #486, feel free to take a look and share your thoughts. Our tests on HPC systems show some degradation in runtime performance when moving to the Zarr v3 Python package.

We’re still working on gathering more data points and exploring ways to improve this. This was discussed with @ecmwf/anemoi_technical_subgroup, and the decision was to merge Zarr v3 support into main in anemoi-datasets, but not yet remove the pyproject constraint on Zarr v2. This way, users who want to experiment with Zarr v3 can easily force it, while anemoi-datasets officially supports it. If we find ways to further optimise performance, we can simply update the pyproject constraints in the future.

@floriankrb @b8raoult @cathalobrien feel free to add to this!

@b8raoult
Collaborator

b8raoult commented Feb 7, 2026

To complement @anaprietonem's message: we have a very thin compatibility wrapper around zarr2 and zarr3 in our code, for APIs that have changed. This wrapper uses whatever version of zarr is installed. By default we mention version 2 in the dependencies, but a user can force a pip install of version 3 and the code will still work, with the caveat that zarr2 cannot read zarr3 datasets. Once we fully adopt zarr3, we will remove that wrapper and call the zarr API directly.
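
For readers unfamiliar with the pattern, a thin wrapper of this kind might look roughly like the sketch below. It is not the wrapper in anemoi-datasets: the function names are hypothetical, and the only API difference it covers is the local filesystem store class, which is `DirectoryStore` in zarr-python 2 and `LocalStore` in zarr-python 3.

```python
def local_store_class_name(zarr_major: int) -> str:
    """Name of the local filesystem store class in zarr.storage for a given major version."""
    return "DirectoryStore" if zarr_major < 3 else "LocalStore"


def open_local_store(path: str):
    """Create a local store using whichever zarr version is installed.

    zarr is imported lazily so that importing this module does not require it.
    """
    import zarr
    import zarr.storage

    major = int(zarr.__version__.split(".")[0])
    store_class = getattr(zarr.storage, local_store_class_name(major))
    return store_class(path)


if __name__ == "__main__":
    # Only the pure helper is exercised here; open_local_store needs zarr installed.
    for major in (2, 3):
        print(major, local_store_class_name(major))
```

Keeping each difference behind one small function means the wrapper can later be removed by replacing every call with the zarr3 branch.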

@b8raoult
Collaborator

b8raoult commented Feb 7, 2026

@TomNicholas, since you mention it in your comment above: is the issue of indexing speed for memory-only zarrs described in zarr-developers/zarr-python#3524 fixed? The ticket is still open and I have not tested again.

@TomNicholas

I don't think anyone has worked on that yet, no.

But is that a blocker for you? Your response here

> You may close this issue if you wish.

implied that it is not a blocker. But if it is a blocker then we would like to help unblock you.

Contributor

@anaprietonem anaprietonem left a comment

This PR LGTM to get it in to facilitate the transition to zarr3.

@floriankrb
Member Author

Downstream CI is failing due to a timeout on data.ecmwf.int; the tests pass when run separately. Merging.

@floriankrb floriankrb merged commit ab8cd71 into main Mar 31, 2026
362 of 374 checks passed
@floriankrb floriankrb deleted the feature/zarr3 branch March 31, 2026 14:29
@github-project-automation github-project-automation bot moved this from On Pause to Done in Anemoi-dev Mar 31, 2026
Labels

ATS Approval not needed
dependencies: Pull requests that update a dependency file
documentation: Improvements or additions to documentation
enhancement: New feature or request
tests

Projects

Status: Done

8 participants