feat: Make anemoi-datasets agnostic to Zarr version (Optional support Zarr3) #220
floriankrb merged 8 commits into main
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #220 +/- ##
==========================================
+ Coverage 72.96% 73.85% +0.88%
==========================================
Files 10 10
Lines 825 872 +47
==========================================
+ Hits 602 644 +42
- Misses 223 228 +5
|
@floriankrb what was the conclusion in terms of adding/updating to zarr 3? |
|
@anaprietonem I updated the description of the PR |
|
Just adding a small comment here. You've probably considered this already but here we go. The transition to Zarr v3 specification comes with a nice opportunity: we could use the sharding feature! It would allow us to have chunked variables (avoiding having to load them all in memory before taking a subset) while still retaining very high read speeds. Basically we would make each timestamp a shard, and then chunk variables inside the shard. The downside is slower write speed, but it’s not very important. More info: |
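To make the sharding suggestion above concrete, here is a minimal pure-Python sketch of the read cost it targets. It does not call zarr at all. The array layout `(time, variable, grid point)`, all sizes, and the helper names `chunks_touched` and `bytes_read_per_timestamp` are assumptions chosen for illustration, not anything taken from anemoi-datasets or the Zarr API.

```python
from math import prod

# Hypothetical layout: (time, variable, grid point). Sizes are illustrative only.
SHARD_SHAPE = (1, 100, 40_000)  # one shard (one stored object) per timestamp
CHUNK_SHAPE = (1, 10, 40_000)   # inner chunks inside a shard: 10 variables each
ITEMSIZE = 4                    # bytes per value, assuming float32


def chunks_touched(variables, chunk_vars):
    """Return the sorted chunk indices along the variable axis needed to read `variables`."""
    return sorted({v // chunk_vars for v in variables})


def bytes_read_per_timestamp(variables, chunk_shape):
    """Uncompressed bytes that must be read for one timestamp and a subset of variables."""
    n_chunks = len(chunks_touched(variables, chunk_shape[1]))
    return n_chunks * prod(chunk_shape) * ITEMSIZE


subset = [0, 1, 2, 57]

# With inner chunks, only the chunks holding the requested variables are read.
with_inner_chunks = bytes_read_per_timestamp(subset, CHUNK_SHAPE)

# Without inner chunks, the whole timestamp is a single chunk and is read in full.
whole_timestamp = bytes_read_per_timestamp(subset, SHARD_SHAPE)

print(with_inner_chunks, whole_timestamp)
```

Under these assumed sizes, the subset read touches two inner chunks instead of the full timestamp, which is the memory saving the comment describes. The number of stored objects stays at one per timestamp in both cases.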
|
All tests now pass with zarr2 and zarr3. At the time of writing, zarr3 still does not support datetime64 (zarr-developers/zarr-python#2616). Tests are much slower with zarr3; this will require more investigation. For example The profiler shows for zarr2: and for zarr3: The test uses an in-memory zarr store. The test calls
tjhunter left a comment
@floriankrb @b8raoult this is fantastic, thank you for the hard work! We will try it this week or next and get back to you.
|
Do you guys need any help with this from the Zarr side? We have a lot of people wanting to use Anemoi with Zarr v3 (including the ones in #290).
I'm not sure I understand why this is necessary, as zarr-python v3 can read and write zarr format version 2 and 3 data if you want. (We do this in xarray for example - you can use xarray with zarr-python v3 to read zarr format 2 data.)
It does now!
This was investigated in zarr-developers/zarr-python#3524 (comment) and seemed to satisfy you @b8raoult - is it still a blocker? Also bear in mind that zarr-python 3 has been out for quite a while now - long enough in fact that xarray is seriously considering dropping support for zarr-python v2 (in accordance with SPEC0). Don't get left behind! 🙂 |
|
Hey @TomNicholas, thanks for the interest! Indeed, we’re aware of the growing interest in Zarr v3, and that’s what this PR is aiming to address. Where we are at the moment is that some changes are needed to enable Zarr v3 support in anemoi-datasets, those included in this PR, but also things like dropping support for Python 3.10. We already have many datasets built with Zarr v2, and it’s important for us to benchmark runtime performance for training workloads. We’ve done some analysis here: #486, feel free to take a look and share your thoughts. Our tests on HPC systems show some degradation in runtime performance when moving to the Zarr v3 Python package. We’re still working on gathering more data points and exploring ways to improve this. This was discussed with @ecmwf/anemoi_technical_subgroup, and the decision was to merge Zarr v3 support into main in anemoi-datasets, but not yet remove the pyproject constraint on Zarr v2. This way, users who want to experiment with Zarr v3 can easily force it, while anemoi-datasets officially supports it. If we find ways to further optimise performance, we can simply update the pyproject constraints in the future. @floriankrb @b8raoult @cathalobrien feel free to add to this! |
|
To complement @anaprietonem's message: we have a very thin compatibility wrapper around zarr2 and zarr3 in our code, for APIs that have changed. This wrapper will use whatever version of zarr is installed. By default we mention version 2 in the dependencies, but a user can force a pip install of version 3, and the code will still work, with the caveat that zarr2 cannot read zarr3 datasets. Once we fully adopt zarr3, we will remove that wrapper and call the zarr API directly.
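As a rough illustration of what "use whatever version of zarr is installed" could look like, here is a minimal sketch based only on the standard library. It is not the wrapper from this PR: the function names and the backend labels `zarr2_backend` and `zarr3_backend` are hypothetical, and the real code may detect the version differently.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Extract the major version number from a string such as '2.18.3'."""
    return int(version_string.split(".")[0])


def installed_zarr_major():
    """Return the major version of the installed zarr package, or None if zarr is absent."""
    try:
        return major_version(version("zarr"))
    except PackageNotFoundError:
        return None


def select_backend(major):
    """Pick a (hypothetical) backend name for the given zarr major version."""
    if major == 2:
        return "zarr2_backend"
    if major == 3:
        return "zarr3_backend"
    raise RuntimeError(f"Unsupported or missing zarr installation (major={major})")
```

Keeping the version check in one place means the rest of the code never branches on the zarr version, which is what would make the wrapper easy to delete once zarr3 is fully adopted.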
|
@TomNicholas, as you mention it in your comment above, is the issue of indexing speed for memory only zarrs described in zarr-developers/zarr-python#3524 fixed? The ticket is still open and I have not tested again. |
|
I don't think anyone has worked on that yet, no. But is that a blocker for you? Your response here implied that it is not a blocker. But if it is a blocker then we would like to help unblock you.
anaprietonem left a comment
This PR LGTM to get it in to facilitate the transition to zarr3.
|
Downstream CI is failing due to a timeout on data.ecmwf.int; the tests pass when run separately. Merging.
Description
Anemoi-datasets should be agnostic to the version of zarr. It should run with zarr3 installed or with zarr2 installed.
We should also take into account that zarr2 code cannot read datasets created by zarr3. So, ideally, we should
1 - update anemoi-datasets to work with zarr2 and zarr3
2 - keep using zarr2 to build datasets (dependency of anemoi-datasets[create] on zarr<=2), but have zarr2 or 3 when reading (dependency of anemoi-datasets on zarr)
3 - when users have updated their environment (6 months?), start building zarr3 datasets
This PR addresses the first point, making anemoi-datasets detect the version of zarr and adapt to it.
We still pin the version to zarr 2 in the pyproject.toml because of performance issues with zarr version 3.
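Because zarr2 code cannot read datasets created by zarr3, a reader may want to check the on-disk format before opening a dataset. The sketch below is one way to do that for a local directory store, relying on the fact that Zarr format 2 keeps its root metadata in `.zgroup` or `.zarray` while format 3 uses `zarr.json`. The function name `detect_zarr_format` is hypothetical and is not part of anemoi-datasets or zarr.

```python
import json
import tempfile
from pathlib import Path


def detect_zarr_format(store_path):
    """Guess the Zarr format version of a local directory store from its root metadata files."""
    root = Path(store_path)
    if (root / "zarr.json").is_file():
        return 3  # Zarr format 3 stores metadata in zarr.json
    if (root / ".zgroup").is_file() or (root / ".zarray").is_file():
        return 2  # Zarr format 2 stores metadata in .zgroup / .zarray
    return None   # not recognisable as a Zarr store


# Usage with two minimal, hand-written stores (metadata only, no array data).
with tempfile.TemporaryDirectory() as tmp:
    v2 = Path(tmp) / "v2.zarr"
    v2.mkdir()
    (v2 / ".zgroup").write_text(json.dumps({"zarr_format": 2}))

    v3 = Path(tmp) / "v3.zarr"
    v3.mkdir()
    (v3 / "zarr.json").write_text(json.dumps({"zarr_format": 3, "node_type": "group"}))

    formats = (detect_zarr_format(v2), detect_zarr_format(v3))

print(formats)
```

A check like this would let an environment with zarr2 installed fail early with a clear message instead of an obscure read error. Remote stores would need the same test expressed through the store's own listing mechanism.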
📚 Documentation preview 📚: https://anemoi-datasets--220.org.readthedocs.build/en/220/