Add CLI for converting v2 metadata to v3 #3257

Open: wants to merge 66 commits into base: main

K-Meech
Contributor

@K-Meech K-Meech commented Jul 16, 2025

For #1798

Adds a CLI using typer to convert v2 metadata (.zarray / .zattrs...) to v3 metadata zarr.json.

To test, you will need to install the new optional cli dependency e.g.
pip install -e ".[remote,cli]"

This should make the zarr-converter command available e.g. try:

zarr-converter --help
zarr-converter convert --help
zarr-converter clear --help

convert adds zarr.json files to every group / array, leaving the v2 metadata as-is. A zarr with both sets of metadata can still be opened with zarr.open, but will give a UserWarning: "Both zarr.json (Zarr format 3) and .zarray (Zarr format 2) metadata objects exist... Zarr v3 will be used." This can be avoided by passing zarr_format=3 to zarr.open, or by using the clear command to remove the v2 metadata.
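To make the "add zarr.json next to the v2 files" idea concrete, here is a rough standard-library sketch for a single local array directory. It is not the code in this PR: v2_array_to_v3 and add_v3_metadata are hypothetical names, the dtype table is a small assumed subset, and it only handles uncompressed, unfiltered, C-ordered arrays. Fill values that need special encoding (null, NaN and so on) are passed through untouched and would need extra handling in practice.

```python
import json
from pathlib import Path
from typing import Optional

# Assumed subset of v2 (NumPy-style) dtype strings and their v3 data type names.
V2_TO_V3_DTYPE = {
    "|b1": "bool",
    "|i1": "int8",
    "|u1": "uint8",
    "<i4": "int32",
    "<i8": "int64",
    "<f4": "float32",
    "<f8": "float64",
}


def v2_array_to_v3(zarray: dict, zattrs: Optional[dict] = None) -> dict:
    """Hypothetical helper: build a v3 array document from v2 metadata."""
    if zarray.get("compressor") or zarray.get("filters"):
        raise NotImplementedError("sketch handles uncompressed, unfiltered arrays only")
    if zarray.get("order", "C") != "C":
        raise NotImplementedError("sketch handles C-ordered arrays only")
    return {
        "zarr_format": 3,
        "node_type": "array",
        "shape": zarray["shape"],
        "data_type": V2_TO_V3_DTYPE[zarray["dtype"]],
        "chunk_grid": {
            "name": "regular",
            "configuration": {"chunk_shape": zarray["chunks"]},
        },
        # Keep the v2 chunk key layout so existing chunk files are still found.
        "chunk_key_encoding": {
            "name": "v2",
            "configuration": {"separator": zarray.get("dimension_separator", ".")},
        },
        "fill_value": zarray["fill_value"],  # passed through unchanged (see lead-in)
        "codecs": [{"name": "bytes", "configuration": {"endian": "little"}}],
        "attributes": zattrs or {},
    }


def add_v3_metadata(array_dir: Path) -> Path:
    """Write zarr.json beside .zarray / .zattrs without touching the v2 files."""
    zarray = json.loads((array_dir / ".zarray").read_text())
    zattrs_path = array_dir / ".zattrs"
    zattrs = json.loads(zattrs_path.read_text()) if zattrs_path.exists() else {}
    out = array_dir / "zarr.json"
    out.write_text(json.dumps(v2_array_to_v3(zarray, zattrs), indent=2))
    return out
```

After a call like add_v3_metadata(Path("image.zarr/0")) (an illustrative path), both .zarray and zarr.json exist in the directory, which is the situation that triggers the UserWarning described above.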

clear can also remove v3 metadata. This is useful if the conversion fails part way through e.g. if one of the arrays uses a codec with no v3 equivalent.
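As an illustration of what clearing one format's metadata means for a local directory store, the following standard-library sketch deletes the per-node metadata files of the chosen format and leaves chunk files alone. It is not the PR's implementation (which works through zarr stores); remove_metadata is a hypothetical name, and consolidated .zmetadata files are deliberately ignored here.

```python
import os
from pathlib import Path
from typing import List

# Per-node metadata file names for each Zarr format.
METADATA_FILES = {
    2: {".zarray", ".zattrs", ".zgroup"},
    3: {"zarr.json"},
}


def remove_metadata(root: Path, zarr_format: int) -> List[Path]:
    """Hypothetical helper: delete metadata files of one format under root."""
    names = METADATA_FILES[zarr_format]
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename in names:
                path = Path(dirpath) / filename
                path.unlink()
                removed.append(path)
    return sorted(removed)
```

Calling remove_metadata(root, 3) after a failed conversion would return the hierarchy to its v2-only state, while remove_metadata(root, 2) after a successful one would leave only zarr.json files.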

All code for the cli is in src/zarr/core/metadata/converter/cli.py, with the actual conversion functions in src/zarr/core/metadata/converter/converter_v2_v3.py. These functions can be called directly, for those who don't want to use the CLI (although currently they are part of /core which is considered private API, so it may be best to move them elsewhere in the package).

Some points to consider:

  • I had to modify set_path from test_dtype_registry.py and test_codec_entrypoints.py, as they were causing the CLI tests to fail whenever the CLI tests ran after them. This seems to be due to the lazy_load_list of the numcodecs codec registries being cleared, meaning the codecs were no longer available in my code which finds the numcodecs.zarr3 equivalent of a numcodecs codec.
  • I tested this on local zarr images, so it would be great if someone with access to s3 / google cloud etc., could try it out on some small example images there.
  • I'm happy to add docs about how to use the CLI, but wanted to get feedback on the general structure first

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) Jul 16, 2025
@dstansby
Contributor

I'd suggest putting the metadata converter python API in a new zarr.metadata submodule (which can contain other bits of metadata migrated from zarr.core.metadata later), and hiding the CLI wrapper in a new private zarr._cli submodule (or zarr._cli.py file).

@dstansby dstansby added this to the 3.2.0 milestone Jul 31, 2025
@K-Meech
Contributor Author

K-Meech commented Aug 1, 2025

I've made most of the requested changes to the implementation now. I've responded to the remaining ones in the review thread above - @dstansby let me know if you have any suggestions for these, or for the refactored implementation / tests.

While changing the migration functions to accept Store directly (rather than StoreLike), I made some changes in zarr/storage/_common.py. This refactored make_store_path, to add a new make_store function. I also updated the docstrings - the existing make_store_path one was a bit out of date (e.g. didn't mention FSMap).

Also, I noticed that I don't have any handling for consolidated metadata (.zmetadata) files at the moment. Would it be fine to just error if this is encountered, or should conversion be included?

@dstansby
Contributor

dstansby commented Aug 4, 2025

> While changing the migration functions to accept Store directly (rather than StoreLike), I made some changes in zarr/storage/_common.py. This refactored make_store_path, to add a new make_store function. I also updated the docstrings - the existing make_store_path one was a bit out of date (e.g. didn't mention FSMap).

Nice! Unfortunately, in pursuit of fixing #3295 I also did a refactor at #3308 - I think we should probably merge my PR first (sorry!) and then rebase this one later, since this is a feature and my refactor is a pathway to fixing a bug.

> Also, I noticed that I don't have any handling for consolidated metadata (.zmetadata) files at the moment. Would it be fine to just error if this is encountered, or should conversion be included?

I think it's fine to gracefully error on consolidated metadata for now, and add support as a follow up feature in a future PR.

@K-Meech
Contributor Author

K-Meech commented Aug 4, 2025

Thanks for the info @dstansby ! I see that #3308 has been merged now, so I'll go ahead and fix the merge conflicts with this branch.

@K-Meech
Contributor Author

K-Meech commented Aug 4, 2025

All conflicts are now fixed. I also added a line to stop conversion if consolidated metadata is detected.
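For readers wondering what "stop conversion if consolidated metadata is detected" could look like, here is a minimal sketch for a local directory store. The function name and the error message are hypothetical, and the actual check in this PR operates on a zarr Store rather than on a filesystem path.

```python
from pathlib import Path


def ensure_not_consolidated(root: Path) -> None:
    """Hypothetical guard: refuse to convert a hierarchy that has .zmetadata."""
    if (root / ".zmetadata").exists():
        raise NotImplementedError(
            f"Consolidated metadata found at {root / '.zmetadata'}; "
            "converting it is not supported in this sketch."
        )
```

Running such a guard before any zarr.json file is written means a consolidated store fails cleanly instead of ending up half converted.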

@dstansby dstansby modified the milestones: 3.2.0, 3.1.2 Aug 5, 2025
Contributor

@dstansby dstansby left a comment

Looking great! I think outstanding todos here are:

  • Add a release note entry
  • Add a new section to the user guide docs to advertise the new CLI
  • Work out how to do logging (see inline thread discussion)

@d-v-b
Contributor

d-v-b commented Aug 6, 2025

As a general comment about converting v2 -> v3: converting the v2 filters and compressor to the v3 codecs is hard. I suspect a general solution will not really be possible in this PR, because it relies on features that don't exist in Zarr Python yet. I am working on this (see #3162, #3276), but it's a big lift. Getting PRs like #3318 merged is an important part of this process. But once we get these features in, we can generically convert filters + compressors to codecs.

@K-Meech
Contributor Author

K-Meech commented Aug 6, 2025

Thanks @d-v-b - at the moment this PR:

  • Explicitly converts v2 numcodecs blosc / zstd / gzip to the v3 zarr.codecs equivalent.
  • For any other v2 numcodecs codec, it tries to find a matching numcodecs.zarr3 codec (by name), then initialises it with the same settings from .get_config. This works fine for some codecs, but does cause issues for others, e.g. #3256 (Numcodecs Delta filter throws AttributeError when astype is specified).
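The two-step strategy in the bullets above can be sketched on plain dictionaries, without importing zarr or numcodecs. This is only an illustration under stated assumptions: v2_codec_to_v3 is a hypothetical name, the set of available v3 codec names stands in for the real codec registry, the "numcodecs." name prefix is an assumption about how numcodecs.zarr3 codecs are registered, and blosc is left out for brevity.

```python
from typing import Any, Dict, Set


def v2_codec_to_v3(config: Dict[str, Any], available_v3_names: Set[str]) -> Dict[str, Any]:
    """Hypothetical helper: map a v2 codec config dict to a v3 codec document."""
    codec_id = config["id"]
    settings = {key: value for key, value in config.items() if key != "id"}

    # Step 1: codecs with a known native v3 equivalent are converted explicitly.
    if codec_id == "gzip":
        return {"name": "gzip", "configuration": {"level": settings["level"]}}
    if codec_id == "zstd":
        return {
            "name": "zstd",
            "configuration": {
                "level": settings["level"],
                "checksum": settings.get("checksum", False),
            },
        }

    # Step 2: otherwise look for a same-named wrapper codec and reuse the settings.
    candidate = f"numcodecs.{codec_id}"  # assumed naming scheme
    if candidate not in available_v3_names:
        raise ValueError(f"No v3 equivalent found for v2 codec {codec_id!r}")
    return {"name": candidate, "configuration": settings}
```

Reusing the settings unchanged in step 2 is exactly where problems like #3256 can surface, since a v3 wrapper may not accept every option that the v2 codec reports.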

Happy to wait on merging this PR until the other issues / PRs you mentioned are resolved.

@d-v-b
Contributor

d-v-b commented Aug 6, 2025

@K-Meech that approach seems good, and I don't think this effort should be blocked by codec changes in the background.

@github-actions github-actions bot removed the "needs release notes" label Aug 6, 2025
@K-Meech
Contributor Author

K-Meech commented Aug 7, 2025

@dstansby I think I've addressed all of your new comments now + I added release notes and a user guide docs page. There's still a comment about filters from a while ago - any thoughts on that one?

Also, let me know if you have any comments on the new changes - I had to make some small modifications to handle conflicts with the latest changes to the main branch.

Contributor

@dstansby dstansby left a comment

🎉 I think this is good now - I have one question about use of the logger, but it's not a blocker. I'll let this sit for a week or so because it's complicated, and would benefit from a second reviewer. If no-one reviews by then, I'll merge.


app = typer.Typer()

logger = logging.getLogger(__name__)
Contributor

Is it deliberate that this is a new logger, instead of importing the logger object from zarr? I don't think it matters too much, but re-using zarr._logger might save some code duplication because you could remove functions from this file for configuring the logger.
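One point that may help with this question: in Python's logging module, a logger obtained with logging.getLogger(__name__) inside a package is a child of the package-level logger, so by default it inherits that logger's level and handlers. The sketch below demonstrates this with made-up logger names ("demo_pkg" and "demo_pkg.cli"); how zarr's own logger is configured is not shown here and is not assumed.

```python
import io
import logging


def demo_child_logger() -> str:
    """Show that a child logger uses handlers configured on its parent."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(name)s:%(levelname)s:%(message)s"))

    # Configure only the package-level logger.
    package_logger = logging.getLogger("demo_pkg")
    package_logger.setLevel(logging.INFO)
    package_logger.addHandler(handler)
    try:
        # What logging.getLogger(__name__) would return inside demo_pkg/cli.py.
        module_logger = logging.getLogger("demo_pkg.cli")
        module_logger.info("converted %d arrays", 3)
    finally:
        package_logger.removeHandler(handler)
    return stream.getvalue()
```

The returned text carries the child logger's name even though no handler was attached to it directly, which suggests a separate module logger does not by itself require separate configuration functions.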

5 participants