`ManifestStore.to_zarr()` #698

TomNicholas · 2025-07-18T20:50:32Z

TomNicholas
Jul 18, 2025
Maintainer

@ilangold asked if there was any pre-existing tool for converting HDF5 files directly into Zarr stores, and it occurred to me that you could do this very efficiently using ManifestStore:

Call the Parser on the filepath for hdf5/other file to determine all chunk locations and metadata,
Use obstore to get all the chunk data key-by-key,
Use obstore to put all the chunk data and metadata into the new store location key-by-key. Voila!

Pros and cons compared to using xarray

Of course you could alternatively use VirtualiZarr like a runtime translation layer and then use xarray's .to_zarr() like this:

from virtualizarr.parsers import TiffParser

manifest_store = TiffParser("<file.tiff>")

ds = xr.open_zarr(manifest_store)

ds.to_zarr('output_path.zarr')

but that has a couple of significant downsides:

Using zarr-python (here via xarray) to read chunks will decompress them, which is pointless here as you know you're about to just recompress them before writing them again,
Xarray's open_dataset/zarr does other complicated decoding steps that you also don't need here,
Xarray writes zarr with some extra conventions, particularly adding a coordinates metadata field,
Xarray Datasets / DataTrees cannot represent all Zarr stores.

The main downsides of this suggestion are that:

It would only work for file formats that are virtualizable (i.e. they can be read from efficiently using targeted byte range requests), whereas there are other file formats which are not virtualizable but whose data is still convertable to zarr (e.g. tabular formats).
You couldn't rechunk on the way, at least not without significantly complicating the implementation.
The API gives you little opportunity to drop arrays or chunks.

Given those restrictions, I'm not sure this idea really offers anything beyond your existing two options of

Creating a virtual icechunk store (either using xarray right now or even without xarray if we merge ManifestStore.to_icechunk() - see Add .to_icechunk() method to ManifestGroup #591).
Creating a native zarr store using xarray.

So I'm not sure this is that useful, but it's interesting! 😆

ilan-gold · 2025-09-13T19:51:15Z

ilan-gold
Sep 13, 2025

I'm super sorry I didn't continue the conversation here - it's super rude of me to have forgotten about this! My memory is bad and I had just come back from vacation excited to get back to things :) I think definitely the first suggestion would be the way to go for this package, but I have some broader wishes/concern for this "copy" functionality, so I'll try to continue a bit what I was saying in the zulip chat:

Rechunking/new compression

Since hdf5 is by-default uncompressed and uses much smaller chunk sizes (in my experience) than zarr does/would (or none at all), I would certainly want an API to handle these, for example:

import h5py, numpy as np, zarr

in_memory = np.arange(3000 * 3000).reshape((3000, 3000))

with h5py.File("foo_default.h5", mode="w") as f:
    f_no_chunks = f.create_dataset("default", data=in_memory) # so no options comes out unchunked and uncompressed
    assert f_no_chunks.chunks is None
with h5py.File("foo_default_chunks.h5", mode="w") as f:
    f_chunks = f.create_dataset("auto_chunked", data=in_memory, chunks=True) # auto chunking comes out quite small
    print("default h5 chunking", f_chunks.chunks)
with h5py.File("foo_default_compressed_chunks.h5", mode="w") as f:
    f_chunks_compressed = f.create_dataset("auto_chunked_compressed", data=in_memory, compression="gzip") # compressed comes out quite small as well, same as auto without compression
    print("default compressed h5 chunking", f_chunks_compressed.chunks)

z = zarr.create_array("foo.zarr", data=in_memory)
print("zarr chunks", z.chunks) # bigger than hdf5 in both cases

For me, the chunk size is the same for compressed and uncompressed hdf5, and in any case, much smaller than the zarr defaults. I would love to be able to provide our users with a copy function that we wrap up with some nice defaults, including sharding and compression. At the moment, I don't think zarr provides autosharding, and I'd love to be able to take an hdf5 file and translate it to a sharded zarr store.

Non-supported compressors

I remember you mentioning xarray tracks compressors so presumably there is some handling somewhere for this sort of thing, but both zarr and hdf5 have extensible compressor ecosystems, yeah? So I wonder about handling some of those - I'm not sure how widespread they are, but definitely something that comes to mind in terms of hindering a byte-for-byte transfer

Non-numeric data types

My data type knowledge is minimal, so apologies if I'm off-base here as well. Here's an adapted example from our anndata codebase (link is in the snippet here). I am not really sure how to translate the hdf5 string we have here to zarr, but my attempt here (also based on what we do for strings in zarr) seems to fail to produce a bytes-for-bytes equivalent:

import h5py
import zarr

arr = np.array(["a", "b", "c"], dtype=np.dtypes.StringDType()) # object also has the same behavior

# See https://github.com/scverse/anndata/blob/1ba19458fd68483a9b12b3f29cea83a0daa29962/src/anndata/_io/specs/methods.py#L574-L588 for where this is applied.  Every anndata store has this because a string index is stored always.
with h5py.File("string.h5", mode="w") as f:
    str_dtype = h5py.special_dtype(vlen=str)
    f.create_dataset("a_string", data=arr.astype(str_dtype), dtype=str_dtype)
    h5_offset = f["a_string"].id.get_offset()
    h5_nbytes = f["a_string"].nbytes

with open("string.h5", mode="rb") as f:
    f.seek(h5_offset)
    h5_bytes = f.read(h5_nbytes)

z = zarr.create_array("string.zarr", shape=(3,), dtype=zarr.core.dtype.VariableLengthUTF8())
z[...] = arr

with open("string.zarr/c/0", mode="rb") as f:
    zarr_bytes = f.read()

print("zarr bytes", zarr_bytes)
print("h5 bytes", h5_bytes)

assert zarr_bytes != h5_bytes

Towards an Implementation

Maybe these three things are not really valid concerns and/or unrealistic hopes, but they come to mind as shortcomings of only having a byte-for-byte or having a byte-for-byte approach as being guaranteed to generate a valid zarr store. If there were a way to guarantee a valid zarr store as the destination in the byte-for-byte approach (mainly something addressing the last two points above), I would definifely be up for implementing this (although I think there are other pitfalls potentially besides those above, like https://docs.h5py.org/en/stable/special.html#enumerated-types which I am not sure has a zarr equivalent). It sounds like xarray to some degree addresses these issues but with some baggage and/or lack of generality.

I'll try to flesh what I'm going for a bit more:

def blocked_copy(src: Array, dest: Array, **blocked_copy_settings) -> None:
   """Here copies from one to another happen in blocks according to user settings to control how much data is held in memory at any given time"""
   ...

def copy_zarr(src: Group, dest: Group, callback: Callable[[str, Group, Group, Callable[[str, Group, Group], None]], None] | None):
   """Top-level API that copies from src to dest, traversing the source group, with an optional callback to customize the creation of the destination array at a given string key."""
   ...

z_src = zarr.open_group(manifest_store, mode="r", zarr_format=3)
with zarr.config.set(zarr_config_settings):
    z_dest = zarr.open_group(store="my_new_store.zarr", mode="w", zarr_format=3)

def callback(key: str, src: Group, dest: Group, byte_for_byte_copy:  Callable[[str, Group, Group], None]]) -> None:
    if isinstance(src[key], Array):
        if key == "a_key_with_special_settings":
            dest_array = dest.create_array(key, compression_chunking_sharding_settings_not_like_src)
            src_array = src[key]
            blocked_copy(src_array, dest_array, **some_blocked_copy_settings)
        else:
            byte_for_byte_copy(key, src, dest) # if we could guarantee somehow that the destination array is valid at the destination
    else:
         dest_group = dest.create_group(key)
        # do some attrs updates
         ...
        
copy_zarr(z_src, z_dest)

The zarr_config_settings could control things like the codec pipeline for speeding up i/o of sharded stores as well as concurrency. The blocked_copy_settings would control basically how much data is loaded into memory at a given time. I think some or most of the above functions should be async as well (being able to copy multiple arrays simultaneously seems good?).

These functions could also be used for rechunking existing zarr, or handling metadata format updates to nested stores (thinking of ome-zarr) where the actual bytes are not touched. Even further, a kerchunk style representation could be cool where you re-write the nested (updated) group metadata somewhere else and then have it point to the original bytes.

Conclusion

I think a byte-for-byte mover would be ideal for the use-case where people want to do that exactly, but I'd have concerns about using it without guarantees on the validity of the destination. If these could be allayed or you could point me to how this could be guaranteed, I'd love to implement it here. I also think an API for traversing a Group and handling how it is created would be super cool for broader use-cases while ensuring validity of the output at the expense of an exact match to the source, potentially.

Addendum/Update

I realized that creating an in-memory representation of an on-disk h5 store could in theory guarantee the byte-for-byte compatibility - after all, if one could generate valid zarr metadata then the source and destination might be byte-for-byte the same (although also maybe not). But that assumes one can actually do such a thing.

To this end, I tried opening the hdf5 store with a string using virtualizarr and initially, it crashed at

File ~/Projects/Theis/VirtualiZarr/virtualizarr/parsers/hdf/hdf.py:77, in _construct_manifest_array(filepath, dataset, group)
     73     attrs["_FillValue"] = encoded_cf_fill_value
     75 codec_configs = [zarr_codec_config_to_v3(codec.get_config()) for codec in codecs]
---> 77 fill_value = dataset.fillvalue.item()
     78 dims = tuple(_dataset_dims(dataset, group=group))
     79 metadata = create_v3_array_metadata(
     80     shape=dataset.shape,
     81     data_type=dtype,
   (...)
     86     attributes=attrs,
     87 )

AttributeError: 'bytes' object has no attribute 'item'

and then when I added a workaround for this it crashed when trying to generate the metadata, which made more sense

...

File ~/Projects/Theis/VirtualiZarr/virtualizarr/parsers/hdf/hdf.py:81, in _construct_manifest_array(filepath, dataset, group)
     79     fill_value = fill_value.item()
     80 dims = tuple(_dataset_dims(dataset, group=group))
---> 81 metadata = create_v3_array_metadata(
     82     shape=dataset.shape,
     83     data_type=dtype,
     84     chunk_shape=chunks,
     85     fill_value=fill_value,
     86     codecs=codec_configs,
     87     dimension_names=dims,
     88     attributes=attrs,
     89 )
     90 manifest = _dataset_chunk_manifest(filepath, dataset)
     91 return ManifestArray(metadata=metadata, chunkmanifest=manifest)

File ~/Projects/Theis/VirtualiZarr/virtualizarr/manifests/utils.py:79, in create_v3_array_metadata(shape, data_type, chunk_shape, chunk_key_encoding, fill_value, codecs, attributes, dimension_names)
     41 def create_v3_array_metadata(
     42     shape: tuple[int, ...],
     43     data_type: np.dtype,
   (...)
     49     dimension_names: Iterable[str] | None = None,
     50 ) -> ArrayV3Metadata:
     51     """
     52     Create an ArrayV3Metadata instance with standard configuration.
     53     This function encapsulates common patterns used across different parsers.
   (...)
     77         A configured ArrayV3Metadata instance with standard defaults
     78     """
---> 79     zdtype = parse_data_type(data_type, zarr_format=3)
     80     return ArrayV3Metadata(
     81         shape=shape,
     82         data_type=zdtype,
   (...)
     95         storage_transformers=None,
     96     )

File ~/Projects/Theis/VirtualiZarr/venv/lib/python3.12/site-packages/zarr/core/dtype/__init__.py:225, in parse_data_type(dtype_spec, zarr_format)
    187 def parse_data_type(
    188     dtype_spec: ZDTypeLike,
    189     *,
    190     zarr_format: ZarrFormat,
    191 ) -> ZDType[TBaseDType, TBaseScalar]:
    192     """
    193     Interpret the input as a ZDType.
    194 
   (...)
    223     DateTime64(endianness='little', scale_factor=10, unit='s')
    224     """
--> 225     return parse_dtype(dtype_spec, zarr_format=zarr_format)

File ~/Projects/Theis/VirtualiZarr/venv/lib/python3.12/site-packages/zarr/core/dtype/__init__.py:278, in parse_dtype(dtype_spec, zarr_format)
    275     return VariableLengthUTF8()  # type: ignore[return-value]
    276 # otherwise, we have either a numpy dtype string, or a zarr v3 dtype string, and in either case
    277 # we can create a native dtype from it, and do the dtype inference from that
--> 278 return get_data_type_from_native_dtype(dtype_spec)

File ~/Projects/Theis/VirtualiZarr/venv/lib/python3.12/site-packages/zarr/core/dtype/__init__.py:174, in get_data_type_from_native_dtype(dtype)
    172 else:
    173     na_dtype = dtype
--> 174 return data_type_registry.match_dtype(dtype=na_dtype)

File ~/Projects/Theis/VirtualiZarr/venv/lib/python3.12/site-packages/zarr/core/dtype/registry.py:161, in DataTypeRegistry.match_dtype(self, dtype)
    151 if dtype == np.dtype("O"):
    152     msg = (
    153         f"Zarr data type resolution from {dtype} failed. "
    154         'Attempted to resolve a zarr data type from a numpy "Object" data type, which is '
   (...)
    159         "data type, see https://github.com/zarr-developers/zarr-python/issues/3117"
    160     )
--> 161     raise ValueError(msg)
    162 matched: list[ZDType[TBaseDType, TBaseScalar]] = []
    163 for val in self.contents.values():

ValueError: Zarr data type resolution from object failed. Attempted to resolve a zarr data type from a numpy "Object" data type, which is ambiguous, as multiple zarr data types can be represented by the numpy "Object" data type. In this case you should construct your array by providing a specific Zarr data type. For a list of Zarr data types that are compatible with the numpy "Object"data type, see https://github.com/zarr-developers/zarr-python/issues/3117

But then even if one did provide the right "interpretation" for the object data type in zarr-speak, that wouldn't mean that the destination bytes under that data type are the same as the source bytes, I think. I could make a parser that just lies and says "this h5 on-disk data is of zarr vlen data type and so when you get it in memory, it will be numpy string type" but on disk it is really however hdf5 stores it (which may or may not be the same as how zarr would represent that type on-disk). In this case, I don't know how to represent the metadata for the on-disk hdf5 bytes as a zarr dtype that would be byte-for-byte the same as as the h5 bytes. And for pure reading use-cases, I would imagine it doesn't matter.

1 reply

ilan-gold Sep 15, 2025

Following up on two points:

For unchunked data in hdf5, I definitely would want that to be chunked when converting to zarr and a lot of our data (in fact, by default) is uncompressed and unchunked
I just came across https://rechunker.readthedocs.io/en/latest/api.html and your name in the same sentence. This project looks in the ballpark of what I want, but the API might not be expressive enough, it seems, for the sort of nested + re-compress operations I'm looking for. Will dig into it a bit. So perhaps something built on top of this would be the way to go. Very cool though!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`ManifestStore.to_zarr()` #698

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 1 comment 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

ManifestStore.to_zarr() #698

Uh oh!

Uh oh!

TomNicholas Jul 18, 2025 Maintainer

Pros and cons compared to using xarray

Replies: 1 comment · 1 reply

Uh oh!

Uh oh!

ilan-gold Sep 13, 2025

Rechunking/new compression

Non-supported compressors

Non-numeric data types

Towards an Implementation

Conclusion

Addendum/Update

Uh oh!

ilan-gold Sep 15, 2025

`ManifestStore.to_zarr()` #698

TomNicholas
Jul 18, 2025
Maintainer

Replies: 1 comment 1 reply

ilan-gold
Sep 13, 2025