ManifestStore.to_zarr()
#698
Replies: 1 comment 1 reply
I'm super sorry I didn't continue the conversation here - it's super rude of me to have forgotten about this! My memory is bad and I had just come back from vacation excited to get back to things :) I think the first suggestion would definitely be the way to go for this package, but I have some broader wishes/concerns for this "copy" functionality, so I'll try to continue a bit of what I was saying in the zulip chat:

### Rechunking/new compression

Since hdf5 is by default uncompressed and, in my experience, uses much smaller chunk sizes than zarr does/would (or none at all), I would certainly want an API to handle these, for example:

```python
import h5py, numpy as np, zarr
in_memory = np.arange(3000 * 3000).reshape((3000, 3000))

with h5py.File("foo_default.h5", mode="w") as f:
    f_no_chunks = f.create_dataset("default", data=in_memory)  # so no options comes out unchunked and uncompressed
    assert f_no_chunks.chunks is None

with h5py.File("foo_default_chunks.h5", mode="w") as f:
    f_chunks = f.create_dataset("auto_chunked", data=in_memory, chunks=True)  # auto chunking comes out quite small
    print("default h5 chunking", f_chunks.chunks)

with h5py.File("foo_default_compressed_chunks.h5", mode="w") as f:
    f_chunks_compressed = f.create_dataset("auto_chunked_compressed", data=in_memory, compression="gzip")  # compressed comes out quite small as well, same as auto without compression
    print("default compressed h5 chunking", f_chunks_compressed.chunks)

z = zarr.create_array("foo.zarr", data=in_memory)
print("zarr chunks", z.chunks)  # bigger than hdf5 in both cases
```

For me, the chunk size is the same for compressed and uncompressed hdf5, and in any case much smaller than the zarr defaults. I would love to be able to provide our users with a […]
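For concreteness, here is a minimal sketch of the kind of control I mean when copying hdf5 to zarr - the destination chunk shape and compressor below are arbitrary choices made by the user, not anything VirtualiZarr exposes today:

```python
import h5py
import zarr
from zarr.codecs import BloscCodec

# Sketch only: copy the array from the hdf5 file written above, but let the user
# pick the destination chunking/compression instead of inheriting hdf5's defaults.
with h5py.File("foo_default.h5", mode="r") as f:
    src = f["default"]
    dest = zarr.create_array(
        "foo_copied.zarr",
        shape=src.shape,
        dtype=src.dtype,
        chunks=(1000, 1000),  # user-chosen, not hdf5's tiny (or absent) chunks
        compressors=[BloscCodec(cname="zstd")],  # user-chosen compression
    )
    dest[...] = src[...]  # a real implementation would copy block-wise (see the blocked_copy idea below)
```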
### Non-supported compressors

I remember you mentioning xarray tracks compressors, so presumably there is some handling somewhere for this sort of thing, but both zarr and hdf5 have extensible compressor ecosystems, yeah? So I wonder about handling some of those - I'm not sure how widespread they are, but definitely something that comes to mind in terms of hindering a byte-for-byte transfer.

### Non-numeric data types

My data type knowledge is minimal, so apologies if I'm off-base here as well. Here's an adapted example from our anndata codebase (the link is in the snippet here). I am not really sure how to translate the hdf5 string we have here to zarr, but my attempt (also based on what we do for strings in zarr) seems to fail to produce a bytes-for-bytes equivalent:

```python
import h5py
import numpy as np
import zarr

arr = np.array(["a", "b", "c"], dtype=np.dtypes.StringDType())  # object also has the same behavior
# See https://github.com/scverse/anndata/blob/1ba19458fd68483a9b12b3f29cea83a0daa29962/src/anndata/_io/specs/methods.py#L574-L588 for where this is applied. Every anndata store has this because a string index is always stored.
with h5py.File("string.h5", mode="w") as f:
    str_dtype = h5py.special_dtype(vlen=str)
    f.create_dataset("a_string", data=arr.astype(str_dtype), dtype=str_dtype)
    h5_offset = f["a_string"].id.get_offset()
    h5_nbytes = f["a_string"].nbytes

with open("string.h5", mode="rb") as f:
    f.seek(h5_offset)
    h5_bytes = f.read(h5_nbytes)

z = zarr.create_array("string.zarr", shape=(3,), dtype=zarr.core.dtype.VariableLengthUTF8())
z[...] = arr

with open("string.zarr/c/0", mode="rb") as f:
    zarr_bytes = f.read()

print("zarr bytes", zarr_bytes)
print("h5 bytes", h5_bytes)
assert zarr_bytes != h5_bytes
```

### Towards an Implementation

Maybe these three things are not really valid concerns and/or are unrealistic hopes, but they come to mind as shortcomings of only having a byte-for-byte approach, or of treating a byte-for-byte approach as guaranteed to generate a valid zarr store. If there were a way to guarantee a valid zarr store as the destination in the byte-for-byte approach (mainly something addressing the last two points above), I would definitely be up for implementing this (although I think there are potentially other pitfalls besides those above, like https://docs.h5py.org/en/stable/special.html#enumerated-types, which I am not sure has a zarr equivalent). It sounds like […] I'll try to flesh out what I'm going for a bit more:

```python
from collections.abc import Callable

import zarr
from zarr import Array, Group


def blocked_copy(src: Array, dest: Array, **blocked_copy_settings) -> None:
    """Here copies from one array to another happen in blocks, according to user settings that control how much data is held in memory at any given time."""
    ...


def copy_zarr(src: Group, dest: Group, callback: Callable[[str, Group, Group, Callable[[str, Group, Group], None]], None] | None = None) -> None:
    """Top-level API that copies from src to dest, traversing the source group, with an optional callback to customize the creation of the destination array at a given string key."""
    ...


# `manifest_store`, `zarr_config_settings`, and the various *_settings below are placeholders
z_src = zarr.open_group(manifest_store, mode="r", zarr_format=3)
with zarr.config.set(zarr_config_settings):
    z_dest = zarr.open_group(store="my_new_store.zarr", mode="w", zarr_format=3)


def callback(key: str, src: Group, dest: Group, byte_for_byte_copy: Callable[[str, Group, Group], None]) -> None:
    if isinstance(src[key], Array):
        if key == "a_key_with_special_settings":
            dest_array = dest.create_array(key, **compression_chunking_sharding_settings_not_like_src)
            src_array = src[key]
            blocked_copy(src_array, dest_array, **some_blocked_copy_settings)
        else:
            byte_for_byte_copy(key, src, dest)  # if we could guarantee somehow that the destination array is valid at the destination
    else:
        dest_group = dest.create_group(key)
        # do some attrs updates
        ...


copy_zarr(z_src, z_dest, callback)
```
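To make the blocked-copy idea concrete, here is a minimal sketch of what `blocked_copy` could do - nothing more than iterating over block-sized regions so that only one block's worth of data is in memory at a time; defaulting the block shape to the destination's chunk shape is just one possible choice:

```python
import itertools

import zarr


def blocked_copy_sketch(src: zarr.Array, dest: zarr.Array, block_shape: tuple[int, ...] | None = None) -> None:
    """Copy src into dest one block at a time to bound peak memory use."""
    assert src.shape == dest.shape
    block_shape = block_shape or dest.chunks
    # Walk the grid of block corners and copy each region with plain slicing.
    ranges = [range(0, size, step) for size, step in zip(src.shape, block_shape)]
    for corner in itertools.product(*ranges):
        selection = tuple(
            slice(start, min(start + step, size))
            for start, step, size in zip(corner, block_shape, src.shape)
        )
        dest[selection] = src[selection]
```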
The […]. These functions could also be used for rechunking existing zarr, or for handling metadata format updates to nested stores (thinking of ome-zarr) where the actual bytes are not touched. Even further, a […]

### Conclusion

I think a byte-for-byte mover would be ideal for the use-case where people want to do exactly that, but I'd have concerns about using it without guarantees on the validity of the destination. If these could be allayed, or you could point me to how this could be guaranteed, I'd love to implement it here. I also think an API for traversing a […]

### Addendum/Update

I realized that creating an in-memory representation of an on-disk […] To this end, I tried opening the hdf5 store with a string using […]:

```
File ~/Projects/Theis/VirtualiZarr/virtualizarr/parsers/hdf/hdf.py:77, in _construct_manifest_array(filepath, dataset, group)
73 attrs["_FillValue"] = encoded_cf_fill_value
75 codec_configs = [zarr_codec_config_to_v3(codec.get_config()) for codec in codecs]
---> 77 fill_value = dataset.fillvalue.item()
78 dims = tuple(_dataset_dims(dataset, group=group))
79 metadata = create_v3_array_metadata(
80 shape=dataset.shape,
81 data_type=dtype,
(...)
86 attributes=attrs,
87 )
AttributeError: 'bytes' object has no attribute 'item'
```

and then when I added a workaround for this, it crashed when trying to generate the metadata, which made more sense ...

```
File ~/Projects/Theis/VirtualiZarr/virtualizarr/parsers/hdf/hdf.py:81, in _construct_manifest_array(filepath, dataset, group)
79 fill_value = fill_value.item()
80 dims = tuple(_dataset_dims(dataset, group=group))
---> 81 metadata = create_v3_array_metadata(
82 shape=dataset.shape,
83 data_type=dtype,
84 chunk_shape=chunks,
85 fill_value=fill_value,
86 codecs=codec_configs,
87 dimension_names=dims,
88 attributes=attrs,
89 )
90 manifest = _dataset_chunk_manifest(filepath, dataset)
91 return ManifestArray(metadata=metadata, chunkmanifest=manifest)
File ~/Projects/Theis/VirtualiZarr/virtualizarr/manifests/utils.py:79, in create_v3_array_metadata(shape, data_type, chunk_shape, chunk_key_encoding, fill_value, codecs, attributes, dimension_names)
41 def create_v3_array_metadata(
42 shape: tuple[int, ...],
43 data_type: np.dtype,
(...)
49 dimension_names: Iterable[str] | None = None,
50 ) -> ArrayV3Metadata:
51 """
52 Create an ArrayV3Metadata instance with standard configuration.
53 This function encapsulates common patterns used across different parsers.
(...)
77 A configured ArrayV3Metadata instance with standard defaults
78 """
---> 79 zdtype = parse_data_type(data_type, zarr_format=3)
80 return ArrayV3Metadata(
81 shape=shape,
82 data_type=zdtype,
(...)
95 storage_transformers=None,
96 )
File ~/Projects/Theis/VirtualiZarr/venv/lib/python3.12/site-packages/zarr/core/dtype/__init__.py:225, in parse_data_type(dtype_spec, zarr_format)
187 def parse_data_type(
188 dtype_spec: ZDTypeLike,
189 *,
190 zarr_format: ZarrFormat,
191 ) -> ZDType[TBaseDType, TBaseScalar]:
192 """
193 Interpret the input as a ZDType.
194
(...)
223 DateTime64(endianness='little', scale_factor=10, unit='s')
224 """
--> 225 return parse_dtype(dtype_spec, zarr_format=zarr_format)
File ~/Projects/Theis/VirtualiZarr/venv/lib/python3.12/site-packages/zarr/core/dtype/__init__.py:278, in parse_dtype(dtype_spec, zarr_format)
275 return VariableLengthUTF8() # type: ignore[return-value]
276 # otherwise, we have either a numpy dtype string, or a zarr v3 dtype string, and in either case
277 # we can create a native dtype from it, and do the dtype inference from that
--> 278 return get_data_type_from_native_dtype(dtype_spec)
File ~/Projects/Theis/VirtualiZarr/venv/lib/python3.12/site-packages/zarr/core/dtype/__init__.py:174, in get_data_type_from_native_dtype(dtype)
172 else:
173 na_dtype = dtype
--> 174 return data_type_registry.match_dtype(dtype=na_dtype)
File ~/Projects/Theis/VirtualiZarr/venv/lib/python3.12/site-packages/zarr/core/dtype/registry.py:161, in DataTypeRegistry.match_dtype(self, dtype)
151 if dtype == np.dtype("O"):
152 msg = (
153 f"Zarr data type resolution from {dtype} failed. "
154 'Attempted to resolve a zarr data type from a numpy "Object" data type, which is '
(...)
159 "data type, see https://github.com/zarr-developers/zarr-python/issues/3117"
160 )
--> 161 raise ValueError(msg)
162 matched: list[ZDType[TBaseDType, TBaseScalar]] = []
163 for val in self.contents.values():
ValueError: Zarr data type resolution from object failed. Attempted to resolve a zarr data type from a numpy "Object" data type, which is ambiguous, as multiple zarr data types can be represented by the numpy "Object" data type. In this case you should construct your array by providing a specific Zarr data type. For a list of Zarr data types that are compatible with the numpy "Object" data type, see https://github.com/zarr-developers/zarr-python/issues/3117
```

But then even if one did provide the right "interpretation" for the object data type in zarr-speak, that wouldn't mean that the destination bytes under that data type are the same as the source bytes, I think. I could make a parser that just lies and says "this h5 on-disk data is of the zarr vlen data type, and so when you get it in memory it will be a numpy string type", but on disk it is really however hdf5 stores it (which may or may not be the same as how zarr would represent that type on-disk). In this case, I don't know how to represent the metadata for the on-disk hdf5 bytes as a zarr dtype that would be byte-for-byte the same as the h5 bytes. And for pure reading use-cases, I would imagine it doesn't matter.
@ilangold asked if there was any pre-existing tool for converting HDF5 files directly into Zarr stores, and it occurred to me that you could do this very efficiently using `ManifestStore`:

- […] a `Parser` on the filepath for the hdf5/other file to determine all chunk locations and metadata,
- […]
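A rough sketch of what the byte-for-byte part of this could look like with zarr-python's store API - this is only an illustration of the idea, not an existing VirtualiZarr method, and `manifest_store`/`dest_store` are placeholders:

```python
import asyncio

from zarr.core.buffer import default_buffer_prototype


async def copy_store_bytes(src_store, dest_store) -> None:
    """Copy every key (metadata and chunk bytes) from src_store to dest_store without decoding anything."""
    prototype = default_buffer_prototype()
    async for key in src_store.list():
        buf = await src_store.get(key, prototype=prototype)
        if buf is not None:
            await dest_store.set(key, buf)


# asyncio.run(copy_store_bytes(manifest_store, dest_store))
```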
### Pros and cons compared to using xarray

Of course you could alternatively use VirtualiZarr like a runtime translation layer and then use xarray's `.to_zarr()` like this:
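(The inline example that was originally here appears to have been lost; below is a hedged sketch of that route, assuming the `ManifestStore` can be opened through xarray's zarr backend and that `manifest_store` was built by a parser as above.)

```python
import xarray as xr

# Open the virtual store through xarray's zarr engine, then write it back out as a real Zarr store.
ds = xr.open_dataset(manifest_store, engine="zarr", consolidated=False, zarr_format=3)
ds.to_zarr("copied.zarr", mode="w", zarr_format=3)
```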
but that has a couple of significant downsides:

- `open_dataset`/zarr does other complicated decoding steps that you also don't need here,
- […] `coordinates` metadata field,

The main downsides of this suggestion are that:

- […]
Given those restrictions, I'm not sure this idea really offers anything beyond your existing two options of […] or `ManifestStore.to_icechunk()` (see "Add `.to_icechunk()` method to `ManifestGroup`", #591). So I'm not sure this is that useful, but it's interesting! 😆