Support zipped zarr #769

@aulemahal

Description

Is your feature request related to a problem? Please describe.
We, at Ouranos, have an internal database of many datasets in netCDF and in Zarr. The database is (partially) duplicated between our internal server (HPC-like but not quite) and an externally shared HPC. This other machine has a strict quota on inodes (10M over 2PB, which means an average file size of 215 Mo).

Conventional Zarr directories are therefore a no-go, because they consist of numerous very small files. A while ago, we decided to zip all of them. The impact on reading speed is insignificant compared to the other productivity gains (not needing netCDF4, for example).

However, zarr-python 3 has dropped transparent support for zipped stores: one can't do xr.open_zarr('path/to/dataset.zarr.zip') anymore. One has to create a zarr.storage.ZipStore explicitly and pass that to xarray.

Zarr's upcoming "URL pipeline" (ZEP 8) would fix this in a way, but it has been "upcoming" for a while now. And the addition of kerchunk to intake-esm's dependencies has implicitly pinned zarr to >=3.

Describe the solution you'd like
Somewhere in source.py::_open_dataset, I think we could detect that the path is a zipped Zarr store and act accordingly, for example when the URL ends with .zip and format == 'zarr'.

Describe alternatives you've considered
Not updating intake-esm or zarr. But that's not a long-term option.

Additional context
As a side question, if any other data managers are reading this: have you had this inode issue before? How did you solve it?

I'd rather fix this in intake-esm or zarr before converting our full database back to netCDF or another format.
