Description
Is your feature request related to a problem? Please describe.
We, at Ouranos, have an internal database of many datasets in netCDF and in Zarr. The database is (partially) duplicated between our internal server (HPC-like, but not quite) and an externally shared HPC. That other machine has a strict inode quota (10M inodes over 2 PB, which works out to an average file size of 215 MB).
Thus, conventional Zarr folders are a no-go because they consist of numerous very small files. A while ago we decided to zip all of them; the impact on reading speed is insignificant compared to the other productivity gains (not needing netCDF4, for example).
However, Zarr 3 has dropped transparent support for zipped directories: one can't do xr.open_zarr('path/to/dataset.zarr.zip') anymore. One has to create a zarr.storage.ZipStore and pass that to xarray.
Zarr's upcoming "url pipeline" (ZEP 8) fixes this in a way, but it has been "upcoming" for a while now. And the addition of kerchunk to intake-esm's dependencies has implicitly pinned zarr to >3.
Describe the solution you'd like
Somewhere in source.py::_open_dataset, I think we could detect that the path is a zipped Zarr store and act accordingly, for example when the URL ends with .zip and format == 'zarr'.
Describe alternatives you've considered
Not updating intake-esm or zarr. But that's not a long-term solution.
Additional context
As a side question, if any other data managers are reading this: have you had this inode issue before? How did you solve it?
I'd rather fix this in intake-esm or zarr than convert our full database back to netCDF or another format.