Description
Is your feature request related to a problem? Please describe.
For IMERG data, the grid is a finer-resolution reduced Gaussian grid with many more grid points than ERA5. Right now, the target_times array, which repeats for all grid points, is concatenated.
It gets difficult to operate on such a large array, e.g. when we generate a zarr file containing inference samples for one month at 6-hour timestamps.
Describe the solution you'd like
Currently, the chunk size can be changed via the module-level variable utils.io.CHUNK_N_SAMPLES. To cope better with streams such as the one described above, some logic should be implemented to keep the number of chunks reasonable. Two possible solutions:
A) Define the number of chunks as a constant instead of the size of a chunk, and scale the chunk size per stream accordingly.
B) Add an optional parameter chunk_size to the stream config files: if present, it is passed to the relevant method (utils.io.ZarrIO._write_arrays); otherwise utils.io.CHUNK_N_SAMPLES is used.
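For illustration, option A) could look roughly like the sketch below. The constant `N_CHUNKS` and the helper `chunk_size_for_stream` are hypothetical names, not the actual utils.io API; the point is only that the chunk size is derived from a fixed target chunk count, so dense streams (IMERG-like) and coarse streams (ERA5-like) end up with a similar number of chunks.

```python
# Sketch of option A: keep the number of chunks roughly constant and
# derive the per-stream chunk size from it. Names are illustrative,
# not the actual utils.io API.
import math

N_CHUNKS = 64  # fixed target number of chunks (would replace CHUNK_N_SAMPLES)

def chunk_size_for_stream(n_samples: int, n_chunks: int = N_CHUNKS) -> int:
    """Scale the chunk size so every stream yields ~n_chunks chunks."""
    return max(1, math.ceil(n_samples / n_chunks))

# A coarse stream and a dense stream get proportionally different chunk
# sizes but a similar chunk count (and hence similar inode consumption).
print(chunk_size_for_stream(10_000))      # -> 157
print(chunk_size_for_stream(5_000_000))   # -> 78125
```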
Describe alternatives you've considered
Using zarr 3 is currently not possible due to an incompatibility with anemoi-dataset. However, once this is fixed, zarr 3 should be used in addition to the ability to specify a per-stream chunk size.
Additional context
Option A) would add less complexity to the code/configuration and is easier to implement; option B) would give the user more control. As this is ultimately a temporary fix to limit inode consumption until the capabilities of zarr 3 can be used, I would argue option A) should be the preferred solution.
Organisation
JSC