Limit inode consumption of output data #394

@ankitpatnala

Description

Is your feature request related to a problem? Please describe.

For IMERG data, the grid used is a finer-resolution reduced Gaussian grid with many more grid points than ERA5. Right now, target_times, which repeats for all grid points, is concatenated across the grid. Operating on the resulting array becomes difficult, e.g. when we generate a zarr file containing inference samples for one month at 6-hour timestamps.
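To make the scale concrete, here is a back-of-the-envelope calculation. All numbers are assumptions for illustration (including the CHUNK_N_SAMPLES placeholder), not actual IMERG figures:

```python
# Illustrative only: every value here is an assumption, including
# CHUNK_N_SAMPLES, which stands in for the real utils.io default.
CHUNK_N_SAMPLES = 10_000

n_grid_points = 3_000_000  # assumed order of magnitude for a fine reduced Gaussian grid
n_timestamps = 31 * 4      # one month at 6-hour steps
n_samples = n_grid_points * n_timestamps  # target_times repeated for every grid point

# Each chunk is written as (at least) one file on disk, i.e. one inode.
n_chunks = -(-n_samples // CHUNK_N_SAMPLES)  # ceiling division
print(f"{n_samples:,} samples -> {n_chunks:,} chunk files")
# 372,000,000 samples -> 37,200 chunk files, per array and per month
```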

Describe the solution you'd like

Currently, the chunk size can be changed via the module-level variable utils.io.CHUNK_N_SAMPLES. To cope better with streams as described above, some logic should be implemented to keep the number of chunks reasonable. Two possible solutions (sketched in code after this list) could be:

A) Define the number of chunks as a constant instead of the size of a chunk => scale the chunk size per stream accordingly
B) Add an optional parameter chunk_size to stream config files: if present, it is passed to the relevant method (utils.io.ZarrIO._write_arrays); otherwise utils.io.CHUNK_N_SAMPLES is used
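A minimal sketch of how the two options could fit together. Apart from utils.io.CHUNK_N_SAMPLES and ZarrIO._write_arrays, every name and value below (N_CHUNKS, resolve_chunk_size, the defaults) is hypothetical:

```python
# Sketch only; N_CHUNKS, resolve_chunk_size, and all values are hypothetical.
import math

N_CHUNKS = 256            # option A: fixed chunk count per stream (illustrative value)
CHUNK_N_SAMPLES = 10_000  # placeholder for the existing utils.io default

def resolve_chunk_size(n_samples: int, stream_chunk_size: int | None = None) -> int:
    """Pick the chunk size that _write_arrays would use for one stream."""
    # Option B: an explicit chunk_size from the stream config takes precedence.
    if stream_chunk_size is not None:
        return stream_chunk_size
    # Option A: scale the chunk size so the stream is split into at most
    # N_CHUNKS chunks, never dropping below the current default.
    return max(CHUNK_N_SAMPLES, math.ceil(n_samples / N_CHUNKS))
```

With the numbers from the example above, a 372M-sample stream would get a chunk size of about 1.45M samples and produce 256 chunk files instead of 37,200.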

Describe alternatives you've considered

Using Zarr 3 is currently not possible due to an incompatibility with anemoi-dataset. However, once this is fixed, Zarr 3 should be used in addition to the ability to specify a per-stream chunk size.
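Zarr 3 is relevant here because its sharding support packs many chunks into a single file, which directly reduces inode consumption. A minimal sketch against the zarr-python >= 3 API; the store path, shapes, and values are illustrative:

```python
# Sketch assuming zarr-python >= 3; all shapes and values are illustrative.
import zarr

n_samples = 372_000_000  # see the calculation above

arr = zarr.create_array(
    store="inference_samples.zarr",
    shape=(n_samples,),
    chunks=(10_000,),      # read/write granularity stays small
    shards=(10_000_000,),  # 1,000 chunks per shard => ~1,000x fewer files/inodes
    dtype="float32",
)
```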

Additional context

Option A) would add less complexity to the code/configuration and is easier to implement. Option B) would provide more control to the user. As this is ultimately more of a temporary fix to limit inode consumption until the capabilities of Zarr 3 can be used, I would argue option A) should be the preferred solution.

Organisation

JSC

Metadata


Labels

evaluation (anything related to the model evaluation pipeline)
