Tips for processing larger and higher-level DataTrees #150

iskandari · 2024-11-06T22:45:33Z

iskandari
Nov 6, 2024

I'd like to start by applauding the good work from @carbonplan in helping others get started with the Zarr specification and data prep. The docs and notebooks have been relatively straightforward and easy to follow. Thank you for that! 🚀

This is a general question about optimizing pyramid generation. Most of the datasets that our organization processes are 4D, with variables like species occurrence, abundance, count, etc. on a daily or weekly basis within a 3x3 km resolution raster grid on a global scale.

Attempts to scale up by adding variables along the band dimension, temporal dimension (e.g. week, day), or levels beyond 6/7 always result in crashing dask, preceded by warnings that look like this:

 UserWarning: Sending large graph of size 3.25 GiB.
This may cause some slowdown.
Consider loading the data with Dask directly
 or using futures or delayed objects to embed the data into the graph without repetition.

I've tried to solve this with different configurations of the dask client like adding more memory per worker, setting higher array chunk sizes, re-chunking the DataTree before writing to zarr, using delayed objects with persist(), but none of these seem to work and I find myself stuck writing a bunch of smaller 3D arrays, which then need to be combined with a script manually into a 4D structure by editing the .zattrs and .zmetadata files.

I have also posted about this in more detail in the zarr-python forum (not sure if there is a rule against cross-posting in GitHub discussions). Are there any good resources or workflows on how to handle these large graphs better without sending them all at once to dask? Would batch processing be the right way to go? Any advice on how to scale pyramid_reproject() would be awesome

norlandrhagen · 2024-11-07T00:36:36Z

norlandrhagen
Nov 7, 2024
Maintainer

Hey there @iskandari 👋

We've definitely run into dask OOM issues before. Do you have a public Zarr store you're working with that we could try to start building a MRE with?

A few thoughts off the top of the head.

Would pyramid_resample work for your usecase?
At what worker memory were you OOM'ing at? It might be worth a non-distributed single worker with a bunch of memory.
@maxrjones wrote up a really detailed document on profiling different resampling methods. There might be some insights learned there that could help us improving the memory usage of pyramid_reproject.

0 replies

iskandari · 2024-11-07T03:11:54Z

iskandari
Nov 7, 2024
Author

@norlandrhagen thanks for your reply

I think applying pyramid_resample by itself could work with some extra steps, I hadn't though of that. For our use case though, pyramid_reproject is the natural choice because we are starting with rasters of varying CRS so the built-in gdal warping with rasterio is very convenient.
I can reproduce this OOM directly with the carbonplan 4D zarr from geotiffs example by artificially scaling the month dimension from 12 to 48 months:

ds1_all = []
ds2_all = []
months = list(map(lambda d: d + 1, range(48)))
for i in months:
    print(i, i%12)
    path = f"{input_path}/wc2.1_2.5m_tavg_{(12 if i%12 == 0 else i%12):02g}.tif"
    ds = (
        xr.open_dataarray(path, engine="rasterio")
        .to_dataset(name="climate")
        .squeeze()
        .reset_coords(["band"], drop=True)
    )
    ds1_all.append(ds)
ds1 = xr.concat(ds1_all, pd.Index(months, name="month"))
for i in months:
    path = f"{input_path}/wc2.1_2.5m_prec_{(12 if i%12 == 0 else i%12):02g}.tif"
    ds = (
        xr.open_dataarray(path, engine="rasterio")
        .to_dataset(name="climate")
        .squeeze()
        .reset_coords(["band"], drop=True)
    )
    ds2_all.append(ds)


ds2 = xr.concat(ds2_all, pd.Index(months, name="month"))
ds2["climate"].values[ds2["climate"].values == ds2["climate"].values[0, 0, 0]] = ds1[
    "climate"
].values[0, 0, 0]
ds = xr.concat([ds1, ds2], pd.Index(["tavg", "prec"], name="band"))

# create the pyramid
dt = pyramid_reproject(
    ds, levels=LEVELS, extra_dim="band", other_chunks={'band': 2, 'month': 48}, clear_attrs=True
)
dt.ds.attrs
dt.to_zarr('test/', consolidated=True, mode="w")

I have tried the following dask configs: 4 workers / 16 GB limit, also 2 workers/ 32 GB limit and a single worker / 64 GB limit but the same error persists. Also tried increasing the scheduler's max graph size but that has no effect.

from dask import config

client = Client(n_workers=4, threads_per_worker=2, memory_limit='16GB')
client.run_on_scheduler(lambda dask_scheduler: setattr(dask_scheduler, 'max_graph_size', 32e9))

Any idea why resampling with rasterio explodes graph size? Do you think the answer is to run compute at some intermediary steps as opposed to collecting a large graph which evaluates lazily only at the end when a level is ready to be computed ?

1 reply

norlandrhagen Nov 7, 2024
Maintainer

That makes sense! I can try to take the example + your modification and try to reproduce.

Do you think the answer is to run compute at some intermediary steps as opposed to collecting a large graph which evaluates lazily only at the end when a level is ready to be computed ?

I think this could be a great debugging tool! It would be good to know if the OOM issues were happening before writing (I suspect this is the case). A bit ago we added methods to resample/reproject single levels, maybe there is a pipeline that can make use of this and reproject, write, reproject write etc. Then update the datatree level attrs at the end.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Tips for processing larger and higher-level DataTrees #150

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Uh oh!

Tips for processing larger and higher-level DataTrees #150

Uh oh!

Uh oh!

iskandari Nov 6, 2024

Replies: 2 comments · 1 reply

Uh oh!

norlandrhagen Nov 7, 2024 Maintainer

Uh oh!

Uh oh!

iskandari Nov 7, 2024 Author

Uh oh!

norlandrhagen Nov 7, 2024 Maintainer

iskandari
Nov 6, 2024

Replies: 2 comments 1 reply

norlandrhagen
Nov 7, 2024
Maintainer

iskandari
Nov 7, 2024
Author

norlandrhagen Nov 7, 2024
Maintainer