Replies: 2 comments 1 reply
-
Hey there @iskandari 👋 We've definitely run into dask OOM issues before. Do you have a public Zarr store you're working with that we could use to start building an MRE? A few thoughts off the top of my head.
-
@norlandrhagen thanks for your reply! Here's the code I'm running:
```python
import pandas as pd
import xarray as xr
from dask.distributed import Client
from ndpyramid import pyramid_reproject

# input_path and LEVELS are defined earlier in the script

# Start the client first so the distributed scheduler is used for the write
client = Client(n_workers=4, threads_per_worker=2, memory_limit="16GB")
client.run_on_scheduler(
    lambda dask_scheduler: setattr(dask_scheduler, "max_graph_size", 32e9)
)

ds1_all = []
ds2_all = []
months = list(range(1, 49))  # four years of monthly data

# Load the twelve WorldClim mean-temperature rasters, cycled over 48 months
for i in months:
    print(i, i % 12)
    path = f"{input_path}/wc2.1_2.5m_tavg_{(12 if i % 12 == 0 else i % 12):02g}.tif"
    ds = (
        xr.open_dataarray(path, engine="rasterio")
        .to_dataset(name="climate")
        .squeeze()
        .reset_coords(["band"], drop=True)
    )
    ds1_all.append(ds)
ds1 = xr.concat(ds1_all, pd.Index(months, name="month"))

# Same for the precipitation rasters
for i in months:
    path = f"{input_path}/wc2.1_2.5m_prec_{(12 if i % 12 == 0 else i % 12):02g}.tif"
    ds = (
        xr.open_dataarray(path, engine="rasterio")
        .to_dataset(name="climate")
        .squeeze()
        .reset_coords(["band"], drop=True)
    )
    ds2_all.append(ds)
ds2 = xr.concat(ds2_all, pd.Index(months, name="month"))

# Give precipitation the same fill value as temperature; note this indexes
# .values, so both full arrays are pulled into memory here
ds2["climate"].values[
    ds2["climate"].values == ds2["climate"].values[0, 0, 0]
] = ds1["climate"].values[0, 0, 0]

# Stack the two variables along a new "band" dimension
ds = xr.concat([ds1, ds2], pd.Index(["tavg", "prec"], name="band"))

# Create the pyramid and write it out
dt = pyramid_reproject(
    ds,
    levels=LEVELS,
    extra_dim="band",
    other_chunks={"band": 2, "month": 48},
    clear_attrs=True,
)
dt.ds.attrs  # inspect the root attrs
dt.to_zarr("test/", consolidated=True, mode="w")
```

Any idea why resampling with …
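One thing I notice writing this out: the fill-value replacement goes through `.values`, which forces both full arrays into memory before the pyramid is even built. A lazier equivalent (assuming position [0, 0, 0] really holds the nodata value in both datasets) would be something like:

```python
import xarray as xr

# Lazy version of the fill-value swap above; assumes index [0, 0, 0]
# holds the nodata value in both datasets
fill_prec = ds2["climate"].isel(month=0, y=0, x=0)
fill_tavg = ds1["climate"].isel(month=0, y=0, x=0)
ds2["climate"] = xr.where(ds2["climate"] == fill_prec, fill_tavg, ds2["climate"])
```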
-
I'd like to start by applauding the good work from @carbonplan in helping others get started with the Zarr specification and data prep. The docs and notebooks have been relatively straightforward and easy to follow. Thank you for that! 🚀
This is a general question about optimizing pyramid generation. Most of the datasets our organization processes are 4D, with variables like species occurrence, abundance, and count on a daily or weekly time step, on a global 3x3 km raster grid.
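For concreteness, here's a synthetic stand-in for one of these datasets (the shapes, chunking, and variable names are illustrative, not our exact setup):

```python
import dask.array as da
import xarray as xr

# Synthetic stand-in for one of our 4D datasets; shapes are illustrative.
# A global ~3 km grid is on the order of 13000 x 6500 cells, so a single
# float32 variable over 52 weeks and 3 bands is already ~50 GB per level.
ds = xr.Dataset(
    {
        "species": (
            ("band", "week", "y", "x"),
            da.zeros((3, 52, 6500, 13000), chunks=(1, 1, 3250, 3250), dtype="float32"),
        )
    },
    coords={
        "band": ["occurrence", "abundance", "count"],
        "week": list(range(1, 53)),
    },
)
```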
Attempts to scale up by adding variables along the `band` dimension, extending the temporal dimension (e.g. `week`, `day`), or going beyond 6/7 levels always end in dask crashing, preceded by a flood of worker memory warnings.

I've tried to solve this with different configurations of the dask client (more memory per worker, larger array chunk sizes), re-chunking the DataTree before writing to Zarr, and using delayed objects with `persist()`, but none of these seem to work. I find myself stuck writing a bunch of smaller 3D arrays, which then have to be combined into a 4D structure by a script that manually edits the .zattrs and .zmetadata files.

I have also posted about this in more detail on the zarr-python forum (not sure if there is a rule against cross-posting in GitHub discussions). Are there any good resources or workflows for handling these large graphs without sending them to dask all at once? Would batch processing be the right way to go, e.g. something like the sketch below? Any advice on how to scale `pyramid_reproject()` would be awesome.
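The kind of batching I have in mind would look something like this (untested sketch; it assumes `dt` is the DataTree returned by `pyramid_reproject`, with one dataset per level and a `month` dimension):

```python
import zarr

# Untested sketch: write the pyramid in small batches instead of handing
# dask the whole graph at once. Assumes `dt` is the DataTree returned by
# pyramid_reproject, with one dataset per level under dt.children.

# 1. Lay out every array in the store without computing any data
for level, node in dt.children.items():
    node.ds.to_zarr("test/", group=level, mode="a", compute=False)

# 2. Fill the store one level and one month at a time, so each to_zarr
#    call builds a graph for only a single 3D slab
for level, node in dt.children.items():
    for m in range(node.ds.sizes["month"]):
        # Variables without a "month" dimension were already written in
        # step 1 and are not allowed in a region write, so drop them here
        static = [v for v in node.ds.variables if "month" not in node.ds[v].dims]
        batch = node.ds.isel(month=slice(m, m + 1)).drop_vars(static)
        batch.to_zarr("test/", group=level, mode="r+", region={"month": slice(m, m + 1)})

# 3. Consolidate the metadata once at the end
zarr.consolidate_metadata("test/")
```

The tradeoff is many more small `to_zarr` calls, but each one carries a graph dask can actually hold in memory.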