
Conversation


@Mikejmnez Mikejmnez commented Aug 12, 2025

With this PR, the following is true:

import xarray as xr
from requests_cache import CachedSession
session = CachedSession(cache_name='debug')
session.cache.clear()

dap4urls = ["dap4://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc", 
            "dap4://test.opendap.org/opendap/hyrax/data/nc/coads_climatology2.nc"]

ds = xr.open_mfdataset(dap4urls, engine='pydap', session=session, concat_dim='TIME', parallel=True, combine='nested', decode_times=False)

session.cache.urls()
>>>['http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dap?dap4.ce=COADSX%5B0%3A1%3A179%5D%3BCOADSY%5B0%3A1%3A89%5D%3BTIME%5B0%3A1%3A11%5D&dap4.checksum=true',
 'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dmr',
 'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology2.nc.dap?dap4.ce=COADSX%5B0%3A1%3A179%5D%3BCOADSY%5B0%3A1%3A89%5D%3BTIME%5B0%3A1%3A11%5D&dap4.checksum=true',
 'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology2.nc.dmr']

All dimension arrays are thus always downloaded (batched) together in the same DAP4 response.

In addition, and to preserve the previous behavior, I added a backend argument batch=True | False. When batch=True, all non-dimension arrays are downloaded in a single response (ideal when streaming data to store locally).
When batch=False, which is the default, each non-dimension array is downloaded with its own HTTP request, as before. This is ideal in many data-exploration scenarios.

cache_session=CachedSession(cache_name='debug')

ds = xr.open_mfdataset(dap4urls, engine='pydap', session=cache_session, parallel=True, combine='nested', concat_dim="TIME", decode_times=False, batch=True)

len(cache_session.cache.urls())
>>> 4  # one .dmr and one .dap request per file (2 files)

# triggers all non-dimension data to be downloaded in a single http request
ds.load()

len(cache_session.cache.urls())
>>> 6  # the previous 4, plus one extra request per file

When batch=False (the default), the last step (ds.load()) triggers individual downloads, one per variable.
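The request accounting described above can be sketched as follows (a hypothetical helper for illustration, not part of pydap's API):

```python
# Hypothetical sketch (not pydap's API): count the HTTP requests needed
# to load n non-dimension variables under each batch mode.
def requests_needed(n_variables: int, batch: bool) -> int:
    # batch=True: one .dap response carries all non-dimension arrays;
    # batch=False (default): one request per variable, as before.
    return 1 if batch else n_variables
```

This matches the cache counts above: with two files and batch=True, ds.load() adds exactly one extra request per file.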

These changes allow a more performant download experience with xarray + pydap. However, most of these changes depend on a yet-to-be-released version of pydap (3.5.6). I want to check that things go smoothly here before making a new release, i.e. perhaps I will need to make a change to the backend base code. Update: pydap 3.5.6 has been released!

@github-actions github-actions bot added topic-backends CI Continuous Integration tools dependencies Pull requests that update a dependency file io labels Aug 12, 2025
@Mikejmnez Mikejmnez changed the title Pydap4 scale [pydap backend] enables downloading/processing multiple arrays within single http request Aug 12, 2025
@Mikejmnez Mikejmnez marked this pull request as ready for review August 13, 2025 07:11
@Mikejmnez
Contributor Author

Mikejmnez commented Aug 13, 2025

hmm - the test I see that fails (sporadically) concerns the following assertion:

Differing data variables:
L   group_1_var  (lon, lat) float64 16B ...
R   group_1_var  (lat, lon) float64 16B ...

where the groups have reversed ordering in the way dimensions show up ((lat, lon) vs (lon, lat)). Not sure if this is a pydap/PydapDataStore issue. I am now applying sorted in the get_dimensions method of PydapDataStore. The local tests ran fine (so nothing broke), but this failing test did not show up in my own testing...

@shoyer shoyer left a comment (Member)

Thanks @Mikejmnez !

Comment on lines +59 to +61
if self.array.id in self._cache.keys():
# safely avoid re-downloading some coordinates
result = self._cache[self.array.id]
shoyer (Member):

Xarray should already avoid downloading 1D coordinates multiple times, because coordinates are saved in memory as NumPy arrays and pandas indexes. If this is not the case, please file a bug report to discuss :)
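A minimal sketch of the idea (hypothetical CoordCache class; xarray's real mechanism holds coordinates in memory as NumPy arrays and pandas indexes, as described above):

```python
# Hypothetical sketch: once a 1-D coordinate has been fetched, later
# lookups are served from memory and never hit the remote store again.
class CoordCache:
    def __init__(self, fetch):
        self._fetch = fetch   # callable that hits the remote store
        self._mem = {}        # name -> in-memory array
        self.fetch_count = 0  # number of actual downloads

    def get(self, name):
        if name not in self._mem:
            self.fetch_count += 1
            self._mem[name] = self._fetch(name)
        return self._mem[name]


cache = CoordCache(lambda name: list(range(3)))
cache.get("time")
cache.get("time")
cache.get("time")
```

After three lookups of the same coordinate, only the first triggers a download.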

shoyer (Member):

Ah, I see this was discussed earlier, and seems to be the same issue as #10560.

I would prefer a more general solution rather than something specific to pydap, which will be harder to maintain.

Comment on lines +73 to +76
try:
result = np.asarray(result.data)
except AttributeError:
result = np.asarray(result)
shoyer (Member):

This is worth some explanation. Did a change in pydap break np.asarray(result) or is there some reason why it is not preferred?

dataset = self.array.dataset
resolve_batch_for_all_variables(self.array, key, checksums=self._checksums)
result = np.asarray(
dataset._current_batch_promise.wait_for_result(self.array.id)
shoyer (Member):

Is it possible to avoid private APIs here?

Comment on lines +179 to +185
warnings.warn(
f"`batch={batch}` is currently only compatible with the `DAP4` "
"protocol. Make sue the OPeNDAP server implements the `DAP4` "
"protocol and then replace the scheme of the url with `dap4` "
"to make use of it. Setting `batch=False`.",
stacklevel=2,
)
shoyer (Member):

Generally, if a user explicitly specifies an invalid argument, the preferred pattern is to raise an exception, rather than warning and ignoring what the user asked for.

Likely pydap already does this? Generally, we re-raise errors from Xarray only when we can add Xarray-specific details that are helpful to users.
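As a sketch, the fail-fast pattern could look like this (hypothetical validate_batch helper; the message mirrors the warning above):

```python
# Hypothetical sketch: raise on an invalid explicit argument instead of
# warning and silently overriding what the user asked for.
def validate_batch(batch: bool, scheme: str) -> bool:
    if batch and scheme != "dap4":
        raise ValueError(
            f"batch={batch!r} is only compatible with the DAP4 protocol; "
            "make sure the OPeNDAP server implements DAP4 and use a "
            "'dap4://' URL scheme."
        )
    return batch
```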

Comment on lines 169 to +170
args = {"dataset": dataset}
args["checksums"] = checksums
shoyer (Member):

Please specify dataset and checksums with the same syntax
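i.e., something like the following (stand-in values for illustration):

```python
dataset, checksums = object(), True  # stand-ins for the real values

# Build the mapping in a single literal instead of mixing literal
# syntax with a separate item assignment.
args = {"dataset": dataset, "checksums": checksums}
```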

@@ -103,6 +134,8 @@ def open(
timeout=None,
verify=None,
user_charset=None,
batch=False,
shoyer (Member):

Would it make sense to have the default be batch=None, which means "use batching if possible"? This would expose these benefits to more users.
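A sketch of the tri-state default being proposed (hypothetical resolve_batch helper; names are illustrative):

```python
# Hypothetical sketch: batch=None means "use batching when the protocol
# supports it"; an explicit batch=True without DAP4 still fails fast.
def resolve_batch(batch, scheme: str) -> bool:
    if batch is None:
        return scheme == "dap4"  # opt in automatically where possible
    if batch and scheme != "dap4":
        raise ValueError("batch=True requires the DAP4 protocol")
    return bool(batch)
```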

@shoyer
Member

shoyer commented Aug 18, 2025

hmm - the test I see that fails (sporadically) concerns the following assertion:

Differing data variables:
L   group_1_var  (lon, lat) float64 16B ...
R   group_1_var  (lat, lon) float64 16B ...

where the groups have reverse ordering in the way dimensions show up ((lat,lon) vs (lon,lat)). Not sure if this is a pydap/PydapDataStore issue. I am imposing sorted into the get_dimensions method of the PydapDataStore. The local test ran fine (so nothing broke), but again this failing test did not show up on my testing...

This is a little concerning! Not sure how this could be a bug on the Xarray side, unless we're using the wrong API for getting variable dimensions from Pydap.

@shoyer
Member

shoyer commented Aug 18, 2025

I'm seeing the same error over here:
#10649

Not quite sure what to make of this, but seems to be a separate bug.

@Mikejmnez
Contributor Author

Mikejmnez commented Aug 18, 2025

Thanks @shoyer ! I am participating all week in a hackathon, but I will try to check and address your comments as fast as I can :)

return Frozen(attrs)

def get_dimensions(self):
-    return Frozen(self.ds.dimensions)
+    return Frozen(sorted(self.ds.dimensions))
Mikejmnez (Contributor Author):

To potentially address the issue with dimensions in DataTree, with the lat/lon dimensions being inconsistently ordered, I added this sorted to the dimensions list that the backend gets directly from the pydap dataset. Hopefully this small fix makes it go away, but I will keep checking this issue locally after merging main into this PR (it has not failed once yet, knocks on wood).

shoyer (Member):

These are only the dataset-level dimensions, not the variable-level dimensions.

At the dataset level, dimension order doesn't really matter, so I doubt this is going to fix the issue, unfortunately.
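A small illustration of the distinction (toy values, not the failing test's data): sorting the dataset-level dimension names leaves the per-variable dimension order untouched, and the latter is what the failing assertion compares.

```python
# Toy values: dataset-level sizes vs. the order stored on a variable.
dataset_dims = {"lon": 180, "lat": 90}   # dataset-level mapping
variable_dims = ("lon", "lat")           # per-variable dimension order

sorted_names = tuple(sorted(dataset_dims))  # what sorted() reorders
# variable_dims is unaffected by sorting the dataset-level names.
```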

Successfully merging this pull request may close these issues.

make pydap backend more opendap-like by downloading multiple variables in same http request
2 participants