
Conversation


@Mikejmnez Mikejmnez commented Aug 12, 2025

With this PR, the following is true:

import xarray as xr
from requests_cache import CachedSession
session = CachedSession(cache_name='debug')
session.cache.clear()

dap4urls = ["dap4://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc", 
            "dap4://test.opendap.org/opendap/hyrax/data/nc/coads_climatology2.nc"]

ds = xr.open_mfdataset(dap4urls, engine='pydap', session=session, concat_dim='TIME', parallel=True, combine='nested', decode_times=False)

session.cache.urls()
>>>['http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dap?dap4.ce=COADSX%5B0%3A1%3A179%5D%3BCOADSY%5B0%3A1%3A89%5D%3BTIME%5B0%3A1%3A11%5D&dap4.checksum=true',
 'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dmr',
 'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology2.nc.dap?dap4.ce=COADSX%5B0%3A1%3A179%5D%3BCOADSY%5B0%3A1%3A89%5D%3BTIME%5B0%3A1%3A11%5D&dap4.checksum=true',
 'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology2.nc.dmr']

All dimension arrays are thus always downloaded (batched) together in the same DAP4 response.

In addition, and to preserve the previous behavior, I added a backend argument batch=True | False. When batch=True, all non-dimension arrays are downloaded in a single response (ideal when streaming data to store locally).
When batch=False, which is the default, each non-dimension array is downloaded with its own HTTP request, as before. This is ideal in many data-exploration scenarios.

cache_session=CachedSession(cache_name='debug')

ds = xr.open_mfdataset(dap4urls, engine='pydap', session=cache_session, parallel=True, combine='nested', concat_dim="TIME", decode_times=False, batch=True)

len(cache_session.cache.urls())
>>> 4  # one .dmr and one .dap request per file (2 files)

# triggers all non-dimension data to be downloaded in a single http request
ds.load()

len(cache_session.cache.urls())
>>> 6  # the previous 4, plus one extra request per file

When batch=False (the default), the last step (ds.load()) triggers individual downloads, one per variable.
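The request accounting described above can be sketched as follows (a hypothetical helper for illustration, not part of pydap's API):

```python
# Hypothetical sketch (not pydap's API): count the HTTP requests needed
# to load n non-dimension variables under each batch mode.
def requests_needed(n_variables: int, batch: bool) -> int:
    # batch=True: one .dap response carries all non-dimension arrays;
    # batch=False (default): one request per variable, as before.
    return 1 if batch else n_variables
```

This matches the cache counts above: with two files and batch=True, ds.load() adds exactly one extra request per file.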

These changes allow a more performant download experience with xarray + pydap. However, most of these changes depend on a yet-to-be-released version of pydap (3.5.6). I want to check that things go smoothly here before making a new release, i.e. perhaps I will need to make a change to the backend base code. Update: pydap 3.5.6 has been released!

@github-actions github-actions bot added topic-backends CI Continuous Integration tools dependencies Pull requests that update a dependency file io labels Aug 12, 2025
@Mikejmnez Mikejmnez changed the title Pydap4 scale [pydap backend] enables downloading/processing multiple arrays within single http request Aug 12, 2025
@Mikejmnez Mikejmnez marked this pull request as ready for review August 13, 2025 07:11
@Mikejmnez
Contributor Author

Mikejmnez commented Aug 13, 2025

hmm - the test I see that fails (sporadically) concerns the following assertion:

Differing data variables:
L   group_1_var  (lon, lat) float64 16B ...
R   group_1_var  (lat, lon) float64 16B ...

where the groups have reversed ordering in the way dimensions show up ((lat, lon) vs (lon, lat)). Not sure if this is a pydap/PydapDataStore issue. I am now applying sorted in the get_dimensions method of PydapDataStore. The local tests ran fine (so nothing broke), but this failing test did not show up in my own testing...

@shoyer shoyer left a comment (Member)

Thanks @Mikejmnez !

Comment on lines +59 to +61
if self.array.id in self._cache.keys():
# safely avoid re-downloading some coordinates
result = self._cache[self.array.id]
shoyer (Member):

Xarray should already avoid downloading 1D coordinates multiple times, because coordinates are saved in memory as NumPy arrays and pandas indexes. If this is not the case, please file a bug report to discuss :)
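A minimal sketch of the idea (hypothetical CoordCache class; xarray's real mechanism holds coordinates in memory as NumPy arrays and pandas indexes, as described above):

```python
# Hypothetical sketch: once a 1-D coordinate has been fetched, later
# lookups are served from memory and never hit the remote store again.
class CoordCache:
    def __init__(self, fetch):
        self._fetch = fetch   # callable that hits the remote store
        self._mem = {}        # name -> in-memory array
        self.fetch_count = 0  # number of actual downloads

    def get(self, name):
        if name not in self._mem:
            self.fetch_count += 1
            self._mem[name] = self._fetch(name)
        return self._mem[name]


cache = CoordCache(lambda name: list(range(3)))
cache.get("time")
cache.get("time")
cache.get("time")
```

After three lookups of the same coordinate, only the first triggers a download.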

shoyer (Member):

Ah, I see this was discussed earlier, and seems to be the same issue as #10560.

I would prefer a more general solution rather than something specific to pydap, which will be harder to maintain.

Comment on lines +73 to +76
try:
result = np.asarray(result.data)
except AttributeError:
result = np.asarray(result)
shoyer (Member):

This is worth some explanation. Did a change in pydap break np.asarray(result) or is there some reason why it is not preferred?

dataset = self.array.dataset
resolve_batch_for_all_variables(self.array, key, checksums=self._checksums)
result = np.asarray(
dataset._current_batch_promise.wait_for_result(self.array.id)
shoyer (Member):

Is it possible to avoid private APIs here?

Comment on lines +179 to +185
warnings.warn(
f"`batch={batch}` is currently only compatible with the `DAP4` "
"protocol. Make sue the OPeNDAP server implements the `DAP4` "
"protocol and then replace the scheme of the url with `dap4` "
"to make use of it. Setting `batch=False`.",
stacklevel=2,
)
shoyer (Member):

Generally, if a user explicitly specifies an invalid argument, the preferred pattern is to raise an exception, rather than warning and ignoring what the user asked for.

Likely pydap already does this? Generally, we re-raise errors from Xarray only when we can add Xarray-specific details that are helpful to users.
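As a sketch, the fail-fast pattern could look like this (hypothetical validate_batch helper; the message mirrors the warning above):

```python
# Hypothetical sketch: raise on an invalid explicit argument instead of
# warning and silently overriding what the user asked for.
def validate_batch(batch: bool, scheme: str) -> bool:
    if batch and scheme != "dap4":
        raise ValueError(
            f"batch={batch!r} is only compatible with the DAP4 protocol; "
            "make sure the OPeNDAP server implements DAP4 and use a "
            "'dap4://' URL scheme."
        )
    return batch
```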

Comment on lines 169 to +170
args = {"dataset": dataset}
args["checksums"] = checksums
shoyer (Member):

Please specify dataset and checksums with the same syntax
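i.e., something like the following (stand-in values for illustration):

```python
dataset, checksums = object(), True  # stand-ins for the real values

# Build the mapping in a single literal instead of mixing literal
# syntax with a separate item assignment.
args = {"dataset": dataset, "checksums": checksums}
```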

@@ -103,6 +134,8 @@ def open(
timeout=None,
verify=None,
user_charset=None,
batch=False,
shoyer (Member):

Would it make sense to have the default be batch=None, which means "use batching if possible"? This would expose these benefits to more users.
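A sketch of the tri-state default being proposed (hypothetical resolve_batch helper; names are illustrative):

```python
# Hypothetical sketch: batch=None means "use batching when the protocol
# supports it"; an explicit batch=True without DAP4 still fails fast.
def resolve_batch(batch, scheme: str) -> bool:
    if batch is None:
        return scheme == "dap4"  # opt in automatically where possible
    if batch and scheme != "dap4":
        raise ValueError("batch=True requires the DAP4 protocol")
    return bool(batch)
```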

@shoyer
Member

shoyer commented Aug 18, 2025

hmm - the test I see that fails (sporadically) concerns the following assertion:

Differing data variables:
L   group_1_var  (lon, lat) float64 16B ...
R   group_1_var  (lat, lon) float64 16B ...

where the groups have reverse ordering in the way dimensions show up ((lat,lon) vs (lon,lat)). Not sure if this is a pydap/PydapDataStore issue. I am imposing sorted into the get_dimensions method of the PydapDataStore. The local test ran fine (so nothing broke), but again this failing test did not show up on my testing...

This is a little concerning! Not sure how this could be a bug on the Xarray side, unless we're using the wrong API for getting variable dimensions from Pydap.

@shoyer
Member

shoyer commented Aug 18, 2025

I'm seeing the same error over here:
#10649

Not quite sure what to make of this, but seems to be a separate bug.

@Mikejmnez
Contributor Author

Mikejmnez commented Aug 18, 2025

Thanks @shoyer ! I am participating all week in a hackathon, but I will try to check and address your comments as fast as I can :)

return Frozen(attrs)

def get_dimensions(self):
-    return Frozen(self.ds.dimensions)
+    return Frozen(sorted(self.ds.dimensions))
Mikejmnez (Contributor Author):

To potentially address the issue with dimensions in DataTree, with the lat/lon dimensions being inconsistently ordered, I added this sorted to the dimensions list that the backend gets directly from the pydap dataset. Hopefully this small fix makes it go away, but I will keep checking this issue locally after merging main into this PR (it has not failed once yet, knocks on wood).

shoyer (Member):

These are only the dataset-level dimensions, not the variable-level dimensions.

At the dataset level, dimension order doesn't really matter, so I doubt this is going to fix the issue, unfortunately.
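A small illustration of the distinction (toy values, not the failing test's data): sorting the dataset-level dimension names leaves the per-variable dimension order untouched, and the latter is what the failing assertion compares.

```python
# Toy values: dataset-level sizes vs. the order stored on a variable.
dataset_dims = {"lon": 180, "lat": 90}   # dataset-level mapping
variable_dims = ("lon", "lat")           # per-variable dimension order

sorted_names = tuple(sorted(dataset_dims))  # what sorted() reorders
# variable_dims is unaffected by sorting the dataset-level names.
```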

Successfully merging this pull request may close these issues.

make pydap backend more opendap-like by downloading multiple variables in same http request
2 participants