
make pydap backend more opendap-like by downloading multiple variables in same http request #10628

@Mikejmnez

What is your issue?

I am testing the performance of downloading dap4 responses from an opendap server. In dap4 there are two types of http responses: dmr (metadata) and dap (binary data with metadata). With opendap, one can request multiple variables in a single data request, and in the general case the data (dap) request looks like:

url = "https://<data_url>.dap?dap4.ce=/Var_name1[_slice_];/Var_name2[_slice_];.../Var_nameM[_slice_]" # or replace .dap for .nc4, but dap format can be streamed over http
session.get(url, stream=True)

where {Var_name{i} | i = 1, ..., M} are all the variables in a file, and [_slice_] is each variable's subsetting slice (irrelevant for this example). (FYI, dap2 also supports something like this, but I will focus on dap4.)
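For illustration, a URL of this shape can be assembled by hand. The sketch below is not pydap's API: `build_dap_url` is a hypothetical helper, and it assumes each slice is given as a `(start, step, stop)` tuple and that the server accepts `;` and `/` unencoded in the constraint expression.

```python
from urllib.parse import quote


def build_dap_url(base_url, slices):
    """Join per-variable constraints into one dap4.ce query (hypothetical helper)."""
    # One "/Name[start:step:stop]" term per variable, separated by ";".
    ce = ";".join(
        f"/{name}[{start}:{step}:{stop}]"
        for name, (start, step, stop) in slices.items()
    )
    # Percent-encode brackets and colons; leaving "/" and ";" as-is is an assumption.
    return f"{base_url}.dap?dap4.ce={quote(ce, safe='/;')}"


url = build_dap_url(
    "http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc",
    {"COADSX": (0, 1, 179), "COADSY": (0, 1, 89)},
)
print(url)
```

The resulting URL could then be passed to `session.get(url, stream=True)` to fetch both variables in one request.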

pydap supports this behavior, meaning it can create such urls, and one can certainly construct such downloads with curl. However, when I download data with xarray + pydap, each variable is downloaded separately (via its own dap response). This is true even when initially creating the xarray dataset.

minimal example

import xarray as xr
import pydap
import requests_cache

print(xr.__version__)  # 2025.7.1, although newer versions too
print(pydap.__version__)  # 3.5.5


# for debugging
session = requests_cache.CachedSession()

dap4url = "dap4://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc"

ds = xr.open_dataset(dap4url, engine='pydap', session=session, decode_times=False)

session.cache.urls()
# returns:
# ['http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dap?dap4.ce=COADSX%5B0%3A1%3A179%5D',
#  'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dap?dap4.ce=COADSY%5B0%3A1%3A89%5D',
#  'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dap?dap4.ce=TIME%5B0%3A1%3A0%5D',
#  'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dap?dap4.ce=TIME%5B0%3A1%3A11%5D',
#  'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dap?dap4.ce=TIME%5B11%3A1%3A11%5D',
#  'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dmr']

In this scenario, each of the dimensions COADSX and COADSY was downloaded once, and the TIME array was downloaded 3 times: the first element, the last element, and the entire array. xarray does this to decode time variables (see this explanation) when their length is greater than 1.
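To make the per-variable tally explicit, the cached URLs can be grouped by variable name. This is only a sketch: `count_requests` is a hypothetical helper, and it assumes each dap URL requests a single variable, as in the list above.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

base = "http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc"
urls = [
    base + ".dap?dap4.ce=COADSX%5B0%3A1%3A179%5D",
    base + ".dap?dap4.ce=COADSY%5B0%3A1%3A89%5D",
    base + ".dap?dap4.ce=TIME%5B0%3A1%3A0%5D",
    base + ".dap?dap4.ce=TIME%5B0%3A1%3A11%5D",
    base + ".dap?dap4.ce=TIME%5B11%3A1%3A11%5D",
    base + ".dmr",
]


def count_requests(urls):
    """Count requests per variable; metadata requests are filed under 'dmr'."""
    counts = Counter()
    for url in urls:
        ce = parse_qs(urlsplit(url).query).get("dap4.ce")
        if ce is None:
            counts["dmr"] += 1  # no constraint expression: metadata request
        else:
            # parse_qs already percent-decodes, so the value reads "NAME[a:b:c]".
            counts[ce[0].split("[")[0].lstrip("/")] += 1
    return counts


counts = count_requests(urls)
print(dict(counts))
```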

With this behavior, downloading data across N remote files behind an opendap server with xarray leads to the following number of downloads:

Total downloads = N (dmrs / metadata) + N * M (daps/binary data) + 2*N (when time dimension length > 1)

where N is the number of remote files, and M is the number of variables / dims / coords with data to download (some dimensions are only named, and these are not downloadable)
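As a quick check of the formula, here is a small sketch that evaluates it. The function name and arguments are made up for illustration, and it assumes every file either has or lacks a time dimension longer than 1.

```python
def total_downloads(n_files, n_vars, time_longer_than_one=True):
    """Evaluate N + N*M + 2*N (the last term only when time length > 1)."""
    dmr_requests = n_files                 # one metadata request per file
    dap_requests = n_files * n_vars        # one data request per variable per file
    time_requests = 2 * n_files if time_longer_than_one else 0
    return dmr_requests + dap_requests + time_requests


# The single-file example above: M = 3 (COADSX, COADSY, TIME).
print(total_downloads(1, 3))     # 6, matching the six cached URLs
print(total_downloads(100, 10))  # 1300
```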

Overall, this is not an issue for a few remote files (N~O(1), M~O(1)). But it becomes slow when N*M~O(10) or greater (which happens very easily) and each file is behind authentication, as data typically is.

solution

An ideal solution is one with as little extra logic as possible on the xarray backend. The goal would be to bring the total number of downloads down to:

Total downloads ~  O(N)  

I have a possible solution worked out and will open a PR soon.
