What is your issue?
I am testing the performance of downloading dap4 responses from an opendap server. In dap4 there are two types of HTTP responses: dmr (metadata) and dap (binary data with metadata). With opendap, one can request multiple variables at once, and in the general case the request for the data (dap) response looks like:
url = "https://<data_url>.dap?dap4.ce=/Var_name1[_slice_];/Var_name2[_slice_];.../Var_nameM[_slice_]" # or replace .dap for .nc4, but dap format can be streamed over http
session.get(url, stream=True)
where {Var_name{i} | i = 1, ..., M} are all the variables in a file, and [_slice_] is their respective subsetting slice (irrelevant for this example). (FYI, dap2 also supports something like this, but I will focus on dap4.)
pydap supports this behavior, meaning it can create such URLs, and one can certainly construct such downloads with curl. However, when I download data with xarray + pydap, each variable is downloaded separately (via its own dap response). This is true even when initially creating the xarray dataset.
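To make the multi-variable request concrete, here is a minimal sketch of how such a URL could be assembled by hand. `build_dap4_url` is a hypothetical helper (it is not pydap's API), and the percent-encoding mirrors what appears in the cached URLs of the example in this issue; whether a given server wants `;` and `/` left unencoded is an assumption.

```python
from urllib.parse import quote


def build_dap4_url(base_url, slices):
    """Combine several variables into one dap4 constraint expression.

    `slices` maps a variable name to its slice string, e.g. "0:1:179".
    Hypothetical helper for illustration only.
    """
    # /Var_name1[slice];/Var_name2[slice];...
    ce = ";".join(f"/{name}[{sl}]" for name, sl in slices.items())
    # Encode brackets and colons; keep "/" and ";" readable (assumption).
    return f"{base_url}.dap?dap4.ce={quote(ce, safe='/;')}"


url = build_dap4_url(
    "http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc",
    {"COADSX": "0:1:179", "COADSY": "0:1:89", "TIME": "0:1:11"},
)
print(url)
```

A single `session.get(url, stream=True)` on a URL like this would fetch all three arrays in one dap response, instead of one request per variable.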
minimal example
import xarray as xr
import pydap
import requests_cache
print(xr.__version__)
>> 2025.7.1 # also seen with newer versions
print(pydap.__version__)
>> 3.5.5
# for debugging
session = requests_cache.CachedSession()
dap4url = "dap4://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc"
ds = xr.open_dataset(dap4url, engine='pydap', session=session, decode_times=False)
session.cache.urls()
>>> ['http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dap?dap4.ce=COADSX%5B0%3A1%3A179%5D',
'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dap?dap4.ce=COADSY%5B0%3A1%3A89%5D',
'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dap?dap4.ce=TIME%5B0%3A1%3A0%5D',
'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dap?dap4.ce=TIME%5B0%3A1%3A11%5D',
'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dap?dap4.ce=TIME%5B11%3A1%3A11%5D',
'http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc.dmr']
In this scenario, each of the dimensions COADSX and COADSY was downloaded once, and the TIME array was downloaded three times: the first element, the last element, and the entire array. xarray does this to decode time variables (see this explanation) when their length is greater than 1.
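The per-variable tally can be checked directly from the cached URLs. This is a minimal sketch using only the standard library; `count_requests` is a hypothetical helper, and the URL list is copied from the output above.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

base = "http://test.opendap.org/opendap/hyrax/data/nc/coads_climatology.nc"
urls = [
    base + ".dap?dap4.ce=COADSX%5B0%3A1%3A179%5D",
    base + ".dap?dap4.ce=COADSY%5B0%3A1%3A89%5D",
    base + ".dap?dap4.ce=TIME%5B0%3A1%3A0%5D",
    base + ".dap?dap4.ce=TIME%5B0%3A1%3A11%5D",
    base + ".dap?dap4.ce=TIME%5B11%3A1%3A11%5D",
    base + ".dmr",
]


def count_requests(urls):
    """Count dmr requests and dap requests per variable name."""
    counts = Counter()
    for u in urls:
        parts = urlsplit(u)
        if parts.path.endswith(".dmr"):
            counts["dmr"] += 1
            continue
        # parse_qs percent-decodes the value, e.g. "TIME[0:1:11]"
        ce = parse_qs(parts.query)["dap4.ce"][0]
        counts[ce.split("[")[0].lstrip("/")] += 1
    return counts


counts = count_requests(urls)
print(counts)
```

Running the same helper on `session.cache.urls()` after opening a dataset gives a quick view of how many separate requests each variable triggered.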
With this behavior, downloading data across N remote files behind an opendap server with xarray leads to the following download count:
Total downloads = N (dmrs / metadata) + N * M (daps/binary data) + 2*N (when time dimension length > 1)
where N = number of remote files, and M = number of variables / dims / coords with data to download (some dimensions are only named, and these are not downloadable).
Overall, this is not an issue for a few remote files (N ~ O(1), M ~ O(1)). But it becomes sub-performant when N*M ~ O(10) or greater (so very easily!) and each file is behind authentication, as data typically is.
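As a quick arithmetic check of the formula, here is a small sketch. `total_downloads` is a hypothetical helper written for this issue, and the flag name `time_len_gt_1` is an assumption used only to switch the 2*N term on or off.

```python
def total_downloads(n_files, n_vars, time_len_gt_1=True):
    """Total requests = N (dmr) + N*M (dap) + 2*N (extra time requests)."""
    dmr = n_files
    dap = n_files * n_vars
    extra_time = 2 * n_files if time_len_gt_1 else 0
    return dmr + dap + extra_time


# Single file, three coordinate arrays fetched on open (COADSX, COADSY, TIME)
single = total_downloads(1, 3)
# Ten such files
ten = total_downloads(10, 3)
print(single, ten)
```

With N = 1 and M = 3 this gives 6, which matches the six cached URLs in the minimal example; with N = 10 it already grows to 60 requests, compared with roughly 10 if the total were O(N).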
solution
An ideal solution is one with as little extra logic as possible on the xarray backend. The goal would be to make the total download:
Total downloads ~ O(N)
I have a possible solution worked out and will open a PR soon.