[pydap backend] enables downloading/processing multiple arrays within single http request #10629
Conversation
… all together in single dap url
…ed at once (per group)
hmm - the test I see failing (sporadically) concerns the following assertion, where the groups have reverse ordering in the way the dimensions show up.
Thanks @Mikejmnez !
```python
if self.array.id in self._cache.keys():
    # safely avoid re-downloading some coordinates
    result = self._cache[self.array.id]
```
Xarray should already avoid downloading 1D coordinates multiple times, because coordinates are saved in memory as NumPy arrays and pandas indexes. If this is not the case, please file a bug report to discuss :)
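For illustration, a minimal sketch of the behavior described above, assuming a hypothetical OPeNDAP URL and a `time` dimension coordinate: xarray loads 1D dimension coordinates into in-memory pandas indexes when the dataset is opened, so repeated access does not go back to the backend array.

```python
import xarray as xr

# Hypothetical DAP4 URL used only for illustration.
ds = xr.open_dataset("dap4://example.org/path/to/dataset", engine="pydap")

# Both accesses below read from the in-memory pandas index built at open
# time, rather than issuing new requests to the server.
first = ds.indexes["time"]
second = ds["time"].values
```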
Ah, I see this was discussed earlier, and seems to be the same issue as #10560.
I would prefer a more general solution rather than something specific to pydap, which will be harder to maintain.
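As one illustration of what a backend-agnostic approach could look like (a sketch only, not xarray's API nor the plan in #10560): a wrapper that memoizes indexing results so repeated reads of the same selection are not re-downloaded, regardless of which backend produced the array.

```python
import numpy as np

class CachingArrayWrapper:
    """Illustrative sketch: memoize indexing results for any lazy backend array."""

    def __init__(self, array):
        self._array = array
        self._cache: dict = {}

    def __getitem__(self, key):
        cache_key = str(key)  # naive key normalization, fine for a sketch
        if cache_key not in self._cache:
            self._cache[cache_key] = np.asarray(self._array[key])
        return self._cache[cache_key]
```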
```python
try:
    result = np.asarray(result.data)
except AttributeError:
    result = np.asarray(result)
```
This is worth some explanation. Did a change in pydap break `np.asarray(result)`, or is there some reason why it is not preferred?
```python
dataset = self.array.dataset
resolve_batch_for_all_variables(self.array, key, checksums=self._checksums)
result = np.asarray(
    dataset._current_batch_promise.wait_for_result(self.array.id)
)
```
Is it possible to avoid private APIs here?
```python
warnings.warn(
    f"`batch={batch}` is currently only compatible with the `DAP4` "
    "protocol. Make sure the OPeNDAP server implements the `DAP4` "
    "protocol and then replace the scheme of the url with `dap4` "
    "to make use of it. Setting `batch=False`.",
    stacklevel=2,
)
```
Generally, if a user explicitly specifies an invalid argument, the preferred pattern is to raise an exception, rather than warning and ignoring what the user asked for.
Likely pydap already does this? Generally, we re-raise errors from Xarray only when we can add Xarray-specific details that are helpful to users.
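A minimal sketch of the pattern being suggested, raising instead of silently falling back when the user explicitly asked for batching on a non-DAP4 URL. The helper name and the exact check are assumptions for illustration, not the PR's code.

```python
def _validate_batch(batch: bool, protocol: str) -> None:
    """Illustrative only: reject an explicitly requested but unsupported option."""
    if batch and protocol != "dap4":
        raise ValueError(
            f"`batch={batch}` is only supported with the DAP4 protocol; "
            "use a `dap4://` URL or pass `batch=False`."
        )
```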
```python
args = {"dataset": dataset}
args["checksums"] = checksums
```
Please specify `dataset` and `checksums` with the same syntax.
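For illustration, one way "the same syntax" could look here; the placeholder values are assumptions, not the PR's actual code:

```python
dataset = ...        # the pydap dataset object from the surrounding code
checksums = True     # the checksums flag from the surrounding code

# Both keys specified with the same dict-literal syntax:
args = {"dataset": dataset, "checksums": checksums}
```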
```diff
@@ -103,6 +134,8 @@ def open(
         timeout=None,
         verify=None,
         user_charset=None,
+        batch=False,
```
Would it make sense to have the default be `batch=None`, which means "use batching if possible"? This would expose these benefits to more users.
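A sketch of the tri-state default being suggested, where `None` means "batch when the protocol allows it"; the function and parameter names are assumptions, not the PR's code.

```python
def resolve_batch(batch: bool | None, protocol: str) -> bool:
    """Illustrative only: batch=None enables batching opportunistically."""
    if batch is None:
        return protocol == "dap4"  # use batching if possible
    if batch and protocol != "dap4":
        raise ValueError("batch=True requires the DAP4 protocol")
    return batch
```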
This is a little concerning! Not sure how this could be a bug on the Xarray side, unless we're using the wrong API for getting variable dimensions from Pydap.
I'm seeing the same error over here. Not quite sure what to make of this, but it seems to be a separate bug.
Thanks @shoyer ! I am participating in a hackathon all week, but I will try to check and address your comments as fast as I can :)
```diff
         return Frozen(attrs)

     def get_dimensions(self):
-        return Frozen(self.ds.dimensions)
+        return Frozen(sorted(self.ds.dimensions))
```
To potentially address the issues with dimensions in DataTree, and the lat/lon dimensions being inconsistently ordered, I added this `sorted` to the dimensions list that the backend gets from the Pydap dataset directly. Hopefully this little fix will make the issue go away, but I will continue checking it locally after merging main into this PR (it has not failed once yet! knocks on wood).
This only affects dataset-level dimensions, not variable-level dimensions.
At the dataset level, dimension order doesn't really matter, so I doubt this is going to fix the issue, unfortunately.
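A small self-contained example of the distinction being drawn here: dataset-level dimensions are just a name-to-size mapping whose ordering carries no meaning, while each variable's `.dims` is an ordered tuple that determines the data layout.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {"temp": (("lat", "lon"), np.zeros((2, 3)))},
    coords={"lat": [0.0, 1.0], "lon": [0.0, 1.0, 2.0]},
)

# Dataset-level dimensions: a mapping of name -> size, ordering not meaningful.
print(dict(ds.sizes))      # {'lat': 2, 'lon': 3}

# Variable-level dimensions: an ordered tuple, which is what indexing relies on.
print(ds["temp"].dims)     # ('lat', 'lon')
```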
whats-new.rst
With this PR, the following is true:

And so the dimension arrays are always batched (downloaded) together in the same response in DAP4.

In addition, and to preserve the previous behavior, I added a backend argument `batch=True | False`. When `batch=True`, all non-dimension arrays can be downloaded in the same response (ideal when streaming data to store locally). When `batch=False`, which is the default, each non-dimension array is downloaded with its own http request, as before; this is ideal in many data-exploration scenarios. With `batch=False` (the default), the last step (`ds.load()`) triggers individual downloads.

These changes allow a more performant download experience with xarray + pydap. However, most of these changes depend on a yet-to-be-released version of pydap (pydap `3.5.6`). I want to check that things go smoothly here before making a new release, i.e. perhaps I will need to make a change to the backend base code.

`3.5.6` has been released!
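A usage sketch of the proposed behavior, assuming a hypothetical DAP4 URL; `batch` is the backend argument introduced by this PR (requiring pydap >= 3.5.6), not an option in released xarray.

```python
import xarray as xr

# Hypothetical DAP4 URL used only for illustration.
url = "dap4://example.org/path/to/dataset"

# batch=True: non-dimension arrays are fetched together in a single response,
# which suits bulk downloads, e.g. streaming data to local storage.
ds = xr.open_dataset(url, engine="pydap", batch=True)
ds.load()

# batch=False (the current default): each non-dimension array is fetched with
# its own http request when accessed, which suits interactive exploration.
ds = xr.open_dataset(url, engine="pydap", batch=False)
```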