Support for DataTree.to_netcdf to write to a file-like object or bytes #10571
Conversation
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
Amazing, thanks @mjwillson !
`xarray/backends/api.py` (Outdated)

```python
finally:
    if not multifile and compute:  # type: ignore[redundant-expr]
        store.close()
```

```python
if path_or_file is None:
    value = target.getvalue()
    close_bytesio()
```
This clean-up step is likely optional, because BytesIO objects will just get garbage collected:
https://www.reddit.com/r/learnpython/comments/10ermcl/does_bytesio_need_to_be_closed/
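To illustrate the point about garbage collection, here is a minimal stdlib-only sketch (not xarray code) showing that a `BytesIO` buffer can be read and then simply dropped without an explicit `close()`:

```python
import io

# A BytesIO can be read and then discarded; CPython frees the
# underlying buffer on garbage collection, so close() is optional.
buf = io.BytesIO()
buf.write(b"netCDF payload")
data = buf.getvalue()  # copies the buffer; works without closing first
del buf                # buffer is reclaimed without an explicit close
```

`getvalue()` returns an independent copy, so the result stays valid after the `BytesIO` is gone.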
Yes, this shouldn't be needed to free resources, but without closing it, tests show some new and ominous-looking warnings about an error in scipy's `netcdf_file.close` that's triggered within a destructor (either `netcdf_file.__del__` or `CachingFileManager.__del__`, not sure) and hence doesn't bubble up.

LMK if you'd like a comment about it.
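The "doesn't bubble up" behavior can be demonstrated with a stdlib-only stand-in (the `FlakyFile` class below is hypothetical, not scipy's actual class): an exception raised inside `__del__` is reported to stderr as "Exception ignored in ..." but never propagates to the caller.

```python
import gc


class FlakyFile:
    """Hypothetical stand-in for a file object whose close() fails in __del__."""

    def __del__(self):
        # Raising here only prints "Exception ignored in ..." to stderr;
        # the error never propagates to the surrounding code.
        raise OSError("error while closing")


f = FlakyFile()
del f
gc.collect()
survived = True  # execution continues despite the failed destructor
```

This is why the warnings only surface as noise in test output rather than as test failures.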
…e objects.

* Allows use of the h5netcdf engine when writing to file-like objects (such as `BytesIO`); stops forcing use of the scipy backend in this case (which is incompatible with groups and `DataTree`). Makes h5netcdf the default engine for `DataTree.to_netcdf` rather than leaving the choice of default up to `Dataset.to_netcdf`.
* Allows use of the h5netcdf engine to read from a `bytes` object.
* Allows `DataTree.to_netcdf` to return bytes when the filepath argument is omitted (similar to `Dataset.to_netcdf`).
…re bytes were being returned before the h5py.File had been closed, which it appears is needed for it to finish writing a valid file. This required a further workaround to prevent the BytesIO being closed by the scipy backend when it is used in a similar way.
for more information, see https://pre-commit.ci
I also updated the h5netcdf backend to silence warnings from not closing files that were already open (which are issued from CachingFileManager).
For what it's worth, I ran this through Google's internal tests of the open source ecosystem, and only turned up failures in ArviZ -- which will be fixed by arviz-devs/arviz#2464
Thanks @shoyer! Quick ping in case anyone else is able to take a look at this?
This looks very interesting. I'll want to try it out. Just some minor comments and suggestions.
One part of this change that is worth highlighting for discussion is that I've put the logic for opening a netCDF file from

The goal here was to simplify the interface for writing backends, by using the file interface for reading bytes. We could use a similar strategy for converting Path objects into strings.

However, it occurs to me that this won't work for reading bytes with netCDF4, which has a different interface for reading/writing bytes in memory directly. So probably I should move this logic back into the backend classes.
Another change we should probably make is returning a `memoryview` object rather than `bytes` from `to_netcdf()`, which would allow us to avoid an unnecessary memory copy. This would technically break backwards compatibility for existing users of `to_netcdf()` with scipy, but is safe enough that it may or may not be worth a deprecation cycle.
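The copy being discussed can be seen with plain `BytesIO`: `getvalue()` materializes a new `bytes` object, whereas `getbuffer()` hands back a zero-copy `memoryview` over the same memory. A stdlib-only sketch:

```python
import io

buf = io.BytesIO(b"\x89HDF\r\n\x1a\n" + b"\x00" * 8)

as_bytes = buf.getvalue()   # materializes a full copy of the buffer
as_view = buf.getbuffer()   # zero-copy memoryview over the same memory

assert bytes(as_view) == as_bytes

# The view pins the buffer: BytesIO refuses writes and close() while
# any exported view is alive, so callers must release it first.
as_view.release()
```

The pinning behavior is one reason a `memoryview` return value is a slightly sharper API contract than `bytes`.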
This latest commit includes:

- Resolves issues identified by @kmuehlbauer and @keewis (thank you for taking a look!)
- Switches the return value of `to_netcdf(engine='h5netcdf')` to `memoryview` rather than `bytes`, which removes an unnecessary memory copy.
- Supports opening `memoryview` objects in `open_dataset()` and other opener functions
- Moves logic for handling `bytes`/`memoryview` objects in `open_dataset()` to backend classes

For now, I've only added a deprecation warning about switching from `bytes` to `memoryview` with `to_netcdf(engine='scipy')`, but I'm not 100% sure it's worth a deprecation cycle here.
I've gone ahead and dropped the legacy behavior of overwriting
Any final reviews?
This is an amazing feature, thanks @mjwillson and @shoyer!
I've added some minor questions, which came up when reading through the PR. Should not prevent us from getting this in.
```diff
 filename_or_obj : str, Path, file-like, bytes, memoryview or DataStore
     Strings and Path objects are interpreted as a path to a netCDF file
     or an OpenDAP URL and opened with python-netCDF4, unless the filename
     ends with .gz, in which case the file is gunzipped and opened with
-    scipy.io.netcdf (only netCDF3 supported). Byte-strings or file-like
-    objects are opened by scipy.io.netcdf (netCDF3) or h5py (netCDF4/HDF).
+    scipy.io.netcdf (only netCDF3 supported). Bytes, memoryview and
+    file-like objects are opened by scipy.io.netcdf (netCDF3) or h5netcdf
+    (netCDF4).
```
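Following the docstring above, in-memory netCDF data can be handed straight back to `open_dataset()`. A roundtrip sketch (assumes xarray with the scipy backend installed; the dataset is illustrative):

```python
import io

import xarray as xr

ds = xr.Dataset({"t": ("time", [0.0, 1.0, 2.0])})

# Dataset.to_netcdf with no target returns the file contents in memory
# (netCDF3 via the scipy engine in this configuration).
data = ds.to_netcdf()

# Bytes / file-like objects are accepted directly by open_dataset.
reopened = xr.open_dataset(io.BytesIO(bytes(data)))
```

Wrapping the buffer in `bytes(...)` keeps the sketch working whether `to_netcdf()` returns `bytes` or, as proposed in this PR, a `memoryview`.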
I'm wondering if the explicit mention of netCDF file here (and in the other `open_*` functions) is still valid in light of all the other engines which handle files of any provenance. A change to this might be better done in another PR. I just stumbled over this and wanted to keep a log of it.
Yes, this is a good consideration for updating later.
And some other related improvements to reading and writing of NetCDF files to/from bytes or file-like objects.

* Allows use of the h5netcdf engine when writing to file-like objects (such as `BytesIO`); stops forcing use of the scipy backend in this case (which is incompatible with groups and `DataTree`). Makes h5netcdf the default engine for `DataTree.to_netcdf` rather than leaving the choice of default up to `Dataset.to_netcdf`.
* Allows `DataTree.to_netcdf` to return bytes when the filepath argument is omitted (similar to `Dataset.to_netcdf`).
* `whats-new.rst` entry.