
Support for DataTree.to_netcdf to write to a file-like object or bytes #10571

Merged
shoyer merged 17 commits into pydata:main from datatree_netcdf_bytes on Aug 8, 2025

Conversation

mjwillson
Contributor

And some other related improvements to reading and writing of NetCDF files to/from bytes or file-like objects.

  • Allows use of the h5netcdf engine when writing to file-like objects (such as BytesIO), instead of forcing use of the scipy backend in this case (which is incompatible with groups and DataTree). Makes h5netcdf the default engine for DataTree.to_netcdf rather than leaving the choice of default up to Dataset.to_netcdf.
  • Allows use of h5netcdf engine to read from a bytes object.
  • Allows DataTree.to_netcdf to return bytes when the filepath argument is omitted (similar to Dataset.to_netcdf); see the sketch below.
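A minimal sketch of the intended usage, assuming xarray with this PR applied (the example tree, variable names, and dimension name are made up for illustration):

```python
import io
import xarray as xr

tree = xr.DataTree.from_dict(
    {
        "/": xr.Dataset({"x": ("points", [1.0, 2.0, 3.0])}),
        "/child": xr.Dataset({"y": ("points", [4.0, 5.0, 6.0])}),
    }
)

# Omitting the filepath returns the serialized netCDF data in memory,
# written with the h5netcdf engine so that groups are supported.
data = tree.to_netcdf(engine="h5netcdf")

# Round-trip through a file-like object; the PR also allows passing the
# raw bytes straight to the h5netcdf engine.
roundtripped = xr.open_datatree(io.BytesIO(bytes(data)), engine="h5netcdf")
```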


welcome bot commented Jul 25, 2025

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@github-actions github-actions bot added the topic-backends, topic-DataTree (Related to the implementation of a DataTree class), and io labels Jul 25, 2025
@mjwillson mjwillson force-pushed the datatree_netcdf_bytes branch 3 times, most recently from efe5751 to 32d1027 Compare July 26, 2025 00:18
Member

@shoyer shoyer left a comment

Amazing, thanks @mjwillson !

@mjwillson mjwillson force-pushed the datatree_netcdf_bytes branch 3 times, most recently from 0092892 to e4f51f5 Compare July 28, 2025 23:26
finally:
    if not multifile and compute:  # type: ignore[redundant-expr]
        store.close()

if path_or_file is None:
    value = target.getvalue()
    close_bytesio()
Member

This clean-up step is likely optional, because BytesIO objects will just get garbage collected:
https://www.reddit.com/r/learnpython/comments/10ermcl/does_bytesio_need_to_be_closed/

Contributor Author

@mjwillson mjwillson Jul 29, 2025

Yes, this shouldn't be needed to free resources, but without closing it, tests show some new and ominous-looking warnings about an error in scipy's netcdf_file.close that's triggered within a destructor (either netcdf_file.__del__ or CachingFileManager.__del__, not sure) and hence doesn't bubble up.

LMK if you'd like a comment about it.
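A generic illustration of that last point (not the actual xarray/scipy code): an exception raised inside __del__ cannot propagate to the caller, so Python only reports it as an "Exception ignored" message, which is why it shows up as an ominous warning rather than a test failure.

```python
class Resource:
    def close(self):
        # Stand-in for a backend's close() failing during cleanup.
        raise ValueError("problem while closing")

    def __del__(self):
        # Errors here are reported as "Exception ignored in: ..." by the
        # interpreter instead of bubbling up to the surrounding code.
        self.close()


r = Resource()
del r  # prints an "Exception ignored" message; nothing is raised
```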

mjwillson and others added 3 commits July 29, 2025 10:29
…e objects.

* Allows use of the h5netcdf engine when writing to file-like objects (such as BytesIO), instead of forcing use of the scipy backend in this case (which is incompatible with groups and DataTree). Makes h5netcdf the default engine for DataTree.to_netcdf rather than leaving the choice of default up to Dataset.to_netcdf.
* Allows use of the h5netcdf engine to read from a bytes object.
* Allows DataTree.to_netcdf to return bytes when the filepath argument is omitted (similar to Dataset.to_netcdf).
…re bytes were being returned before the h5py.File had been closed, which it appears is needed for it to finish writing a valid file. This required a further workaround to prevent the BytesIO being closed by the scipy backend when it is used in a similar way.
@mjwillson mjwillson force-pushed the datatree_netcdf_bytes branch from e4f51f5 to 0176558 Compare July 29, 2025 09:29
shoyer added 2 commits July 29, 2025 12:18
I also updated the h5netcdf backend to silence warnings from not closing
files that were already open (which are issued from CachingFileManager).
@shoyer
Member

shoyer commented Jul 31, 2025

For what it's worth, I ran this through Google's internal tests of the open source ecosystem, and only turned up failures in ArviZ -- which will be fixed by arviz-devs/arviz#2464

@mjwillson
Contributor Author

Thanks @shoyer! Quick ping in case anyone else is able to take a look at this?

Contributor

@kmuehlbauer kmuehlbauer left a comment

This looks very interesting. I'll want to try it out. Just some minor comments and suggestions.

@shoyer
Member

shoyer commented Aug 6, 2025

One part of this change that is worth highlighting for discussion is that I've put the logic for opening a netCDF file from bytes (by converting them into BytesIO) in the high level functions like xarray.open_dataset(), rather than backend specific open_dataset() methods.

The goal here was to simplify the interface for writing backends, by using the file interface for reading bytes. We could use a similar strategy for converting Path objects into strings.

However, it occurs to me that this won't work for reading bytes with netCDF4, which has a different interface for reading/writing bytes in memory directly. So probably I should move this logic back into the backend classes.
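A rough sketch of the dispatch being weighed here; the helper names are hypothetical, and the netCDF4 `memory=` keyword is that library's existing in-memory read path, not something added by this PR:

```python
import io


def normalize_in_memory_input(filename_or_obj):
    # What the high-level functions currently do: wrap raw bytes in a
    # file-like object so backends that read file-like objects
    # (h5netcdf, scipy) can handle them uniformly.
    if isinstance(filename_or_obj, (bytes, bytearray, memoryview)):
        return io.BytesIO(bytes(filename_or_obj))
    return filename_or_obj


def open_with_netcdf4(filename_or_obj):
    # netCDF4 does not read from file-like objects; it takes the raw buffer
    # directly, which is why the wrapping logic arguably belongs in each
    # backend class rather than in open_dataset().
    import netCDF4

    if isinstance(filename_or_obj, (bytes, bytearray, memoryview)):
        return netCDF4.Dataset("in-memory.nc", mode="r", memory=filename_or_obj)
    return netCDF4.Dataset(filename_or_obj, mode="r")
```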

@shoyer
Member

shoyer commented Aug 6, 2025

Another change we should probably do is returning a memoryview object rather than bytes from to_netcdf(), which would allow us to avoid an unnecessary memory copy. This would technically be breaking backwards compatibility for existing users of to_netcdf() with scipy, but is safe enough that it may or may not be worth a deprecation cycle.
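The underlying point, illustrated with a plain BytesIO rather than xarray's actual code: getvalue() copies the whole buffer, while getbuffer() hands back a zero-copy memoryview that still supports the usual downstream operations.

```python
import io

buf = io.BytesIO(b"...netCDF bytes written by the backend...")

as_bytes = buf.getvalue()   # copies the entire buffer into a new bytes object
as_view = buf.getbuffer()   # zero-copy memoryview over the same buffer

len(as_view)                # size checks work as with bytes
bytes(as_view)              # explicit copy, only when bytes are truly needed
io.BytesIO(bytes(as_view))  # wrap again for file-like reading
```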

Member

@shoyer shoyer left a comment

This latest commit includes:

  1. Resolves issues identified by @kmuehlbauer and @keewis (thank you for taking a look!)
  2. Switches the return value of to_netcdf(engine='h5netcdf') to memoryview rather than bytes, which removes an unnecessary memory copy.
  3. Supports opening memoryview objects in open_dataset() and other opener functions
  4. Moves logic of handling bytes/memoryview objects in open_dataset() to backend classes

For now, I've only added a deprecation warning about switching from bytes to memoryview with to_netcdf(engine='scipy'), but I'm not 100% sure it's worth a deprecation cycle here.
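For downstream code that specifically needs bytes, a simple hedge (assuming `ds` is any Dataset or DataTree being serialized in memory) works with both the old and new return types:

```python
raw = ds.to_netcdf()    # bytes with the scipy engine, memoryview with h5netcdf
raw_bytes = bytes(raw)  # no-op for bytes input, an explicit copy for memoryview
```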

@github-actions github-actions bot added the topic-zarr (Related to zarr storage library) label Aug 7, 2025
@shoyer
Member

shoyer commented Aug 7, 2025

I've gone ahead and dropped the legacy behavior of overwriting engine to "scipy" when it is incorrectly specified. This felt like more of a bug than a feature to me.

@shoyer
Member

shoyer commented Aug 8, 2025

Any final reviews?

Contributor

@kmuehlbauer kmuehlbauer left a comment

This is an amazing feature, thanks @mjwillson and @shoyer!

I've added some minor questions, which came up when reading through the PR. They should not prevent us from getting this in.

Comment on lines +542 to +548
  filename_or_obj : str, Path, file-like, bytes, memoryview or DataStore
      Strings and Path objects are interpreted as a path to a netCDF file
      or an OpenDAP URL and opened with python-netCDF4, unless the filename
      ends with .gz, in which case the file is gunzipped and opened with
-     scipy.io.netcdf (only netCDF3 supported). Byte-strings or file-like
-     objects are opened by scipy.io.netcdf (netCDF3) or h5py (netCDF4/HDF).
+     scipy.io.netcdf (only netCDF3 supported). Bytes, memoryview and
+     file-like objects are opened by scipy.io.netcdf (netCDF3) or h5netcdf
+     (netCDF4).
Contributor

I'm wondering if the explicit mention of a netCDF file here (and in the other open_* functions) is still valid, in light of all the other engines which handle files of any provenance. A change to this might be better done in another PR; I just stumbled over this and wanted to keep a log of it.

Member

Yes, this is a good consideration for updating later.

@shoyer shoyer merged commit ea9f02b into pydata:main Aug 8, 2025
35 of 37 checks passed

welcome bot commented Aug 8, 2025

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! celebration gif
