
Support for DataTree.to_netcdf to write to a file-like object or bytes #10571

Merged
shoyer merged 17 commits into pydata:main from datatree_netcdf_bytes on Aug 8, 2025

Conversation

mjwillson
Contributor

And some other related improvements to reading and writing of NetCDF files to/from bytes or file-like objects.

  • Allows use of the h5netcdf engine when writing to file-like objects (such as BytesIO), instead of forcing use of the scipy backend in this case (which is incompatible with groups and DataTree). Makes h5netcdf the default engine for DataTree.to_netcdf rather than leaving the choice of default up to Dataset.to_netcdf.
  • Allows use of h5netcdf engine to read from a bytes object.
  • Allows DataTree.to_netcdf to return bytes when the filepath argument is omitted (similar to Dataset.to_netcdf); see the sketch below.
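A minimal sketch of the intended usage, assuming xarray with this PR applied (the example tree, variable names, and dimension name are made up for illustration):

```python
import io
import xarray as xr

tree = xr.DataTree.from_dict(
    {
        "/": xr.Dataset({"x": ("points", [1.0, 2.0, 3.0])}),
        "/child": xr.Dataset({"y": ("points", [4.0, 5.0, 6.0])}),
    }
)

# Omitting the filepath returns the serialized netCDF data in memory,
# written with the h5netcdf engine so that groups are supported.
data = tree.to_netcdf(engine="h5netcdf")

# Round-trip through a file-like object; the PR also allows passing the
# raw bytes straight to the h5netcdf engine.
roundtripped = xr.open_datatree(io.BytesIO(bytes(data)), engine="h5netcdf")
```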


welcome bot commented Jul 25, 2025

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@github-actions github-actions bot added the topic-backends, topic-DataTree (Related to the implementation of a DataTree class), and io labels Jul 25, 2025
@mjwillson mjwillson force-pushed the datatree_netcdf_bytes branch 3 times, most recently from efe5751 to 32d1027 Compare July 26, 2025 00:18
Member

@shoyer shoyer left a comment

Amazing, thanks @mjwillson !

@mjwillson mjwillson force-pushed the datatree_netcdf_bytes branch 3 times, most recently from 0092892 to e4f51f5 Compare July 28, 2025 23:26
finally:
    if not multifile and compute:  # type: ignore[redundant-expr]
        store.close()

if path_or_file is None:
    value = target.getvalue()
    close_bytesio()
Member

This clean-up step is likely optional, because BytesIO objects will just get garbage collected:
https://www.reddit.com/r/learnpython/comments/10ermcl/does_bytesio_need_to_be_closed/

Contributor Author

@mjwillson mjwillson Jul 29, 2025

Yes, this shouldn't be needed to free resources, but without closing it, tests show some new and ominous-looking warnings about an error in scipy's netcdf_file.close that's triggered within a destructor (either netcdf_file.__del__ or CachingFileManager.__del__, not sure) and hence doesn't bubble up.

LMK if you'd like a comment about it.
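A generic illustration of that last point (not the actual xarray/scipy code): an exception raised inside __del__ cannot propagate to the caller, so Python only reports it as an "Exception ignored" message, which is why it shows up as an ominous warning rather than a test failure.

```python
class Resource:
    def close(self):
        # Stand-in for a backend's close() failing during cleanup.
        raise ValueError("problem while closing")

    def __del__(self):
        # Errors here are reported as "Exception ignored in: ..." by the
        # interpreter instead of bubbling up to the surrounding code.
        self.close()


r = Resource()
del r  # prints an "Exception ignored" message; nothing is raised
```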

mjwillson and others added 3 commits July 29, 2025 10:29
…e objects.

* Allows use of the h5netcdf engine when writing to file-like objects (such as BytesIO), instead of forcing use of the scipy backend in this case (which is incompatible with groups and DataTree). Makes h5netcdf the default engine for DataTree.to_netcdf rather than leaving the choice of default up to Dataset.to_netcdf.
* Allows use of the h5netcdf engine to read from a bytes object.
* Allows DataTree.to_netcdf to return bytes when the filepath argument is omitted (similar to Dataset.to_netcdf).
…re bytes were being returned before the h5py.File had been closed, which it appears is needed for it to finish writing a valid file. This required a further workaround to prevent the BytesIO being closed by the scipy backend when it is used in a similar way.
@mjwillson mjwillson force-pushed the datatree_netcdf_bytes branch from e4f51f5 to 0176558 Compare July 29, 2025 09:29
shoyer added 2 commits July 29, 2025 12:18
I also updated the h5netcdf backend to silence warnings from not closing
files that were already open (which are issued from CachingFileManager).
@shoyer
Member

shoyer commented Jul 31, 2025

For what it's worth, I ran this through Google's internal tests of the open source ecosystem, and only turned up failures in ArviZ -- which will be fixed by arviz-devs/arviz#2464

@mjwillson
Contributor Author

Thanks @shoyer! Quick ping in case anyone else is able to take a look at this?

Contributor

@kmuehlbauer kmuehlbauer left a comment

This looks very interesting. I'll want to try it out. Just some minor comments and suggestions.

@shoyer
Member

shoyer commented Aug 6, 2025

One part of this change that is worth highlighting for discussion is that I've put the logic for opening a netCDF file from bytes (by converting them into BytesIO) in the high level functions like xarray.open_dataset(), rather than backend specific open_dataset() methods.

The goal here was to simplify the interface for writing backends, by using the file interface for reading bytes. We could use a similar strategy for converting Path objects into strings.

However, it occurs to me that this won't work for reading bytes with netCDF4, which has a different interface for reading/writing bytes in memory directly. So probably I should move this logic back into the backend classes.
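A rough sketch of the dispatch being weighed here; the helper names are hypothetical, and the netCDF4 `memory=` keyword is that library's existing in-memory read path, not something added by this PR:

```python
import io


def normalize_in_memory_input(filename_or_obj):
    # What the high-level functions currently do: wrap raw bytes in a
    # file-like object so backends that read file-like objects
    # (h5netcdf, scipy) can handle them uniformly.
    if isinstance(filename_or_obj, (bytes, bytearray, memoryview)):
        return io.BytesIO(bytes(filename_or_obj))
    return filename_or_obj


def open_with_netcdf4(filename_or_obj):
    # netCDF4 does not read from file-like objects; it takes the raw buffer
    # directly, which is why the wrapping logic arguably belongs in each
    # backend class rather than in open_dataset().
    import netCDF4

    if isinstance(filename_or_obj, (bytes, bytearray, memoryview)):
        return netCDF4.Dataset("in-memory.nc", mode="r", memory=filename_or_obj)
    return netCDF4.Dataset(filename_or_obj, mode="r")
```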

@shoyer
Member

shoyer commented Aug 6, 2025

Another change we should probably do is returning a memoryview object rather than bytes from to_netcdf(), which would allow us to avoid an unnecessary memory copy. This would technically be breaking backwards compatibility for existing users of to_netcdf() with scipy, but is safe enough that it may or may not be worth a deprecation cycle.
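The underlying point, illustrated with a plain BytesIO rather than xarray's actual code: getvalue() copies the whole buffer, while getbuffer() hands back a zero-copy memoryview that still supports the usual downstream operations.

```python
import io

buf = io.BytesIO(b"...netCDF bytes written by the backend...")

as_bytes = buf.getvalue()   # copies the entire buffer into a new bytes object
as_view = buf.getbuffer()   # zero-copy memoryview over the same buffer

len(as_view)                # size checks work as with bytes
bytes(as_view)              # explicit copy, only when bytes are truly needed
io.BytesIO(bytes(as_view))  # wrap again for file-like reading
```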

Member

@shoyer shoyer left a comment

This latest commit includes:

  1. Resolves issues identified by @kmuehlbauer and @keewis (thank you for taking a look!)
  2. Switches the return value of to_netcdf(engine='h5netcdf') to memoryview rather than bytes, which removes an unnecessary memory copy.
  3. Supports opening memoryview objects in open_dataset() and other opener functions
  4. Moves logic of handling bytes/memoryview objects in open_dataset() to backend classes

For now, I've only added a deprecation warning about switching from bytes to memoryview with to_netcdf(engine='scipy'), but I'm not 100% sure it's worth a deprecation cycle here.
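For downstream code that specifically needs bytes, a simple hedge (assuming `ds` is any Dataset or DataTree being serialized in memory) works with both the old and new return types:

```python
raw = ds.to_netcdf()    # bytes with the scipy engine, memoryview with h5netcdf
raw_bytes = bytes(raw)  # no-op for bytes input, an explicit copy for memoryview
```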

@github-actions github-actions bot added the topic-zarr (Related to zarr storage library) label Aug 7, 2025
@shoyer
Member

shoyer commented Aug 7, 2025

I've gone ahead and dropped the legacy behavior of overwriting engine to "scipy" when it is incorrectly specified. This felt like more of a bug than a feature to me.

@shoyer
Member

shoyer commented Aug 8, 2025

Any final reviews?

Contributor

@kmuehlbauer kmuehlbauer left a comment

This is an amazing feature, thanks @mjwillson and @shoyer!

I've added some minor questions, which came up when reading through the PR. They should not prevent us from getting this in.

Comment on lines +542 to +548
  filename_or_obj : str, Path, file-like, bytes, memoryview or DataStore
      Strings and Path objects are interpreted as a path to a netCDF file
      or an OpenDAP URL and opened with python-netCDF4, unless the filename
      ends with .gz, in which case the file is gunzipped and opened with
-     scipy.io.netcdf (only netCDF3 supported). Byte-strings or file-like
-     objects are opened by scipy.io.netcdf (netCDF3) or h5py (netCDF4/HDF).
+     scipy.io.netcdf (only netCDF3 supported). Bytes, memoryview and
+     file-like objects are opened by scipy.io.netcdf (netCDF3) or h5netcdf
+     (netCDF4).
Contributor

I'm wondering if the explicit mention of a netCDF file here (and in the other open_* functions) is still valid, in light of all the other engines which handle files of any provenance. A change to this might be better done in another PR; I just stumbled over this and wanted to keep a log of it.

Member

Yes, this is a good consideration for updating later.

@shoyer shoyer merged commit ea9f02b into pydata:main Aug 8, 2025
35 of 37 checks passed

welcome bot commented Aug 8, 2025

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! celebration gif
