shoyer
Member

@shoyer shoyer commented Aug 11, 2025

This PR includes several related improvements to netCDF reading/writing:

  1. It adds support for reading and writing in-memory netCDF4 files using netCDF4-Python, including with xarray.DataTree.
  2. It adds extensive tests for reading and writing in-memory and file-like data, across all netCDF backends.
  3. Xarray objects opened from file-like objects with engine='h5netcdf' can now be pickled, as long as the underlying file-like object also supports pickling.
  4. Closing Xarray objects opened from file-like objects with engine='scipy' no longer closes the underlying file, consistent with the h5netcdf backend.
  5. I've added extensive type annotations throughout xarray.backends.file_manager and xarray.backends.locks.
  6. There is a new PickleableFileManager class, which is used for wrapping file-like objects that do not natively support pickling (e.g., netCDF4.Dataset and h5netcdf.File) in cases where a global cache is not desirable (e.g., for netCDF files opened from bytes in memory, or from existing file objects).
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
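
Item 6 describes wrapping handles that cannot be pickled directly. The sketch below illustrates the general idea using only the standard library: the wrapper pickles what is needed to reopen the data (here, the raw bytes) rather than the live handle. `ReopenOnUnpickle` is a hypothetical name and is not xarray's `PickleableFileManager`, whose actual interface is not shown in this thread; `io.BufferedReader` is only a stand-in for an unpicklable handle such as `netCDF4.Dataset`.

```python
import io
import pickle


class ReopenOnUnpickle:
    """Hypothetical stand-in, not xarray's PickleableFileManager."""

    def __init__(self, data: bytes):
        self._data = data
        # io.BufferedReader cannot be pickled, much like an open netCDF handle.
        self._handle = io.BufferedReader(io.BytesIO(data))

    def acquire(self) -> io.BufferedReader:
        return self._handle

    def close(self) -> None:
        self._handle.close()

    def __getstate__(self):
        # Serialize only what is needed to reopen; drop the live handle.
        return {"data": self._data}

    def __setstate__(self, state):
        # Reopen a fresh handle from the stored bytes.
        self.__init__(state["data"])


# Placeholder payload, not a real netCDF file.
manager = ReopenOnUnpickle(b"CDF\x01 placeholder payload")
restored = pickle.loads(pickle.dumps(manager))
header = restored.acquire().read(3)
```

Each unpickled copy owns its own handle, so no global cache is involved, which is the situation item 6 describes for data opened from bytes in memory.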

@github-actions github-actions bot added the topic-backends, topic-zarr, topic-DataTree, and io labels Aug 11, 2025
@shoyer
Member Author

shoyer commented Aug 11, 2025

  • It refactors the internal structure of DataTree.to_netcdf() and DataTree.to_zarr() to use lower level interfaces, rather than calling Dataset methods. This allows for properly supporting compute=False (and likely various other improvements).

I am thinking I might try to split this into a separate PR, because it's unrelated to the netCDF4 in-memory changes.

This PR includes a handful of significant changes:

1. It refactors the internal structure of `DataTree.to_netcdf()` and
   `DataTree.to_zarr()` to use lower level interfaces, rather than
   calling `Dataset` methods. This allows for properly supporting
   `compute=False` (and likely various other improvements).
2. Reading and writing in-memory data with netCDF4-python is now
   supported, including DataTree.
3. The `engine` argument in `DataTree.to_netcdf()` is now set
   consistently with `Dataset.to_netcdf()`, preferring `netcdf4` to
   `h5netcdf`.
4. Calling `Dataset.to_netcdf()` without a target now always returns a
   `memoryview` object, *including* in the case where `engine='scipy'`
   is used (which currently returns `bytes`). This is a breaking change,
   rather than merely issuing a warning as is done in pydata#10571. I believe
   it probably makes sense to make this a breaking change because (1)
   it offers significant performance benefits, (2) the default behavior
   without specifying an engine will already change (because `netcdf4`
   is preferred to the `scipy` backend) and (3) restoring previous
   behavior is easy (by wrapping the memoryview with `bytes()`).
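
The trade-off in the last item can be seen with plain Python buffers, independent of xarray: a `memoryview` exposes an existing buffer without copying, while `bytes()` makes an independent copy. In this sketch the `bytearray` named `buffer` is a hypothetical stand-in for serialized file contents; nothing here calls xarray.

```python
# Stand-in for serialized file contents (hypothetical payload).
buffer = bytearray(b"\x89HDF\r\n\x1a\n" + bytes(1024))

view = memoryview(buffer)  # zero-copy view onto the same memory
copy = bytes(view)         # independent copy, matching the old return type

# Mutating the buffer is visible through the view but not through the copy.
buffer[0] = 0
view_first, copy_first = view[0], copy[0]
```

Wrapping with `bytes()` costs one full copy of the data, which is the overhead that returning a `memoryview` avoids.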

mypy
@shoyer shoyer changed the title Rewrite DataTree.to_netcdf and support netCDF4 in-memory Support reading and writing memoryviews using engine='netcdf4' Sep 8, 2025
@shoyer shoyer changed the title Support reading and writing memoryviews using engine='netcdf4' In-memory reads/writes with netCDF4 and internal backends clean-up Sep 8, 2025
@shoyer shoyer changed the title In-memory reads/writes with netCDF4 and internal backends clean-up In-memory reads/writes with netCDF4 Sep 8, 2025
@shoyer shoyer marked this pull request as ready for review September 8, 2025 04:22
@shoyer
Member Author

shoyer commented Sep 8, 2025

This PR is ready for review now.

@shoyer shoyer changed the title In-memory reads/writes with netCDF4 NetCDF IO cleanup, including in-memory reads/writes with netCDF4 Sep 9, 2025
@shoyer
Member Author

shoyer commented Sep 9, 2025

I added a regression test for #10712

@shoyer shoyer removed the topic-zarr and topic-DataTree labels Sep 9, 2025
@github-actions github-actions bot added the topic-DataTree label Sep 9, 2025
Contributor

@kmuehlbauer kmuehlbauer left a comment

🚀 A big win! Thank you so much, Stephan!

@jsignell jsignell linked an issue Sep 11, 2025 that may be closed by this pull request
5 tasks
Labels
io, topic-backends, topic-DataTree, topic-typing
Development

Successfully merging this pull request may close these issues.

Regression: "h5py objects cannot be pickled" with cloudpickle
2 participants