
InferenceData.from_netcdf should not close netCDF files with unloaded data #2463

@shoyer

Describe the bug
The from_netcdf method uses xarray.open_dataset() as a context manager, even when not in "eager" mode:

    with xr.open_dataset(filename, group=f"{base_group}/{group}", **group_kws) as data:
        if rcParams["data.load"] == "eager":
            groups[group] = data.load()
        else:
            groups[group] = data
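
Which branch runs is controlled by a real ArviZ rcParam that users can toggle; forcing eager loading sidesteps the bug by reading everything into memory before the file is closed:

    import arviz as az

    # Load all data into memory immediately, so nothing lazy is left
    # pointing at a closed file.
    az.rcParams["data.load"] = "eager"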

This results in the underlying netCDF file being closed before the new InferenceData object is returned. Because Xarray loads data from disk lazily, the underlying data of these variables may no longer be accessible.

This bug is masked by a caching layer in Xarray, which does not immediately close files opened from disk unless a threshold number of open files (128 by default) is exceeded.
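
For illustration, here is a minimal, self-contained sketch of the failure mode using a file-like object, which the cache cannot re-open (the buffer and variable names are hypothetical; with current Xarray releases the caching layer may still mask the error):

    import io

    import numpy as np
    import xarray as xr

    # Write a tiny dataset to an in-memory buffer; file-like objects
    # cannot be re-created by Xarray's file cache.
    buf = io.BytesIO()
    xr.Dataset({"mu": ("draw", np.arange(4.0))}).to_netcdf(buf, engine="h5netcdf")
    buf.seek(0)

    with xr.open_dataset(buf, engine="h5netcdf") as data:
        lazy = data["mu"]  # loading is deferred; no bytes read yet

    # The underlying h5netcdf file is closed on exiting the `with`
    # block, so materializing the values can now fail with the same
    # TypeError shown in the traceback below.
    lazy.values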

To Reproduce
pydata/xarray#10571 (not yet merged) fixes a bug in Xarray by removing the caching layer for netCDF files opened from file-like objects with h5netcdf (such files must always be kept open, because the caching layer cannot re-create them).

For ArviZ, this specifically causes immediate errors in calls to InferenceData.from_netcdf that use the memory argument. Running ArviZ's test suite against that PR triggers many test failures, which seem to come from test cases that access local test data with calls like load_arviz_data("centered_eight"). The errors manifest most directly as a strange TypeError from h5netcdf, raised because the underlying file has been closed:

tests/base_tests/test_stats_utils.py ................................... [ 11%]
........................................................................ [ 35%]
........................................................................ [ 59%]
........................................................................ [ 82%]
...........................................F.F..F...                     [100%]

=================================== FAILURES ===================================
____________________________ test_stats_variance_2d ____________________________

    def test_stats_variance_2d():
        """Test for stats_variance_2d."""
        data_1 = np.random.randn(1000, 1000)
        data_2 = np.random.randn(1000000)
>       school = load_arviz_data("centered_eight").posterior["mu"].values

tests/base_tests/test_stats_utils.py:320: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../xarray/core/dataarray.py:797: in values
    return self.variable.values
../xarray/core/variable.py:536: in values
    return _as_array_or_item(self._data)
../xarray/core/variable.py:316: in _as_array_or_item
    data = np.asarray(data)
../xarray/core/indexing.py:509: in __array__
    return np.asarray(self.get_duck_array(), dtype=dtype, copy=copy)
../xarray/core/indexing.py:843: in get_duck_array
    self._ensure_cached()
../xarray/core/indexing.py:840: in _ensure_cached
    self.array = as_indexable(self.array.get_duck_array())
../xarray/core/indexing.py:797: in get_duck_array
    return self.array.get_duck_array()
../xarray/core/indexing.py:652: in get_duck_array
    array = self.array[self.key]
../xarray/backends/h5netcdf_.py:61: in __getitem__
    return indexing.explicit_indexing_adapter(
../xarray/core/indexing.py:1021: in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)
../xarray/backends/h5netcdf_.py:68: in _getitem
    return array[key]
../h5netcdf/core.py:533: in __getitem__
    string_info = self._root._h5py.check_string_dtype(self._h5ds.dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Closed h5netcdf.Variable>

    @property
    def _h5ds(self):
        # Always refer to the root file and store not h5py object
        # subclasses:
>       return self._root._h5file[self._h5path]
E       TypeError: 'NoneType' object is not subscriptable

../h5netcdf/core.py:129: TypeError

It is likely possible to trigger errors even for netCDF files loaded from disk, but you would need to open enough files (more than 128) to exceed the cache's limit.
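
A hedged sketch of that disk-file scenario (untested; file_cache_maxsize is a real Xarray option, shrunk here to 1 so a single extra open evicts the previous handle instead of needing 128+ files):

    import xarray as xr
    from arviz import load_arviz_data

    # Make the file handle cache as small as possible.
    xr.set_options(file_cache_maxsize=1)

    first = load_arviz_data("centered_eight")
    second = load_arviz_data("non_centered_eight")  # evicts the first handle

    # The first file's dataset was already "closed" by from_netcdf and
    # its cached handle has now been evicted, so this access may fail
    # as in the traceback above.
    first.posterior["mu"].values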

Expected behavior
ArviZ should avoid closing files whose data has not already been loaded from disk, e.g., you might try:

    data = xr.open_dataset(filename, group=f"{base_group}/{group}", **group_kws)
    if rcParams["data.load"] == "eager":
        with data:
            groups[group] = data.load()
    else:
        groups[group] = data

Note that this will leave unclosed files when not using eager mode. That is not ideal -- you might also make InferenceData a context manager and/or support explicitly calling InferenceData.close(), as sketched below. On the other hand, ArviZ users likely already rely inadvertently on files being left open (because of the Xarray caching layer).
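
As a sketch of that alternative (hypothetical methods, not existing ArviZ API; this assumes InferenceData tracks its group names in _groups_all and that each group is an xarray.Dataset, whose real close() method releases its file):

    class InferenceData:
        # ... existing ArviZ implementation ...

        def close(self):
            """Close the files backing every group (hypothetical method)."""
            for group in self._groups_all:
                getattr(self, group).close()  # xarray.Dataset.close()

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            self.close()

Users could then bound file lifetimes explicitly (the filename is hypothetical):

    with InferenceData.from_netcdf("trace.nc") as idata:
        mu = idata.posterior["mu"].values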
