
InferenceData.from_netcdf should not close netCDF files with unloaded data #2463

@shoyer

Describe the bug
The from_netcdf method uses xarray.open_dataset() as a context manager, even when not in "eager" mode:

    with xr.open_dataset(filename, group=f"{base_group}/{group}", **group_kws) as data:
        if rcParams["data.load"] == "eager":
            groups[group] = data.load()
        else:
            groups[group] = data
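
Which branch runs is controlled by a real ArviZ rcParam that users can toggle; forcing eager loading sidesteps the bug by reading everything into memory before the file is closed:

    import arviz as az

    # Load all data into memory immediately, so nothing lazy is left
    # pointing at a closed file.
    az.rcParams["data.load"] = "eager"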

This results in the underlying netCDF file being closed before the new InferenceData object is returned. Because Xarray loads data from disk lazily, the underlying data of these variables may no longer be accessible.

This bug is masked by a caching layer in Xarray, which does not immediately close files opened from disk unless a threshold number of open files (128 by default) is exceeded.
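
For illustration, here is a minimal, self-contained sketch of the failure mode using a file-like object, which the cache cannot re-open (the buffer and variable names are hypothetical; with current Xarray releases the caching layer may still mask the error):

    import io

    import numpy as np
    import xarray as xr

    # Write a tiny dataset to an in-memory buffer; file-like objects
    # cannot be re-created by Xarray's file cache.
    buf = io.BytesIO()
    xr.Dataset({"mu": ("draw", np.arange(4.0))}).to_netcdf(buf, engine="h5netcdf")
    buf.seek(0)

    with xr.open_dataset(buf, engine="h5netcdf") as data:
        lazy = data["mu"]  # loading is deferred; no bytes read yet

    # The underlying h5netcdf file is closed on exiting the `with`
    # block, so materializing the values can now fail with the same
    # TypeError shown in the traceback below.
    lazy.values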

To Reproduce
pydata/xarray#10571 (not yet merged) fixes a bug in Xarray by removing the caching layer for netCDF files opened from file-like objects with h5netcdf (such files must always be kept open, because the caching layer cannot re-create them).

For ArviZ, this specifically causes immediate errors in calls to InferenceData.from_netcdf that use the memory argument. Running ArviZ's test suite against that PR triggers many test failures, which seem to come from test cases that access local test data with calls like load_arviz_data("centered_eight"). The errors manifest most directly as a strange TypeError from h5netcdf, raised because the underlying file has been closed:

tests/base_tests/test_stats_utils.py ................................... [ 11%]
........................................................................ [ 35%]
........................................................................ [ 59%]
........................................................................ [ 82%]
...........................................F.F..F...                     [100%]

=================================== FAILURES ===================================
____________________________ test_stats_variance_2d ____________________________

    def test_stats_variance_2d():
        """Test for stats_variance_2d."""
        data_1 = np.random.randn(1000, 1000)
        data_2 = np.random.randn(1000000)
>       school = load_arviz_data("centered_eight").posterior["mu"].values

tests/base_tests/test_stats_utils.py:320: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../xarray/core/dataarray.py:797: in values
    return self.variable.values
../xarray/core/variable.py:536: in values
    return _as_array_or_item(self._data)
../xarray/core/variable.py:316: in _as_array_or_item
    data = np.asarray(data)
../xarray/core/indexing.py:509: in __array__
    return np.asarray(self.get_duck_array(), dtype=dtype, copy=copy)
../xarray/core/indexing.py:843: in get_duck_array
    self._ensure_cached()
../xarray/core/indexing.py:840: in _ensure_cached
    self.array = as_indexable(self.array.get_duck_array())
../xarray/core/indexing.py:797: in get_duck_array
    return self.array.get_duck_array()
../xarray/core/indexing.py:652: in get_duck_array
    array = self.array[self.key]
../xarray/backends/h5netcdf_.py:61: in __getitem__
    return indexing.explicit_indexing_adapter(
../xarray/core/indexing.py:1021: in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)
../xarray/backends/h5netcdf_.py:68: in _getitem
    return array[key]
../h5netcdf/core.py:533: in __getitem__
    string_info = self._root._h5py.check_string_dtype(self._h5ds.dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <Closed h5netcdf.Variable>

    @property
    def _h5ds(self):
        # Always refer to the root file and store not h5py object
        # subclasses:
>       return self._root._h5file[self._h5path]
E       TypeError: 'NoneType' object is not subscriptable

../h5netcdf/core.py:129: TypeError

It is likely possible to trigger errors even for netCDF files loaded from disk, but you would need to open enough files (more than 128) to exceed the cache's limit.
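
A hedged sketch of that disk-file scenario (untested; file_cache_maxsize is a real Xarray option, shrunk here to 1 so a single extra open evicts the previous handle instead of needing 128+ files):

    import xarray as xr
    from arviz import load_arviz_data

    # Make the file handle cache as small as possible.
    xr.set_options(file_cache_maxsize=1)

    first = load_arviz_data("centered_eight")
    second = load_arviz_data("non_centered_eight")  # evicts the first handle

    # The first file's dataset was already "closed" by from_netcdf and
    # its cached handle has now been evicted, so this access may fail
    # as in the traceback above.
    first.posterior["mu"].values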

Expected behavior
ArviZ should avoid closing files whose data has not already been loaded from disk, e.g., you might try:

    data = xr.open_dataset(filename, group=f"{base_group}/{group}", **group_kws)
    if rcParams["data.load"] == "eager":
        with data:
            groups[group] = data.load()
    else:
        groups[group] = data

Note that this will leave unclosed files when not using eager mode. That is not ideal -- you might also make InferenceData a context manager and/or support explicitly calling InferenceData.close(), as sketched below. On the other hand, ArviZ users likely already rely inadvertently on files being left open (because of the Xarray caching layer).
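
As a sketch of that alternative (hypothetical methods, not existing ArviZ API; this assumes InferenceData tracks its group names in _groups_all and that each group is an xarray.Dataset, whose real close() method releases its file):

    class InferenceData:
        # ... existing ArviZ implementation ...

        def close(self):
            """Close the files backing every group (hypothetical method)."""
            for group in self._groups_all:
                getattr(self, group).close()  # xarray.Dataset.close()

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            self.close()

Users could then bound file lifetimes explicitly (the filename is hypothetical):

    with InferenceData.from_netcdf("trace.nc") as idata:
        mu = idata.posterior["mu"].values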
