Describe the bug
The `from_netcdf` method uses `xarray.open_dataset()` as a context manager, even when not in "eager" mode:
`arviz/arviz/data/inference_data.py`, lines 433 to 437 in 7747325:

```python
with xr.open_dataset(filename, group=f"{base_group}/{group}", **group_kws) as data:
    if rcParams["data.load"] == "eager":
        groups[group] = data.load()
    else:
        groups[group] = data
```
This results in the underlying netCDF files being closed before the new `InferenceData` object is returned. Because Xarray loads data from disk lazily, the underlying data of these variables may no longer be accessible.
This bug is masked by a caching layer in Xarray, which does not immediately close files opened from disk unless the number of open files exceeds a threshold (128 by default).
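The failure mode can be sketched with a stdlib-only toy. `LazyValues` below is an illustrative stand-in for Xarray's lazy variables, not a real Xarray class; the point is that a deferred read through a handle closed by a `with` block fails:

```python
# Stdlib-only toy; LazyValues mimics xarray's lazy loading but is not
# part of xarray.
import os
import tempfile

class LazyValues:
    """Holds an open file handle and defers reading until .values is
    accessed, mimicking xarray's lazy loading."""
    def __init__(self, fileobj):
        self._file = fileobj

    @property
    def values(self):
        self._file.seek(0)  # raises ValueError if the file is closed
        return self._file.read()

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    f.write("1 2 3")

handle = open(path)
with handle:                   # mirrors `with xr.open_dataset(...) as data:`
    var = LazyValues(handle)   # lazy: nothing has been read yet

# The handle is closed once the `with` block exits, so the deferred read
# fails -- analogous to reading from a <Closed h5netcdf.Variable>.
msg = ""
try:
    var.values
except ValueError as err:
    msg = str(err)
print("error:", msg)  # error: I/O operation on closed file.
os.remove(path)
```

In the real code path the same deferred read goes through h5netcdf, which surfaces the closed file as a `TypeError` rather than a `ValueError`.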
To Reproduce
pydata/xarray#10571 (not yet merged) fixes a bug in Xarray by removing the caching layer for netCDF files opened from file-like objects with h5netcdf (these should always be kept open, because the caching layer cannot re-create them).
For ArviZ, this change causes immediate errors in calls to `InferenceData.from_netcdf` with the `memory` argument. Running ArviZ's test suite against that PR triggers many test failures, which seem to come from test cases that access local test data via calls like `load_arviz_data("centered_eight")`. The errors most directly manifest as a strange `TypeError` from h5netcdf, raised because the underlying file has been closed:
```
tests/base_tests/test_stats_utils.py ................................... [ 11%]
........................................................................ [ 35%]
........................................................................ [ 59%]
........................................................................ [ 82%]
...........................................F.F..F... [100%]
=================================== FAILURES ===================================
____________________________ test_stats_variance_2d ____________________________

    def test_stats_variance_2d():
        """Test for stats_variance_2d."""
        data_1 = np.random.randn(1000, 1000)
        data_2 = np.random.randn(1000000)
>       school = load_arviz_data("centered_eight").posterior["mu"].values

tests/base_tests/test_stats_utils.py:320:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../xarray/core/dataarray.py:797: in values
    return self.variable.values
../xarray/core/variable.py:536: in values
    return _as_array_or_item(self._data)
../xarray/core/variable.py:316: in _as_array_or_item
    data = np.asarray(data)
../xarray/core/indexing.py:509: in __array__
    return np.asarray(self.get_duck_array(), dtype=dtype, copy=copy)
../xarray/core/indexing.py:843: in get_duck_array
    self._ensure_cached()
../xarray/core/indexing.py:840: in _ensure_cached
    self.array = as_indexable(self.array.get_duck_array())
../xarray/core/indexing.py:797: in get_duck_array
    return self.array.get_duck_array()
../xarray/core/indexing.py:652: in get_duck_array
    array = self.array[self.key]
../xarray/backends/h5netcdf_.py:61: in __getitem__
    return indexing.explicit_indexing_adapter(
../xarray/core/indexing.py:1021: in explicit_indexing_adapter
    result = raw_indexing_method(raw_key.tuple)
../xarray/backends/h5netcdf_.py:68: in _getitem
    return array[key]
../h5netcdf/core.py:533: in __getitem__
    string_info = self._root._h5py.check_string_dtype(self._h5ds.dtype)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Closed h5netcdf.Variable>

    @property
    def _h5ds(self):
        # Always refer to the root file and store not h5py object
        # subclasses:
>       return self._root._h5file[self._h5path]
E       TypeError: 'NoneType' object is not subscriptable

../h5netcdf/core.py:129: TypeError
```
It is likely possible to trigger errors even for netCDF files loaded from disk, but you would need to open enough files (more than 128) to exceed the cache's limit.
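Rather than juggling hundreds of files, the cache limit can be lowered for testing via Xarray's `file_cache_maxsize` option; with a small cache, eviction (and hence premature closing of handles) should be far easier to hit. This is a sketch, not a verified reproduction:

```python
import xarray as xr

# file_cache_maxsize controls xarray's least-recently-used file cache
# (default 128). With a cache of 1, opening a second file evicts and
# closes the handle belonging to the first.
xr.set_options(file_cache_maxsize=1)
```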
Expected behavior
ArviZ should avoid closing files whose data has not already been loaded from disk, e.g., you might try:
```python
data = xr.open_dataset(filename, group=f"{base_group}/{group}", **group_kws)
if rcParams["data.load"] == "eager":
    with data:
        groups[group] = data.load()
else:
    groups[group] = data
```
Note that this will leave files unclosed when not in eager mode. That is not ideal -- you might also make `InferenceData` a context manager and/or support explicitly calling `InferenceData.close()`. On the other hand, ArviZ users likely already rely inadvertently on files being left open (because of the Xarray caching layer).
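A hypothetical sketch of that context-manager idea (the `close`/`__enter__`/`__exit__` methods here are assumptions, not existing ArviZ API, and `DummyDataset` is a stand-in for `xarray.Dataset`):

```python
class InferenceDataSketch:
    """Hypothetical InferenceData-like container that owns its groups'
    file handles and can release them deterministically."""

    def __init__(self, **groups):
        # Mapping of group name -> dataset (xarray.Dataset in real ArviZ).
        self._groups = groups

    def close(self):
        # Close the underlying file handle of every lazily loaded group.
        for ds in self._groups.values():
            ds.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()


class DummyDataset:
    """Minimal stand-in for xarray.Dataset, for demonstration only."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


posterior = DummyDataset()
with InferenceDataSketch(posterior=posterior):
    pass  # lazy use of the data would happen here
print(posterior.closed)  # True: handles released when the block exits
```

Users who want lazy loading would then opt in to explicit lifetime management, while eager mode keeps today's close-immediately behavior.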