initial support for Dask DataFrames in obsm/varm #1880
Conversation
Force-pushed from 8a4761f to 2c3b39e.
Force-pushed from 2c3b39e to 982f882.
The minimum_versions test is failing due to an incompatibility between the old dask version and Python 3.11.9 specifically; not sure what to do here.
Codecov Report ❌ Patch coverage is …

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #1880      +/-   ##
==========================================
- Coverage   86.11%   83.60%   -2.51%     
==========================================
  Files          40       40              
  Lines        6242     6258      +16     
==========================================
- Hits         5375     5232     -143     
- Misses        867     1026     +159     
```
@ilia-kats Furthermore, we have reports from …. I had previously tried to do this, but there were several issues that made things quite unusable.
I like that the PR is basically write-only, but I want to understand more.
So my use case is that I have several AnnData objects that I need to concatenate, and I want to do that as lazily as possible, without allocating memory. What I'm doing is converting everything in the AnnData objects to Dask equivalents and then running …
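For concreteness, a minimal sketch of that pattern; `to_dask` is a hypothetical helper, not code from this PR, and it assumes a dense in-memory `X` and a recent dask where `dd.from_pandas` defaults to one partition:

```python
import anndata as ad
import dask.array as da
import dask.dataframe as dd
import pandas as pd

def to_dask(adata: ad.AnnData) -> ad.AnnData:
    """Hypothetical helper: wrap an AnnData's components in Dask containers."""
    return ad.AnnData(
        X=da.from_array(adata.X, chunks="auto"),  # assumes a dense in-memory X
        obs=adata.obs,
        var=adata.var,
        obsm={
            # storing Dask DataFrames here is what this PR makes legal
            k: dd.from_pandas(v) if isinstance(v, pd.DataFrame) else v
            for k, v in adata.obsm.items()
        },
    )

# merged = ad.concat([to_dask(a) for a in adatas])
# Whether this concat step stays lazy is exactly the question discussed below.
```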
This was one of our stumbling blocks. It required a full pass over the data the last time we checked, which kind of defeats the purpose. The PR I linked to above has full lazy concatenation features.
That is good to know; I'll wait for that to be merged then, I suppose.
So I've looked at …. In the very simplest case, it looks like calling

```python
import dask.dataframe as dd

# test and test2 are pandas DataFrames; judging by the literals in the
# optimized expression below, they have 40 and 10 rows respectively.
datest = dd.from_pandas(test)
datest2 = dd.from_pandas(test2)
concat = dd.concat([datest, datest2], axis=0, ignore_index=True)
concat.shape[0].optimize().pprint()
```

prints

```
Fused(25743):
| FloorDiv: right=1
|   Add:
|     Literal: value=40
|     Add: left=0
|       Literal: value=10
```
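Reading that graph: the optimized expression computes the row count purely from literals (40 + 10) that dask-expr tracked when the frames were created, so no partition data has to be read. A minimal reproduction under that assumption; the concrete `test`/`test2` frames are not shown in the thread, so these are stand-ins:

```python
import dask.dataframe as dd
import pandas as pd

# Stand-ins for the frames above, sized to match the literals in the graph.
test = pd.DataFrame({"x": range(40)})
test2 = pd.DataFrame({"x": range(10)})

datest = dd.from_pandas(test)    # recent dask (dask-expr); npartitions defaults to 1
datest2 = dd.from_pandas(test2)
concat = dd.concat([datest, datest2], axis=0, ignore_index=True)

# The row count folds to Literal nodes, so this evaluates without a data pass.
assert concat.shape[0].compute() == 50
```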
@ilia-kats I assumed you wanted the Dask DataFrames for their lazy-loading capabilities, in which case memory shouldn't be such an issue. Perhaps we can talk offline: [email protected] - I am around all day today.
But why keep the DataFrames around once they are converted?
Also, the code you posted is only applicable to non-extension arrays; pandas extension arrays should be zero-copy: https://github.com/pydata/xarray/blob/dd446d7d9c5f208cedc18b4b02fcf380a5ba7217/xarray/core/dataset.py#L7272-L7277

This includes:

- https://pandas.pydata.org/docs/reference/api/pandas.arrays.NumpyExtensionArray.html (i.e., numpy)
- https://pandas.pydata.org/docs/reference/api/pandas.arrays.ArrowExtensionArray.html (i.e., arrow)
- https://pandas.pydata.org/docs/reference/api/pandas.Categorical.html (categoricals)
- https://github.com/pandas-dev/pandas/blob/v2.2.3/pandas/core/arrays/string_.py#L275-L657 (string arrays)

and likely others.
Here's a quick code snippet:

```python
import xarray as xr
import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
ds = xr.Dataset.from_dataframe(iris)
assert ds["sepal_length"].data is iris["sepal_length"].array.to_numpy()
```
Thanks, I should have read the xarray code more carefully. I guess I'll go with that then.
@ilia-kats No problem, please reach out if you feel you have more needs. Zulip is the best place for longer discussions :)
Actually, turns out xarray also doesn't work:

```python
import anndata as ad
import pandas as pd
import xarray as xr

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
ds = xr.Dataset.from_dataframe(iris)
adata = ad.AnnData(var=pd.DataFrame(index=iris.index))
adata.varm["test"] = ds
```

results in

```pytb
AttributeError                            Traceback (most recent call last)
Cell In[5], line 1
----> 1 adata.varm["test"] = ds

File /data/ilia/anndata/src/anndata/_core/aligned_mapping.py:214, in AlignedActual.__setitem__(self, key, value)
    213 def __setitem__(self, key: str, value: Value):
--> 214     value = self._validate_value(value, key)
    215     self._data[key] = value

File /data/ilia/anndata/src/anndata/_core/aligned_mapping.py:277, in AxisArraysBase._validate_value(self, val, key)
    275     msg = "Index.equals and pd.testing.assert_index_equal disagree"
    276     raise AssertionError(msg)
--> 277 return super()._validate_value(val, key)

File /data/ilia/anndata/src/anndata/_core/aligned_mapping.py:79, in AlignedMappingBase._validate_value(self, val, key)
     72 warn_once(
     73     "Support for Awkward Arrays is currently experimental. "
     74     "Behavior may change in the future. Please report any issues you may encounter!",
     75     ExperimentalFeatureWarning,
     76     # stacklevel=3,
     77 )
     78 for i, axis in enumerate(self.axes):
---> 79     if self.parent.shape[axis] == axis_len(val, i):
     80         continue
     81 right_shape = tuple(self.parent.shape[a] for a in self.axes)

File /usr/lib/python3.11/functools.py:909, in singledispatch.<locals>.wrapper(*args, **kw)
    905 if not args:
    906     raise TypeError(f'{funcname} requires at least '
    907                     '1 positional argument')
--> 909 return dispatch(args[0].__class__)(*args, **kw)

File /data/ilia/anndata/src/anndata/utils.py:115, in axis_len(x, axis)
    108 @singledispatch
    109 def axis_len(x, axis: Literal[0, 1]) -> int | None:
    110     """\
    111     Return the size of an array in dimension `axis`.
    112
    113     Returns None if `x` is an awkward array with variable length in the requested dimension.
    114     """
--> 115     return x.shape[axis]

File /data/ilia/envs/famo/lib/python3.11/site-packages/xarray/core/common.py:305, in AttrAccessMixin.__getattr__(self, name)
    303 with suppress(KeyError):
    304     return source[name]
--> 305 raise AttributeError(
    306     f"{type(self).__name__!r} object has no attribute {name!r}"
    307 )

AttributeError: 'Dataset' object has no attribute 'shape'
```

and the same for

```python
adata = ad.AnnData(var=pd.DataFrame(index=iris.index), varm={"test": ds})
```
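The failure is `anndata.utils.axis_len` falling through to its generic `x.shape[axis]` implementation, which `xr.Dataset` doesn't provide. Purely as an illustration of the conversion/validation work being discussed (this is not code from the PR), one could register a `Dataset` overload on that singledispatch function that reads the named dimension sizes instead:

```python
import xarray as xr
from anndata.utils import axis_len  # the singledispatch function in the traceback

@axis_len.register(xr.Dataset)
def _(x: xr.Dataset, axis: int) -> int | None:
    # Datasets have named dimensions, not a positional shape; assume the
    # dimension order lines up with the AnnData axes (a real implementation
    # would need an explicit mapping).
    dims = list(x.sizes)
    return x.sizes[dims[axis]] if axis < len(dims) else None
```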
Ah, I think I'm supposed to use …
That makes sense @ilia-kats - if you would like to contribute a feature that allows for the conversion (i.e., setting on an …
Sure, I can give it a shot. However, I think that this PR (simply allowing Dask DataFrames) is less invasive than the conversion would be.
This PR adds support for Dask DataFrames in `.obsm`/`.varm`.
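For illustration, a minimal usage sketch of what this enables, assuming a recent dask where `dd.from_pandas` defaults to a single partition; that the existing DataFrame index-matching validation carries over to Dask DataFrames is an assumption here:

```python
import anndata as ad
import dask.dataframe as dd
import numpy as np
import pandas as pd

adata = ad.AnnData(X=np.zeros((40, 5)))

# An obs-aligned DataFrame; its index matches adata.obs_names.
embedding = pd.DataFrame(
    np.random.default_rng(0).normal(size=(40, 2)),
    columns=["PC1", "PC2"],
    index=adata.obs_names,
)

# With this PR, a Dask DataFrame is accepted as an aligned-mapping value.
adata.obsm["embedding"] = dd.from_pandas(embedding)
```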