Conversation

ilia-kats
Contributor

@ilia-kats ilia-kats commented Feb 26, 2025

This PR adds support for Dask DataFrames in .obsm/.varm.

  • Closes #
  • Tests added
  • Release note added (or unnecessary)

@ilia-kats ilia-kats force-pushed the dask_dataframe branch 3 times, most recently from 8a4761f to 2c3b39e Compare February 26, 2025 16:44
@ilia-kats ilia-kats marked this pull request as draft February 26, 2025 16:54
@ilia-kats
Contributor Author

The minimum_versions test is failing due to an incompatibility between the old dask version and Python 3.11.9 specifically; I'm not sure what to do here.


codecov bot commented Feb 26, 2025

Codecov Report

❌ Patch coverage is 81.81818% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.60%. Comparing base (b2c7a21) to head (982f882).
⚠️ Report is 92 commits behind head on main.

| Files with missing lines       | Patch % | Lines        |
|--------------------------------|---------|--------------|
| src/anndata/compat/__init__.py | 20.00%  | 4 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1880      +/-   ##
==========================================
- Coverage   86.11%   83.60%   -2.51%     
==========================================
  Files          40       40              
  Lines        6242     6258      +16     
==========================================
- Hits         5375     5232     -143     
- Misses        867     1026     +159     
| Files with missing lines             | Coverage Δ                  |
|--------------------------------------|-----------------------------|
| src/anndata/_core/aligned_mapping.py | 93.41% <100.00%> (ø)        |
| src/anndata/_core/file_backing.py    | 88.79% <100.00%> (+0.09%) ⬆️ |
| src/anndata/_core/storage.py         | 100.00% <100.00%> (ø)       |
| src/anndata/_io/specs/methods.py     | 88.50% <100.00%> (-0.27%) ⬇️ |
| src/anndata/tests/helpers.py         | 81.44% <100.00%> (-11.12%) ⬇️ |
| src/anndata/utils.py                 | 83.62% <100.00%> (-0.68%) ⬇️ |
| src/anndata/compat/__init__.py       | 77.39% <20.00%> (-5.72%) ⬇️  |

... and 4 files with indirect coverage changes


@ilia-kats ilia-kats marked this pull request as ready for review February 26, 2025 17:39
@ilan-gold
Contributor

ilan-gold commented Feb 27, 2025

@ilia-kats What is the use case here that #1247 would not address? I see this is about just sticking the object onto an AnnData, not reading from disk.

Furthermore, we have reports from SpatialData that dask dataframes are rather difficult to use, as was our experience motivating the above PR to use xarray instead.

I had previously tried to do this, but there were several issues that made things quite unusable.

@ilan-gold
Contributor

ilan-gold commented Feb 27, 2025

I like that the PR is basically write-only, but I want to understand more.

@ilia-kats
Contributor Author

My use case is that I have several AnnData objects that I need to concatenate, and I want to do that as lazily as possible, without allocating memory. So I'm converting everything in the AnnData objects to Dask equivalents and then running ad.concat. I can't use AnnCollection because a) I need an outer join on vars, and b) AnnCollectionView fully materializes X upon access, while I typically don't need the full matrix, just some dimensionality-reducing calculations on it (e.g. mean or variance along an axis). The only thing preventing me from sticking Dask DataFrames into obsm/varm is that one needs to call compute on the shape; everything else (storage etc.) I added for completeness' sake. Admittedly, I have not yet tested actually concatenating AnnDatas with Dask DataFrames in obsm/varm.

@ilan-gold
Contributor

ilan-gold commented Feb 28, 2025

call compute on the shape

This was one of our stumbling blocks. It required a full pass over the data the last time we checked which kind of defeats the purpose. The above PR I linked to has full lazy concatenation features.

@ilia-kats
Contributor Author

That is good to know, I'll wait for that to be merged then, I suppose.

@ilia-kats
Contributor Author

So I've looked at read_lazy and xarray, and I don't think that fits my use case. I need to handle arbitrary, probably in-memory, AnnData objects while conserving as much memory as possible. If I'm reading the xarray code correctly, Dataset.from_dataframe copies all data passed to it. Using Dask DataFrames directly would, as far as I understand, completely avoid copying, possibly at the cost of higher runtime due to having to call compute.

In the very simplest case, it looks like calling compute on the shape does not actually do any passes through the data, but simply adds the shapes of the individual data frames:

import dask.dataframe as dd

# test and test2 are ordinary in-memory pandas DataFrames
datest = dd.from_pandas(test)
datest2 = dd.from_pandas(test2)

concat = dd.concat([datest, datest2], axis=0, ignore_index=True)
concat.shape[0].optimize().pprint()
Fused(25743):
| FloorDiv: right=1
|   Add:
      Literal: value=40
|     Add: left=0
        Literal: value=10

@ilan-gold
Contributor

@ilia-kats I assumed you wanted the Dask DataFrames for their lazy-loading capabilities, in which case memory shouldn't be such an issue. Perhaps we can talk offline: [email protected]. I am around all day today.

@ilan-gold
Contributor

If I'm reading the xarray code correctly, Dataset.from_dataframe copies all data passed to it

But why keep the dataframes around once they are converted?

@ilan-gold
Contributor

ilan-gold commented Apr 9, 2025

Here's a quick code-snippet:

import xarray as xr
import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
ds = xr.Dataset.from_dataframe(iris)
assert ds["sepal_length"].data is iris["sepal_length"].array.to_numpy()
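An offline variant of the same check (a local DataFrame stands in for the CSV; np.shares_memory is a slightly weaker but version-robust way to verify that no copy is made for numpy-backed columns):

```python
# Offline variant of the snippet above: Dataset.from_dataframe does
# not copy numpy-backed columns. The local DataFrame is a stand-in
# for the iris CSV, which requires a network fetch.
import numpy as np
import pandas as pd
import xarray as xr

df = pd.DataFrame({"sepal_length": np.arange(5.0)})
ds = xr.Dataset.from_dataframe(df)

# The Dataset variable shares its buffer with the original column,
# i.e. from_dataframe did not copy the numpy-backed data.
assert np.shares_memory(ds["sepal_length"].data, df["sepal_length"].to_numpy())
```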

@ilia-kats
Contributor Author

Thanks, I should have read the xarray code more carefully. I guess I'll go with that then.

@ilan-gold
Contributor

@ilia-kats No problem, please reach out if you feel you have more needs. Zulip is the best place for longer discussions :)

@ilia-kats
Contributor Author

Actually, it turns out xarray also doesn't work:

import anndata as ad
import pandas as pd
import xarray as xr

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
ds = xr.Dataset.from_dataframe(iris)
adata = ad.AnnData(var=pd.DataFrame(index=iris.index))
adata.varm["test"] = ds

results in

AttributeError                            Traceback (most recent call last)
Cell In[5], line 1
----> 1 adata.varm["test"] = ds

File /data/ilia/anndata/src/anndata/_core/aligned_mapping.py:214, in AlignedActual.__setitem__(self, key, value)
    213 def __setitem__(self, key: str, value: Value):
--> 214     value = self._validate_value(value, key)
    215     self._data[key] = value

File /data/ilia/anndata/src/anndata/_core/aligned_mapping.py:277, in AxisArraysBase._validate_value(self, val, key)
    275             msg = "Index.equals and pd.testing.assert_index_equal disagree"
    276             raise AssertionError(msg)
--> 277 return super()._validate_value(val, key)

File /data/ilia/anndata/src/anndata/_core/aligned_mapping.py:79, in AlignedMappingBase._validate_value(self, val, key)
     72     warn_once(
     73         "Support for Awkward Arrays is currently experimental. "
     74         "Behavior may change in the future. Please report any issues you may encounter!",
     75         ExperimentalFeatureWarning,
     76         # stacklevel=3,
     77     )
     78 for i, axis in enumerate(self.axes):
---> 79     if self.parent.shape[axis] == axis_len(val, i):
     80         continue
     81     right_shape = tuple(self.parent.shape[a] for a in self.axes)

File /usr/lib/python3.11/functools.py:909, in singledispatch.<locals>.wrapper(*args, **kw)
    905 if not args:
    906     raise TypeError(f'{funcname} requires at least '
    907                     '1 positional argument')
--> 909 return dispatch(args[0].__class__)(*args, **kw)

File /data/ilia/anndata/src/anndata/utils.py:115, in axis_len(x, axis)
    108 @singledispatch
    109 def axis_len(x, axis: Literal[0, 1]) -> int | None:
    110     """\
    111     Return the size of an array in dimension `axis`.
    112 
    113     Returns None if `x` is an awkward array with variable length in the requested dimension.
    114     """
--> 115     return x.shape[axis]

File /data/ilia/envs/famo/lib/python3.11/site-packages/xarray/core/common.py:305, in AttrAccessMixin.__getattr__(self, name)
    303         with suppress(KeyError):
    304             return source[name]
--> 305 raise AttributeError(
    306     f"{type(self).__name__!r} object has no attribute {name!r}"
    307 )

AttributeError: 'Dataset' object has no attribute 'shape'

and the same for

adata = ad.AnnData(var=pd.DataFrame(index=iris.index), varm={"test": ds})
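The traceback bottoms out in anndata.utils.axis_len, whose singledispatch fallback assumes x.shape exists, which xr.Dataset does not provide. A hypothetical overload (not part of anndata; the axis-1 behavior is an assumption for this sketch) illustrates what support would have to define:

```python
# Hedged sketch of the root cause of the traceback above: anndata's
# axis_len singledispatch falls back to x.shape, which xr.Dataset
# lacks. A hypothetical overload (NOT part of anndata) could look
# like this:
from functools import singledispatch

import pandas as pd
import xarray as xr


@singledispatch
def axis_len(x, axis):
    # Fallback mirroring anndata.utils.axis_len: assume x has .shape.
    return x.shape[axis]


@axis_len.register
def _(x: xr.Dataset, axis: int):
    if axis == 0:
        # A Dataset built from a single-index DataFrame has exactly
        # one dimension; its size is the row count.
        (n_rows,) = x.sizes.values()
        return n_rows
    # Treat data variables as columns (an assumption for this sketch).
    return len(x.data_vars)


df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
ds = xr.Dataset.from_dataframe(df)
print(axis_len(ds, 0), axis_len(ds, 1))  # 3 2
```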

@ilia-kats
Contributor Author

Ah, I think I'm supposed to use anndata.experimental.backed._xarray.Dataset2D as a wrapper. However, that is not part of the public API.

@ilan-gold
Contributor

That makes sense @ilia-kats. If you would like to contribute a feature that allows for the conversion (i.e., setting an element that is a Dataset on an AnnData object), we would take that. It shouldn't be that hard.

@ilia-kats
Contributor Author

Sure, I can give it a shot. However, I think that this PR (simply allowing Dask DataFrames) is less invasive than the conversion would be.
