Conversation

ilia-kats
Contributor

@ilia-kats ilia-kats commented Feb 26, 2025

This PR adds support for Dask DataFrames in .obsm/.varm.

  • Closes #
  • Tests added
  • Release note added (or unnecessary)

@ilia-kats ilia-kats force-pushed the dask_dataframe branch 3 times, most recently from 8a4761f to 2c3b39e Compare February 26, 2025 16:44
@ilia-kats ilia-kats marked this pull request as draft February 26, 2025 16:54
@ilia-kats
Contributor Author

The minimum_versions test is failing due to an incompatibility between the old dask version and Python 3.11.9 specifically; I'm not sure what to do here.


codecov bot commented Feb 26, 2025

Codecov Report

❌ Patch coverage is 81.81818% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.60%. Comparing base (b2c7a21) to head (982f882).
⚠️ Report is 92 commits behind head on main.

| Files with missing lines       | Patch % | Lines        |
|--------------------------------|---------|--------------|
| src/anndata/compat/__init__.py | 20.00%  | 4 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1880      +/-   ##
==========================================
- Coverage   86.11%   83.60%   -2.51%     
==========================================
  Files          40       40              
  Lines        6242     6258      +16     
==========================================
- Hits         5375     5232     -143     
- Misses        867     1026     +159     
| Files with missing lines             | Coverage Δ                  |
|--------------------------------------|-----------------------------|
| src/anndata/_core/aligned_mapping.py | 93.41% <100.00%> (ø)        |
| src/anndata/_core/file_backing.py    | 88.79% <100.00%> (+0.09%) ⬆️ |
| src/anndata/_core/storage.py         | 100.00% <100.00%> (ø)       |
| src/anndata/_io/specs/methods.py     | 88.50% <100.00%> (-0.27%) ⬇️ |
| src/anndata/tests/helpers.py         | 81.44% <100.00%> (-11.12%) ⬇️ |
| src/anndata/utils.py                 | 83.62% <100.00%> (-0.68%) ⬇️ |
| src/anndata/compat/__init__.py       | 77.39% <20.00%> (-5.72%) ⬇️  |

... and 4 files with indirect coverage changes


@ilia-kats ilia-kats marked this pull request as ready for review February 26, 2025 17:39
@ilan-gold
Contributor

ilan-gold commented Feb 27, 2025

@ilia-kats What is the use case here that #1247 would not address? I see this is about just sticking the object onto an AnnData, not reading from disk.

Furthermore, we have reports from SpatialData that dask dataframes are rather difficult to use, as was our experience motivating the above PR to use xarray instead.

I had previously tried to do this, but there were several issues that made things quite unusable.

@ilan-gold
Contributor

ilan-gold commented Feb 27, 2025

I like that the PR is basically write-only, but I want to understand more.

@ilia-kats
Contributor Author

My use case is that I have several AnnData objects that I need to concatenate, and I want to do that as lazily as possible, without allocating memory. So I'm converting everything in the AnnData objects to Dask equivalents and then running ad.concat. I can't use AnnCollection because a) I need an outer join on vars, and b) AnnCollectionView fully materializes X upon access, while I typically don't need the full matrix, just some dimensionality-reducing calculations on it (e.g. mean or variance along an axis). The only thing preventing me from sticking Dask DataFrames into obsm/varm is that one needs to call compute on the shape; everything else (storage etc.) I added for completeness' sake. Admittedly, I have not yet tested actually concatenating AnnDatas with Dask DataFrames in obsm/varm.

@ilan-gold
Contributor

ilan-gold commented Feb 28, 2025

call compute on the shape

This was one of our stumbling blocks. It required a full pass over the data the last time we checked which kind of defeats the purpose. The above PR I linked to has full lazy concatenation features.

@ilia-kats
Contributor Author

That is good to know, I'll wait for that to be merged then, I suppose.

@ilia-kats
Contributor Author

So I've looked at read_lazy and xarray, and I don't think that fits my use case. I need to handle arbitrary, probably in-memory, AnnData objects while conserving as much memory as possible. If I'm reading the xarray code correctly, Dataset.from_dataframe copies all data passed to it. Using Dask DataFrames directly would, as far as I understand, completely avoid copying, possibly at the cost of higher runtime due to having to call compute.

In the very simplest case, it looks like calling compute on the shape does not actually do any passes through the data, but simply adds the shapes of the individual data frames:

import dask.dataframe as dd

# test and test2 are ordinary in-memory pandas DataFrames
datest = dd.from_pandas(test)
datest2 = dd.from_pandas(test2)

concat = dd.concat([datest, datest2], axis=0, ignore_index=True)
concat.shape[0].optimize().pprint()
Fused(25743):
| FloorDiv: right=1
|   Add:
      Literal: value=40
|     Add: left=0
        Literal: value=10

@ilan-gold
Contributor

@ilia-kats I assumed you wanted the Dask DataFrames for their lazy-loading capabilities, in which case memory shouldn't be such an issue. Perhaps we can talk offline: [email protected]. I am around all day today.

@ilan-gold
Contributor

If I'm reading the xarray code correctly, Dataset.from_dataframe copies all data passed to it

But why keep the dataframes around once they are converted?

@ilan-gold
Contributor

ilan-gold commented Apr 9, 2025

Here's a quick code-snippet:

import xarray as xr
import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
ds = xr.Dataset.from_dataframe(iris)
assert ds["sepal_length"].data is iris["sepal_length"].array.to_numpy()
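An offline variant of the same check (a local DataFrame stands in for the CSV; np.shares_memory is a slightly weaker but version-robust way to verify that no copy is made for numpy-backed columns):

```python
# Offline variant of the snippet above: Dataset.from_dataframe does
# not copy numpy-backed columns. The local DataFrame is a stand-in
# for the iris CSV, which requires a network fetch.
import numpy as np
import pandas as pd
import xarray as xr

df = pd.DataFrame({"sepal_length": np.arange(5.0)})
ds = xr.Dataset.from_dataframe(df)

# The Dataset variable shares its buffer with the original column,
# i.e. from_dataframe did not copy the numpy-backed data.
assert np.shares_memory(ds["sepal_length"].data, df["sepal_length"].to_numpy())
```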

@ilia-kats
Contributor Author

Thanks, I should have read the xarray code more carefully. I guess I'll go with that then.

@ilan-gold
Contributor

@ilia-kats No problem, please reach out if you feel you have more needs. Zulip is the best place for longer discussions :)

@ilia-kats
Contributor Author

Actually, it turns out xarray also doesn't work:

import anndata as ad
import pandas as pd
import xarray as xr

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
ds = xr.Dataset.from_dataframe(iris)
adata = ad.AnnData(var=pd.DataFrame(index=iris.index))
adata.varm["test"] = ds

results in

AttributeError                            Traceback (most recent call last)
Cell In[5], line 1
----> 1 adata.varm["test"] = ds

File /data/ilia/anndata/src/anndata/_core/aligned_mapping.py:214, in AlignedActual.__setitem__(self, key, value)
    213 def __setitem__(self, key: str, value: Value):
--> 214     value = self._validate_value(value, key)
    215     self._data[key] = value

File /data/ilia/anndata/src/anndata/_core/aligned_mapping.py:277, in AxisArraysBase._validate_value(self, val, key)
    275             msg = "Index.equals and pd.testing.assert_index_equal disagree"
    276             raise AssertionError(msg)
--> 277 return super()._validate_value(val, key)

File /data/ilia/anndata/src/anndata/_core/aligned_mapping.py:79, in AlignedMappingBase._validate_value(self, val, key)
     72     warn_once(
     73         "Support for Awkward Arrays is currently experimental. "
     74         "Behavior may change in the future. Please report any issues you may encounter!",
     75         ExperimentalFeatureWarning,
     76         # stacklevel=3,
     77     )
     78 for i, axis in enumerate(self.axes):
---> 79     if self.parent.shape[axis] == axis_len(val, i):
     80         continue
     81     right_shape = tuple(self.parent.shape[a] for a in self.axes)

File /usr/lib/python3.11/functools.py:909, in singledispatch.<locals>.wrapper(*args, **kw)
    905 if not args:
    906     raise TypeError(f'{funcname} requires at least '
    907                     '1 positional argument')
--> 909 return dispatch(args[0].__class__)(*args, **kw)

File /data/ilia/anndata/src/anndata/utils.py:115, in axis_len(x, axis)
    108 @singledispatch
    109 def axis_len(x, axis: Literal[0, 1]) -> int | None:
    110     """\
    111     Return the size of an array in dimension `axis`.
    112 
    113     Returns None if `x` is an awkward array with variable length in the requested dimension.
    114     """
--> 115     return x.shape[axis]

File /data/ilia/envs/famo/lib/python3.11/site-packages/xarray/core/common.py:305, in AttrAccessMixin.__getattr__(self, name)
    303         with suppress(KeyError):
    304             return source[name]
--> 305 raise AttributeError(
    306     f"{type(self).__name__!r} object has no attribute {name!r}"
    307 )

AttributeError: 'Dataset' object has no attribute 'shape'

and the same for

adata = ad.AnnData(var=pd.DataFrame(index=iris.index), varm={"test": ds})
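The traceback bottoms out in anndata.utils.axis_len, whose singledispatch fallback assumes x.shape exists, which xr.Dataset does not provide. A hypothetical overload (not part of anndata; the axis-1 behavior is an assumption for this sketch) illustrates what support would have to define:

```python
# Hedged sketch of the root cause of the traceback above: anndata's
# axis_len singledispatch falls back to x.shape, which xr.Dataset
# lacks. A hypothetical overload (NOT part of anndata) could look
# like this:
from functools import singledispatch

import pandas as pd
import xarray as xr


@singledispatch
def axis_len(x, axis):
    # Fallback mirroring anndata.utils.axis_len: assume x has .shape.
    return x.shape[axis]


@axis_len.register
def _(x: xr.Dataset, axis: int):
    if axis == 0:
        # A Dataset built from a single-index DataFrame has exactly
        # one dimension; its size is the row count.
        (n_rows,) = x.sizes.values()
        return n_rows
    # Treat data variables as columns (an assumption for this sketch).
    return len(x.data_vars)


df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
ds = xr.Dataset.from_dataframe(df)
print(axis_len(ds, 0), axis_len(ds, 1))  # 3 2
```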

@ilia-kats
Contributor Author

Ah, I think I'm supposed to use anndata.experimental.backed._xarray.Dataset2D as a wrapper. However, that is not part of the public API.

@ilan-gold
Contributor

That makes sense @ilia-kats. If you would like to contribute a feature that allows for the conversion (i.e., setting an element that is a Dataset on an AnnData object), we would take that. It shouldn't be that hard.

@ilia-kats
Contributor Author

Sure, I can give it a shot. However, I think that this PR (simply allowing Dask DataFrames) is less invasive than the conversion would be.
