[FEA] Dask Array Support for rsc.pp.scrublet: A Straightforward implementation #388

@MPebworthEpana

Right now, rsc.pp.scrublet doesn't support Dask arrays, but there's a relatively straightforward path to adding that support (at least, as far as I know).

Background:

  1. Scrublet only really needs to run within a single sample or batch. This is passed to the function as 'batch_key'.
  2. Batches are typically under 100,000 cells, and samples under 10,000, so each one can fit within a typical GPU's memory.

Implementation concept:

  1. Check whether the AnnData object holds a Dask array. If so, require that a batch_key be provided.
  2. Rechunk the Dask array by batch_key, giving one Dask array per batch.
  3. Run scrublet in memory on each GPU (.compute_chunk_sizes())
  4. Save results in obs as normal.
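
To make the concept concrete, here is a rough sketch of the per-batch loop in plain Python. It is not the rapids-singlecell implementation: `scrublet_per_batch` and `score_fn` are hypothetical names, and `score_fn` stands in for whatever in-memory Scrublet call would actually be used. It assumes the matrix supports row indexing with an integer array and, if lazy, exposes `.compute()`, as Dask arrays do.

```python
import numpy as np


def scrublet_per_batch(X, batch_labels, score_fn):
    """Score each batch separately, materialising one batch at a time.

    X            : 2-D cells x genes matrix (NumPy array or Dask array).
    batch_labels : per-cell batch labels, e.g. adata.obs[batch_key].
    score_fn     : hypothetical stand-in for the in-memory Scrublet step;
                   takes one batch's matrix, returns one score per cell.
    """
    is_lazy = hasattr(X, "compute")

    # Step 1: a lazy (Dask-like) matrix is only accepted with batch labels.
    if batch_labels is None:
        if is_lazy:
            raise ValueError("a batch_key is required for Dask-backed data")
        batch_labels = np.zeros(X.shape[0], dtype=int)

    batch_labels = np.asarray(batch_labels)
    if batch_labels.shape[0] != X.shape[0]:
        raise ValueError("need exactly one batch label per cell")

    scores = np.full(X.shape[0], np.nan, dtype=float)
    for batch in np.unique(batch_labels):
        # Step 2: select the rows that belong to this batch.
        rows = np.flatnonzero(batch_labels == batch)
        block = X[rows]

        # Step 3: bring only this batch into memory before scoring it.
        if hasattr(block, "compute"):
            block = block.compute()

        # Step 4: write results back in the original cell order,
        # ready to be stored as a column in obs.
        scores[rows] = np.asarray(score_fn(block), dtype=float)
    return scores
```

In a real implementation `score_fn` would presumably wrap the existing in-memory scrublet code path, and the batches could be dispatched to different GPUs rather than processed in a loop. That part is left out here because it depends on how the Dask cluster is set up.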

Labels: enhancement (New feature or request)