
Conversation


@ilan-gold ilan-gold marked this pull request as draft September 16, 2025 14:59

codecov bot commented Sep 16, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.44%. Comparing base (52344db) to head (39cb3c3).
⚠️ Report is 3 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2121      +/-   ##
==========================================
- Coverage   84.58%   84.44%   -0.14%     
==========================================
  Files          46       46              
  Lines        7105     7106       +1     
==========================================
- Hits         6010     6001       -9     
- Misses       1095     1105      +10     
| Files with missing lines | Coverage Δ |
|---|---|
| src/anndata/_core/merge.py | 85.11% <100.00%> (+0.02%) ⬆️ |

... and 2 files with indirect coverage changes

@ilan-gold ilan-gold added this to the 0.12.2 milestone Sep 16, 2025

scverse-benchmark bot commented Sep 16, 2025

Benchmark changes

| Change | Before [52344db] | After [39cb3c3] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| - | 647±2ms | 239±3ms | 0.37 | dataset2d.Dataset2D.time_concat('h5ad', (-1,), 'cat') |
| - | 651±3ms | 209±0.7ms | 0.32 | dataset2d.Dataset2D.time_concat('h5ad', None, 'cat') |
| - | 988±100ms | 499±10ms | 0.51 | dataset2d.Dataset2D.time_concat('zarr', (-1,), 'cat') |
| - | 1.11±0.01s | 566±20ms | 0.51 | dataset2d.Dataset2D.time_concat('zarr', None, 'cat') |
| - | 2.21±0.1ms | 1.96±0.01ms | 0.89 | dataset2d.Dataset2D.time_full_to_memory('h5ad', (-1,), 'cat') |
| - | 4.46±0.04ms | 3.43±0.2ms | 0.77 | dataset2d.Dataset2D.time_full_to_memory('h5ad', (-1,), 'numeric') |
| + | 5.47±0.09ms | 6.56±0.04ms | 1.2 | dataset2d.Dataset2D.time_getitem_bool_mask('h5ad', (-1,), 'string-array') |
| - | 16.4±0.3ms | 14.2±0.5ms | 0.87 | dataset2d.Dataset2D.time_getitem_slice('h5ad', None, 'numeric') |
| - | 15.1±0.1ms | 13.4±0.07ms | 0.89 | dataset2d.Dataset2D.time_read_lazy_default('h5ad', (-1,), 'numeric') |
| - | 6.0 | 5.0 | 0.83 | readwrite.H5ADWriteSuite.track_peakmem_write_compressed('pbmc3k') |

Comparison: https://github.com/scverse/anndata/compare/52344dbb40037704f15d79bdd9329f31ed75074d..39cb3c3c7c76876342ff3c206f771f96e79a9987
Last changed: Tue, 28 Oct 2025 16:44:52 +0000

More details: https://github.com/scverse/anndata/pull/2121/checks?check_run_id=53884619945

@ilan-gold ilan-gold marked this pull request as ready for review October 1, 2025 15:53
@ilan-gold ilan-gold requested a review from flying-sheep October 2, 2025 14:00
@ilan-gold ilan-gold force-pushed the ig/accelerate_map_blocks branch from 017b829 to c29d7f1 Compare October 19, 2025 10:21
@ilan-gold ilan-gold force-pushed the ig/accelerate_map_blocks branch from c29d7f1 to 3bc4ee2 Compare October 19, 2025 10:27
@ilan-gold ilan-gold modified the milestones: 0.12.4, 0.12.5 Oct 27, 2025
res.compute()


class SparseCSRDask:
@ilan-gold ilan-gold commented Oct 28, 2025

I considered using `name` for the sparse block mapping, but it had no appreciable effect: https://github.com/scverse/anndata/runs/51718340360 shows the result, and

da_mtx = da.map_blocks(
    make_chunk,
    dtype=dtype,
    chunks=chunk_layout,
    meta=memory_format((0, 0), dtype=dtype),
    name=f"{uuid.uuid4()}/{path_or_sparse_dataset}/{elem_name}-{dtype}",
)
shows that the commit the benchmark was run on did contain the `name` parameter. I've left the benchmark in anyway.
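
For anyone skimming: a minimal sketch (not the anndata implementation; `load_block` and the shapes are made up) of what the `name` parameter does in `da.map_blocks`. When `name` is supplied, dask uses it directly as the output key prefix instead of deriving one by tokenizing the function and its arguments, which is also why the snippet above mixes in `uuid.uuid4()`: an explicit name has to be unique per graph.

```python
import uuid

import dask.array as da


def load_block(block):
    # stand-in for a real chunk loader; here it just passes the data through
    return block


x = da.random.random((4_000, 4_000), chunks=(1_000, 1_000))

# Default: dask tokenizes `load_block` and its arguments to derive the layer name.
mapped_default = da.map_blocks(load_block, x, dtype=x.dtype)

# Explicit name: the key prefix is given up front, so no tokenization is needed to
# build it. The uuid keeps the name unique, mirroring the snippet above.
mapped_named = da.map_blocks(
    load_block,
    x,
    dtype=x.dtype,
    name=f"load_block-{uuid.uuid4()}",
)
```

Whether this helps depends on how expensive tokenizing the inputs actually is; per the run linked above, for the sparse block mapping it made no measurable difference.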

@ilan-gold ilan-gold requested review from flying-sheep and removed request for flying-sheep October 28, 2025 17:02

ilan-gold commented Oct 28, 2025

Ok @flying-sheep I know you reviewed, but the PR is grossly simplified now and hopefully the becnhmark results make some sense i.e., the big change is concat
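
For reviewers without the benchmark suite handy, the scenario those `time_concat` numbers cover is roughly this (a sketch, not the benchmark code; the store paths are hypothetical and it assumes `anndata.experimental.read_lazy` from the 0.12 series):

```python
import anndata as ad

# Hypothetical on-disk stores; the benchmark generates synthetic h5ad/zarr data.
stores = ["part0.zarr", "part1.zarr"]
adatas = [ad.experimental.read_lazy(store) for store in stores]

# Concatenating lazily backed objects is where tokenization cost used to dominate;
# the table above shows this path getting roughly 2-3x faster.
combined = ad.concat(adatas)
```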

@ilan-gold ilan-gold merged commit 41bc3b5 into main Oct 31, 2025
25 checks passed
@ilan-gold ilan-gold deleted the ig/accelerate_map_blocks branch October 31, 2025 15:30
meeseeksmachine pushed a commit to meeseeksmachine/anndata that referenced this pull request Oct 31, 2025
ilan-gold added a commit that referenced this pull request Nov 2, 2025
…to bypass tokenization) (#2191)

Co-authored-by: Ilan Gold <[email protected]>


Development

Successfully merging this pull request may close these issues.

ad.concat is slow on lazy data on account of tokenize
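
On the linked issue itself, a rough illustration (not anndata code) of why tokenization matters: `dask.base.tokenize` hashes any concrete data it is handed, so building a graph that repeatedly tokenizes large in-memory pieces scales with their size. The objects tokenized during a lazy `ad.concat` are different, but the cost profile is the same kind of thing, which is what passing explicit names avoids.

```python
import numpy as np
from dask.base import tokenize

small = np.ones(1_000)
large = np.ones(10_000_000)

# tokenize hashes the underlying buffers, so the second call does ~10,000x the work.
token_small = tokenize(small)
token_large = tokenize(large)
```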
