Add batch_converter hook + multiprocessing support to experimental.pytorch.AnnLoader #2135
Open: ronamit wants to merge 15 commits into scverse:main from ronamit:feat/worker-safe-batch-converter
- Add batch_converter parameter for advanced batch-level post-processing
- Enable multiprocessing (num_workers>0) via AnnCollectionView pickling
- Implement worker-side batch conversion for true parallelism
- Add comprehensive tests for both single and multi-threaded modes
- Include helper batch_dict_converter for common dict format
- All tests pass, pre-commit hooks clean, backward compatible
Solves two major AnnLoader limitations in unified implementation:
1. Batch-level transformation (vs element-wise convert)
2. Multiprocessing support (was broken due to unpicklable AnnCollectionView)
Enables production ML workflows with PyTorch Lightning integration, data augmentation, balanced sampling, and parallel data loading.

…pport
- Fix AnnLoader docstring to remove incorrect multiprocessing limitation
- Add batch_dict_converter to API documentation
- Add release notes fragment for PR scverse#2135
- Document new batch_converter parameter and multiprocessing capabilities
The batch_converter parameter now works seamlessly with both single-threaded and multi-threaded data loading, enabling faster PyTorch training workflows.

- Move tests from src/anndata/tests/pytorch/ to tests/pytorch/
- Follow standard anndata test organization pattern
- Add __init__.py to make pytorch test package discoverable
- Tests for batch_converter parameter and multiprocessing support

- Add conditional torch imports using find_spec() pattern in converters.py
- Make batch_dict_converter import conditional with helpful error message
- Add pytest.importorskip('torch') to PyTorch test files
- Fix linting warnings by using += instead of .append() for __all__
This resolves CI test collection errors when torch is not available while maintaining full functionality when torch is installed.

The CI tests were failing due to DeprecationWarning: 'oneOf' deprecated - use 'one_of' warnings from pyparsing used by matplotlib. This warning is triggered when scanpy imports matplotlib during test execution. Added warning filters to pytest configuration to ignore this specific deprecation warning from the dependency chain, allowing tests to pass while preserving warnings from anndata's own code.
Fixes the following failing tests:
- test_scanpy_pbmc68k[zarr3/zarr2]
- test_scanpy_krumsiek11[zarr3/zarr2]
- test_read_partial_adata[zarr2/zarr3]

After fixing the 'oneOf' deprecation warnings, CI revealed another pyparsing deprecation warning: 'parseString' deprecated - use 'parse_string'. This is also coming from matplotlib's font configuration parsing. Added additional warning filter to suppress this specific deprecation warning from the dependency chain, ensuring all pyparsing-related deprecation warnings from matplotlib are properly handled in CI.
for more information, see https://pre-commit.ci
…/ronamit/anndata into feat/worker-safe-batch-converter
Add `batch_converter` hook + multiprocessing support to `experimental.pytorch.AnnLoader`
This PR solves two major `AnnLoader` limitations:
- a `batch_converter` parameter for advanced post-processing
- `num_workers > 0` (previously crashed due to unpicklable `AnnCollectionView`)

Motivation
`AnnLoader` currently offers only element-wise converters via the `convert["X"]` mapping, which is insufficient for:
- returning a `dict` instead of an `AnnCollectionView`
- batch-level processing that involves both `.X` and `.obs`
Additionally, `num_workers > 0` has never worked, preventing parallel data loading in production workflows.

Implementation
1. `batch_converter` parameter
Optional callable applied to each batch before it is returned to the user. Fully backward-compatible.
2. Multiprocessing support
- `__getstate__`/`__setstate__` hooks enable `AnnCollectionView` pickling across worker processes
- worker-side batch conversion via `collate_fn`
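The pickling hooks mentioned above follow a general Python pattern. The sketch below is not the PR's code: `LazyView` is a hypothetical stand-in class, and a `threading.Lock` plays the role of whatever unpicklable resource the real view holds. It only illustrates how `__getstate__` can drop such a member and `__setstate__` can rebuild it on the receiving side.

```python
import pickle
import threading


class LazyView:
    """Hypothetical stand-in for a view that holds an unpicklable resource."""

    def __init__(self, indices):
        self.indices = list(indices)
        # A lock cannot be pickled, much like an open file handle.
        self._lock = threading.Lock()

    def __getstate__(self):
        # Copy the instance dict and drop the unpicklable member.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state):
        # Restore the picklable fields, then rebuild the resource locally.
        self.__dict__.update(state)
        self._lock = threading.Lock()


view = LazyView(range(4))
# Round-trip through pickle, as happens when an object is sent to a worker process.
clone = pickle.loads(pickle.dumps(view))
print(clone.indices)
```

Without the two hooks, `pickle.dumps(view)` would raise a `TypeError` because of the lock, which is the same class of failure the PR describes for `num_workers > 0`.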
3. Helper converter
`batch_dict_converter` converts `AnnCollectionView` → `dict[str, Tensor]` with an `"x"` key for `.X` and one key for each `.obs` column.

Usage
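As a rough, dependency-free illustration of the dict layout described for the helper converter, the sketch below uses hypothetical names throughout: `FakeBatch` stands in for a batch exposing `.X` and `.obs`, `to_batch_dict` is not the PR's `batch_dict_converter`, and plain lists take the place of tensors.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBatch:
    """Hypothetical stand-in for a batch exposing .X and .obs."""

    X: list
    obs: dict = field(default_factory=dict)


def to_batch_dict(batch):
    # "x" holds the feature matrix; every obs column becomes its own key.
    out = {"x": batch.X}
    for column, values in batch.obs.items():
        out[column] = values
    return out


batch = FakeBatch(
    X=[[1.0, 0.0], [0.0, 2.0]],
    obs={"cell_type": ["B", "T"], "batch": [0, 1]},
)
result = to_batch_dict(batch)
print(sorted(result))
```

A callable of this shape is the kind of object the `batch_converter` parameter is described as accepting: it receives a whole batch and returns whatever structure the training loop expects.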
Performance
Negligible overhead for single-threaded use. Primary benefit is enabling parallel data loading for I/O-bound workflows.
Testing
- … `AnnCollectionView`)

Backward compatible: the optional parameter defaults to `None`, so there are no breaking changes.