
scanpy-related local modules use harmonized Seqera image#271

Open
kim-fehl wants to merge 15 commits into nf-core:dev from kim-fehl:containers

Conversation


@kim-fehl kim-fehl commented Mar 16, 2026

Harmonizes the Python-heavy local module stack onto a single Seqera image (harmonypy_anndata_leidenalg_numpy_pruned:43066d5f86f18261) to reduce duplicate image pulls and disk overhead.

```yaml
channels:
- conda-forge
- bioconda
dependencies:
- bioconda::harmonypy=0.2.0
- conda-forge::anndata=0.12.10
- conda-forge::leidenalg=0.11.0
- conda-forge::numpy=2.3.5
- conda-forge::python-igraph=1.0.0
- conda-forge::python=3.13.12
- conda-forge::pyyaml=6.0.3
- conda-forge::scanpy=1.12
- conda-forge::upsetplot=0.9.0
- pip
- pip:
  - symphonypy==0.2.2
```
  • Keeps modules/nf-core/** unchanged; this PR only updates local modules and local subworkflow fallout.
  • Python was pinned to 3.13 since 3.12 triggered a read_pickle()/Categorical... not implemented issue in pandas.
  • numpy was pinned to 2.3.5 since 2.4 introduces major changes that currently break upsetplot.
  • At first I tried to include bbknn in the harmonized image, but reverted to the dedicated bbknn_pyyaml_scanpy image since bbknn requires Python 3.12.
  • Aligns local SCVITOOLS_SCVI and SCVITOOLS_SCANVI to the same scvi-tools:1.3.3 image family already used by nf-core SOLO and SCAR. This version can be further updated uniformly in nf-core and local modules.
  • Replaces legacy custom YAML-writing helpers with yaml.dump(...) in touched local Python modules.
  • Refactors ADATA_UPSETGENES to read .h5ad via anndata directly (not via scanpy), since it only needs gene names (suggested by Codex). This lighter dependency could matter in the future if the module is converted to an nf-core module.
  • Fixes local Harmony compatibility with newer harmonypy by handling both possible Z_corr orientations (see )
  • Snapshot updates for affected modules/subworkflows
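The Z_corr orientation handling mentioned above can be sketched roughly as follows (a hypothetical helper for illustration, not the exact template code; the function name and signature are made up):

```python
import numpy as np

def normalize_z_corr(z_corr: np.ndarray, n_obs: int) -> np.ndarray:
    """Return the Harmony-corrected embedding as (n_obs, n_dims).

    Older harmonypy releases exposed Z_corr as (n_dims, n_obs);
    newer ones return (n_obs, n_dims). Accept either orientation
    so the module works with both harmonypy generations.
    """
    if z_corr.shape[0] == n_obs:
        return z_corr
    if z_corr.shape[1] == n_obs:
        return z_corr.T
    raise ValueError(
        f"Z_corr shape {z_corr.shape} matches n_obs={n_obs} on neither axis"
    )
```

Normalizing the orientation once, right after `run_harmony(...)`, keeps the rest of the template independent of the installed harmonypy version.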

Container disk usage reduced from 60 GB to 45 GB

  • R-related and custom docker hub images have potential for further optimization.
| Image | Disk usage | Content size |
| --- | --- | --- |
| SEQERA: anndata:0.10.9--1eab54e300e1e584 | 900 MB | 195 MB |
| SEQERA: anndata2ri_bioconductor-singlecellexperiment_anndata_r-seurat:5fae42aabf7a1c5f | 3.34 GB | 810 MB |
| SEQERA: bbknn_pyyaml_scanpy:4cf2984722da607f | 1.57 GB | 346 MB |
| SEQERA: bioconductor-anndatar_bioconductor-rhdf5_bioconductor-singlecellexperiment:b7b9571d025f377e | 2.27 GB | 552 MB |
| SEQERA: bioconductor-celldex_bioconductor-hdf5array_bioconductor-singlecellexperiment_r-yaml:13bf33457e3e7490 | 2.29 GB | 598 MB |
| SEQERA: celltypist_scanpy:44b604b24dd4cf33 | 1.74 GB | 395 MB |
| SEQERA: harmonypy_anndata_leidenalg_numpy_pruned:43066d5f86f18261 | 3.26 GB | 723 MB |
| SEQERA: liana_pyyaml:776fdd7103df146d | 2.20 GB | 471 MB |
| SEQERA: multiqc:1.33--ee7739d47738383b | 1.97 GB | 396 MB |
| SEQERA: mygene_anndata_pyyaml:d9454f09fb1f98d5 | 1.06 GB | 234 MB |
| SEQERA: pip_hugo-unifier:bedd626d591c5003 | 1.31 GB | 336 MB |
| SEQERA: python_pyyaml_scanpy_scikit-image:750e7b74b6d036e4 | 2.01 GB | 473 MB |
| SEQERA: pyyaml_pip_doubletdetection:5af145ffec01d7da | 1.83 GB | 504 MB |
| SEQERA: scvi-tools:1.3.3--df115aabdccb7d6b | 4.62 GB | 1.04 GB |
| nicotru/celda:1d48a68e9d534b2b | 2.91 GB | 722 MB |
| nicotru/scds:7788dbeb87bc7eec | 2.44 GB | 614 MB |
| saditya88/singler:0.0.1 | 9.13 GB | 2.64 GB |
| TOTAL | 44.85 GB | 11.05 GB |

Non-version MD5 changes

These are the files with real payload drift:

modules/local/adata/entropy/tests/main.nf.test.snap
modules/local/adata/upsetgenes/tests/main.nf.test.snap
modules/local/doublet_detection/doublet_removal/tests/main.nf.test.snap
modules/local/scanpy/bbknn/tests/main.nf.test.snap
modules/local/scanpy/combat/tests/main.nf.test.snap
modules/local/scanpy/filter/tests/main.nf.test.snap
modules/local/scanpy/harmony/tests/main.nf.test.snap
modules/local/scanpy/leiden/tests/main.nf.test.snap
modules/local/scanpy/neighbors/tests/main.nf.test.snap
modules/local/scanpy/paga/tests/main.nf.test.snap
modules/local/scanpy/pca/tests/main.nf.test.snap
modules/local/scanpy/plotqc/tests/main.nf.test.snap
modules/local/scanpy/rankgenesgroups/tests/main.nf.test.snap
modules/local/scanpy/umap/tests/main.nf.test.snap
subworkflows/local/cluster/tests/main.nf.test.snap
subworkflows/local/integrate/tests/main.nf.test.snap
subworkflows/local/quality_control/tests/main.nf.test.snap

Why they changed (via Codex)

  • SCANPY_HARMONY is a real behavioral change, not just a version bump. In modules/local/scanpy/harmony/templates/harmony.py, the implementation switched from scanpy.external.pp.harmony_integrate(...) to direct harmonypy.run_harmony(...) with shape handling for the Z_corr orientation change. That explains the changed test.h5ad and X_test.pkl, and the downstream changes in subworkflows/local/integrate/tests/main.nf.test.snap.
  • Most of the other changed analysis payloads are consistent with the local stack uplift from older scanpy 1.11.x / older images to the shared py313 image with scanpy 1.12, anndata 0.12.10, pandas 2.3.3, numpy 2.3.5, pyyaml 6.0.3, and harmonypy 0.2.0. That affects graph construction, PCA/UMAP embeddings, Leiden clustering, PAGA connectivities, Combat outputs, and sometimes H5AD serialization details. That explains the drift in pca, neighbors, leiden, paga, umap, combat, entropy, rankgenesgroups, filter, and the workflow-level cluster snapshots.
  • ADATA_UPSETGENES has a direct code change in modules/local/adata/upsetgenes/templates/upsetplot.py: it now reads gene names through anndata instead of scanpy, and it also moved to a newer anndata/upsetplot/matplotlib stack. The changed png is expected. The changed mqc.json is also expected because that JSON embeds the image as base64, so any pixel-level plot change changes the JSON MD5 too.
  • modules/local/doublet_detection/doublet_removal/tests/main.nf.test.snap only shows test_mqc.json drift because that JSON also embeds an upset plot image. The direct code change there was only YAML writing, but the module moved onto the newer anndata/upsetplot runtime, so the rendered image changed and the base64-wrapped JSON changed with it.
  • modules/local/scanpy/plotqc/tests/main.nf.test.snap changed in both the PNG and the MultiQC JSON for the same reason: newer scanpy/matplotlib renders a slightly different QC scatter plot, and the JSON contains the image payload.
  • modules/local/scanpy/bbknn/tests/main.nf.test.snap still has a real h5ad change even after being split back out. That module now pins python=3.12.12 in modules/local/scanpy/bbknn/environment.yml and uses the existing shared bbknn_pyyaml_scanpy image. There was no code change, so this is almost certainly runtime/library/serialization drift from the tested image baseline rather than a logic regression.
  • subworkflows/local/quality_control/tests/main.nf.test.snap changed only in test_raw_mqc.json / test_preprocessed_mqc.json. Those are inherited from changed image-based plot outputs in SCANPY_PLOTQC, and possibly from the changed doublet-removal upset plot when that path is enabled. They are downstream effects, not new workflow logic.
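The base64-embedding effect described above is easy to demonstrate (a minimal stdlib-only sketch; the JSON structure and key names are illustrative, not MultiQC's actual schema):

```python
import base64
import hashlib
import json

def mqc_style_json_md5(image_bytes: bytes) -> str:
    """Build a MultiQC-style JSON that inlines a plot as base64 and hash it."""
    payload = {"plot_data": base64.b64encode(image_bytes).decode("ascii")}
    return hashlib.md5(json.dumps(payload, sort_keys=True).encode()).hexdigest()

# flipping a single byte (one "pixel") changes the embedded base64 string,
# so the JSON's MD5 changes even though no analysis logic changed
md5_a = mqc_style_json_md5(b"\x89PNG...pixel=0")
md5_b = mqc_style_json_md5(b"\x89PNG...pixel=1")
```

This is why every pixel-level plot change in the upset and QC plots also flips the MD5 of the corresponding `*_mqc.json` snapshot.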

Testing

Affected modules and workflows were checked with nf-test --update-snapshot, as well as with a full nextflow run . -profile test,docker.

I also visually inspected the MultiQC reports (left: before, right: after). It seems just one cell could have had an influence...
[two screenshots: MultiQC reports before and after]

Of course all other plots differed, but they were mostly similar up to cluster labels. I suspect this caused most of the payload MD5 discrepancies...

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/scdownstream branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@kim-fehl kim-fehl requested a review from nictru as a code owner March 16, 2026 21:43
@kim-fehl kim-fehl changed the title Containers scanpy-related local modules use harmonized Seqera image Mar 16, 2026
Collaborator

@nictru nictru left a comment


Looks good in general, but I think you still need to update many subworkflows and pipeline snapshots to accommodate the new versions. Let me know if I should run that on my HPC.

Also, before merging, I will think a bit about whether it might be justifiable in this case to create a common label, e.g. process_scanpy, and then have a single environment definition applied to all processes with this label. It would remove a lot of redundant env definitions, but I am not sure it is 100% in line with the guidelines.

@kim-fehl
Author

I think you still need to update many subworkflows and pipeline snapshots to accommodate the new versions. Let me know if I should run that on my HPC.

I used your refresh.sh script, but it only updated these files on top of the already committed ones:

tests/default.nf.test.snap
tests/main_pipeline_build.nf.test.snap
tests/main_pipeline_extend.nf.test.snap
tests/main_pipeline_qc.nf.test.snap
tests/main_pipeline_reference_mapping.nf.test.snap
tests/main_pipeline_sub.nf.test.snap

Although, strangely, some CI test failures are in ADATA_ENTROPY. Could that be related to the caching issues you wrote about in Slack?

The CI will initially fail because the global test snapshot contains a list of expected output files, and changes like adding a new integration method will trigger a mismatch. Updating the caches requires running the test profile ~5 times, as we are testing many different configuration combinations of the pipeline to ensure all relevant use cases work properly and reproducibly.

Regarding the HPC suggestion -- do you mean to also update snapshots for the test_full configuration?

Also, I'm curious what these 25 runs are; it's not obvious from the CI workflow configuration file, and the test matrix seems smaller...

@nictru
Collaborator

nictru commented Mar 17, 2026

The refresh.sh script runs only the pipeline-level tests and updates the snapshots if there are changes, so what you see is expected. It does not touch subworkflow tests. I did it this way because with subworkflows one usually knows which one needs to be updated, while for the pipeline tests it's mostly quicker to just re-run all of them.


About ADATA_ENTROPY: this is an interesting thing that I have also seen before. Currently, my suspicion is that it is caused by floating-point inconsistencies that occur in certain situations. Interestingly, when I see ADATA_ENTROPY failing in CI and click 'Re-run failed jobs', it always passes. The problem is that it is very hard to find out what actually causes the different hashes, as we can't retrieve the output files produced during CI, and even when running on GitHub Codespaces this never happens.
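The floating-point suspicion is plausible: IEEE-754 addition is not associative, so a different accumulation order alone can change output bytes and therefore snapshot MD5s, with nothing logically wrong. A minimal stdlib-only illustration:

```python
import hashlib

# Same three numbers summed in two orders. In IEEE-754 double precision,
# 1e16 + 1.0 rounds back to 1e16 (the 1.0 is absorbed), so order matters.
a = (1e16 - 1e16) + 1.0   # large terms cancel first, the 1.0 survives
b = (1e16 + 1.0) - 1e16   # the 1.0 is absorbed before cancellation

# Identical inputs, different evaluation order, different serialized bytes,
# hence different file-level MD5 checksums in a snapshot test.
h_a = hashlib.md5(repr(a).encode()).hexdigest()
h_b = hashlib.md5(repr(b).encode()).hexdigest()
```

BLAS threading and vectorization differences across systems reorder reductions exactly like this, which would explain hash flakiness that disappears on re-run.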

But I was anyway planning to make the checks for certain modules, including this one, a bit more lenient, so I hope this will not happen anymore after those changes. For now, using the 'Re-run' button should do the job.


The 25 CI jobs are different shards produced by nf-test. Basically, if there are 100 tests in the pipeline, it splits them into 25 batches of 4 and runs each batch in parallel, which is far faster than running all 100 one after another. The set of executed tests stays the same.
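The sharding arithmetic can be sketched as follows (a simple round-robin split for illustration; nf-test's actual shard assignment strategy may differ):

```python
def shard(tests: list[str], num_shards: int) -> list[list[str]]:
    """Split tests into num_shards batches; each shard runs in its own CI job."""
    return [tests[i::num_shards] for i in range(num_shards)]

# 100 tests across 25 shards -> 25 parallel CI jobs of 4 tests each;
# the union of all shards is still the full test set
batches = shard([f"test_{i:03d}" for i in range(100)], 25)
```

The key invariant is that sharding changes only scheduling, never coverage: every test appears in exactly one shard.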


I think I will try to finish this off for you, so that we can close some more of the currently open PRs. Some of the issues are also due to the cross-system inconsistencies that Erik pointed out previously; I can't fix those before merging this PR, so it would be really hard for you to get this sorted anyway. I will try to make this easier to handle once the currently open PRs are dealt with. So far there have not been many contributors, so it wasn't really an issue, but this has changed in the last weeks.

@nictru
Collaborator

nictru commented Mar 17, 2026

Okay, so it seems like the new package versions intensified the numerical inconsistencies across systems, and we can't ignore them anymore. I will add a fix for this directly in this PR.

