
scanpy-related local modules use harmonized Seqera image#271

Open
kim-fehl wants to merge 15 commits into nf-core:dev from kim-fehl:containers

Conversation


@kim-fehl kim-fehl commented Mar 16, 2026

Harmonizes the Python-heavy local module stack onto a single Seqera image (harmonypy_anndata_leidenalg_numpy_pruned:43066d5f86f18261) to reduce duplicate image pulls and disk overhead.

```yaml
channels:
- conda-forge
- bioconda
dependencies:
- bioconda::harmonypy=0.2.0
- conda-forge::anndata=0.12.10
- conda-forge::leidenalg=0.11.0
- conda-forge::numpy=2.3.5
- conda-forge::python-igraph=1.0.0
- conda-forge::python=3.13.12
- conda-forge::pyyaml=6.0.3
- conda-forge::scanpy=1.12
- conda-forge::upsetplot=0.9.0
- pip
- pip:
  - symphonypy==0.2.2
```
  • Keeps modules/nf-core/** unchanged; this PR only updates local modules and local subworkflow fallout.
  • Python was pinned to 3.13 since 3.12 triggered a read_pickle()/Categorical... not implemented issue in pandas.
  • numpy was pinned to 2.3.5 since 2.4 introduces major changes that currently break upsetplot.
  • At first I tried to include bbknn in the harmonized image, but reverted to the dedicated bbknn_pyyaml_scanpy image since bbknn requires Python 3.12.
  • Aligns local SCVITOOLS_SCVI and SCVITOOLS_SCANVI to the same scvi-tools:1.3.3 image family already used by nf-core SOLO and SCAR. This version can be further updated uniformly in nf-core and local modules.
  • Replaces legacy custom YAML-writing helpers with yaml.dump(...) in touched local Python modules.
  • Refactors ADATA_UPSETGENES to read .h5ad via anndata directly (not via scanpy), since it only needs gene names (suggested by Codex). This lighter dependency could matter in the future if the module is converted to an nf-core module.
  • Fixes local Harmony compatibility with newer harmonypy by handling both possible Z_corr orientations (see )
  • Snapshot updates for affected modules/subworkflows
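The Z_corr orientation handling mentioned above can be sketched roughly as follows (a hypothetical helper for illustration, not the exact template code; the function name and signature are made up):

```python
import numpy as np

def normalize_z_corr(z_corr: np.ndarray, n_obs: int) -> np.ndarray:
    """Return the Harmony-corrected embedding as (n_obs, n_dims).

    Older harmonypy releases exposed Z_corr as (n_dims, n_obs);
    newer ones return (n_obs, n_dims). Accept either orientation
    so the module works with both harmonypy generations.
    """
    if z_corr.shape[0] == n_obs:
        return z_corr
    if z_corr.shape[1] == n_obs:
        return z_corr.T
    raise ValueError(
        f"Z_corr shape {z_corr.shape} matches n_obs={n_obs} on neither axis"
    )
```

Normalizing the orientation once, right after `run_harmony(...)`, keeps the rest of the template independent of the installed harmonypy version.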

Container disk usage reduced from 60 GB to 45 GB

  • R-related and custom docker hub images have potential for further optimization.
| Image | Disk usage | Content size |
| --- | --- | --- |
| SEQERA: anndata:0.10.9--1eab54e300e1e584 | 900 MB | 195 MB |
| SEQERA: anndata2ri_bioconductor-singlecellexperiment_anndata_r-seurat:5fae42aabf7a1c5f | 3.34 GB | 810 MB |
| SEQERA: bbknn_pyyaml_scanpy:4cf2984722da607f | 1.57 GB | 346 MB |
| SEQERA: bioconductor-anndatar_bioconductor-rhdf5_bioconductor-singlecellexperiment:b7b9571d025f377e | 2.27 GB | 552 MB |
| SEQERA: bioconductor-celldex_bioconductor-hdf5array_bioconductor-singlecellexperiment_r-yaml:13bf33457e3e7490 | 2.29 GB | 598 MB |
| SEQERA: celltypist_scanpy:44b604b24dd4cf33 | 1.74 GB | 395 MB |
| SEQERA: harmonypy_anndata_leidenalg_numpy_pruned:43066d5f86f18261 | 3.26 GB | 723 MB |
| SEQERA: liana_pyyaml:776fdd7103df146d | 2.20 GB | 471 MB |
| SEQERA: multiqc:1.33--ee7739d47738383b | 1.97 GB | 396 MB |
| SEQERA: mygene_anndata_pyyaml:d9454f09fb1f98d5 | 1.06 GB | 234 MB |
| SEQERA: pip_hugo-unifier:bedd626d591c5003 | 1.31 GB | 336 MB |
| SEQERA: python_pyyaml_scanpy_scikit-image:750e7b74b6d036e4 | 2.01 GB | 473 MB |
| SEQERA: pyyaml_pip_doubletdetection:5af145ffec01d7da | 1.83 GB | 504 MB |
| SEQERA: scvi-tools:1.3.3--df115aabdccb7d6b | 4.62 GB | 1.04 GB |
| nicotru/celda:1d48a68e9d534b2b | 2.91 GB | 722 MB |
| nicotru/scds:7788dbeb87bc7eec | 2.44 GB | 614 MB |
| saditya88/singler:0.0.1 | 9.13 GB | 2.64 GB |
| TOTAL | 44.85 GB | 11.05 GB |

Non-version MD5 changes

These are the files with real payload drift:

modules/local/adata/entropy/tests/main.nf.test.snap
modules/local/adata/upsetgenes/tests/main.nf.test.snap
modules/local/doublet_detection/doublet_removal/tests/main.nf.test.snap
modules/local/scanpy/bbknn/tests/main.nf.test.snap
modules/local/scanpy/combat/tests/main.nf.test.snap
modules/local/scanpy/filter/tests/main.nf.test.snap
modules/local/scanpy/harmony/tests/main.nf.test.snap
modules/local/scanpy/leiden/tests/main.nf.test.snap
modules/local/scanpy/neighbors/tests/main.nf.test.snap
modules/local/scanpy/paga/tests/main.nf.test.snap
modules/local/scanpy/pca/tests/main.nf.test.snap
modules/local/scanpy/plotqc/tests/main.nf.test.snap
modules/local/scanpy/rankgenesgroups/tests/main.nf.test.snap
modules/local/scanpy/umap/tests/main.nf.test.snap
subworkflows/local/cluster/tests/main.nf.test.snap
subworkflows/local/integrate/tests/main.nf.test.snap
subworkflows/local/quality_control/tests/main.nf.test.snap

Why they changed (via Codex)

  • SCANPY_HARMONY is a real behavioral change, not just a version bump. In modules/local/scanpy/harmony/templates/harmony.py, the implementation switched from scanpy.external.pp.harmony_integrate(...) to direct harmonypy.run_harmony(...) with shape handling for the Z_corr orientation change. That explains the changed test.h5ad and X_test.pkl, and the downstream changes in subworkflows/local/integrate/tests/main.nf.test.snap.
  • Most of the other changed analysis payloads are consistent with the local stack uplift from older scanpy 1.11.x / older images to the shared py313 image with scanpy 1.12, anndata 0.12.10, pandas 2.3.3, numpy 2.3.5, pyyaml 6.0.3, and harmonypy 0.2.0. That affects graph construction, PCA/UMAP embeddings, Leiden clustering, PAGA connectivities, Combat outputs, and sometimes H5AD serialization details. That explains the drift in pca, neighbors, leiden, paga, umap, combat, entropy, rankgenesgroups, filter, and the workflow-level cluster snapshots.
  • ADATA_UPSETGENES has a direct code change in modules/local/adata/upsetgenes/templates/upsetplot.py: it now reads gene names through anndata instead of scanpy, and it also moved to a newer anndata/upsetplot/matplotlib stack. The changed png is expected. The changed mqc.json is also expected because that JSON embeds the image as base64, so any pixel-level plot change changes the JSON MD5 too.
  • modules/local/doublet_detection/doublet_removal/tests/main.nf.test.snap only shows test_mqc.json drift because that JSON also embeds an upset plot image. The direct code change there was only YAML writing, but the module moved onto the newer anndata/upsetplot runtime, so the rendered image changed and the base64-wrapped JSON changed with it.
  • modules/local/scanpy/plotqc/tests/main.nf.test.snap changed in both the PNG and the MultiQC JSON for the same reason: newer scanpy/matplotlib renders a slightly different QC scatter plot, and the JSON contains the image payload.
  • modules/local/scanpy/bbknn/tests/main.nf.test.snap still has a real h5ad change even after being split back out. That module now pins python=3.12.12 in modules/local/scanpy/bbknn/environment.yml and uses the existing shared bbknn_pyyaml_scanpy image. There was no code change, so this is almost certainly runtime/library/serialization drift from the tested image baseline rather than a logic regression.
  • subworkflows/local/quality_control/tests/main.nf.test.snap changed only in test_raw_mqc.json / test_preprocessed_mqc.json. Those are inherited from changed image-based plot outputs in SCANPY_PLOTQC, and possibly from the changed doublet-removal upset plot when that path is enabled. They are downstream effects, not new workflow logic.
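The base64-embedding effect described above is easy to demonstrate (a minimal stdlib-only sketch; the JSON structure and key names are illustrative, not MultiQC's actual schema):

```python
import base64
import hashlib
import json

def mqc_style_json_md5(image_bytes: bytes) -> str:
    """Build a MultiQC-style JSON that inlines a plot as base64 and hash it."""
    payload = {"plot_data": base64.b64encode(image_bytes).decode("ascii")}
    return hashlib.md5(json.dumps(payload, sort_keys=True).encode()).hexdigest()

# flipping a single byte (one "pixel") changes the embedded base64 string,
# so the JSON's MD5 changes even though no analysis logic changed
md5_a = mqc_style_json_md5(b"\x89PNG...pixel=0")
md5_b = mqc_style_json_md5(b"\x89PNG...pixel=1")
```

This is why every pixel-level plot change in the upset and QC plots also flips the MD5 of the corresponding `*_mqc.json` snapshot.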

Testing

Affected modules and workflows were checked with nf-test --update-snapshot, as well as with a full nextflow run . -profile test,docker.

I also visually inspected the MultiQC reports (left: before, right: after). It seems just one cell could have had an influence...
[two screenshots: MultiQC reports before and after]

Of course all other plots differed, but they were mostly similar up to cluster labels. I suspect this caused most of the payload MD5 discrepancies...

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/scdownstream branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@kim-fehl kim-fehl requested a review from nictru as a code owner March 16, 2026 21:43
@kim-fehl kim-fehl changed the title Containers scanpy-related local modules use harmonized Seqera image Mar 16, 2026
Collaborator

@nictru nictru left a comment


Looks good in general, but I think you still need to update many subworkflows and pipeline snapshots to accommodate the new versions. Let me know if I should run that on my HPC.

Also, before merging, I will think a bit about whether it might be justifiable in this case to create a common label, e.g. process_scanpy, and then have a single environment definition applied to all processes with this label. It would remove a lot of redundant env definitions, but I am not sure it is 100% in line with the guidelines.

@kim-fehl
Author

I think you still need to update many subworkflows and pipeline snapshots to accommodate the new versions. Let me know if I should run that on my HPC.

I used your refresh.sh script, but it only updated these files on top of the already committed ones:

tests/default.nf.test.snap
tests/main_pipeline_build.nf.test.snap
tests/main_pipeline_extend.nf.test.snap
tests/main_pipeline_qc.nf.test.snap
tests/main_pipeline_reference_mapping.nf.test.snap
tests/main_pipeline_sub.nf.test.snap

Although, strangely, some CI test failures are in ADATA_ENTROPY. Could that be related to the caching issues you wrote about in Slack?

The CI will initially fail because the global test snapshot contains a list of expected output files, and changes like adding a new integration method will trigger a mismatch. Updating the caches requires running the test profile ~5 times, as we are testing many different configuration combinations of the pipeline to ensure all relevant use cases work properly and reproducibly.

Regarding the HPC suggestion -- do you mean to also update snapshots for the test_full configuration?

Also, I'm curious what these 25 runs are; it's not obvious from the CI workflow configuration file, and the test matrix seems smaller...

@nictru
Collaborator

nictru commented Mar 17, 2026

The refresh.sh script runs only the pipeline-level tests and updates the snapshots if there are changes, so what you see is expected. It does not touch subworkflow tests. I did it this way because with subworkflows one usually knows which one needs to be updated, while for the pipeline tests it's mostly quicker to just re-run all of them.


About ADATA_ENTROPY: this is an interesting thing that I have also seen before. Currently, my suspicion is that it is caused by floating-point inconsistencies that occur in certain situations. Interestingly, when I see ADATA_ENTROPY failing in CI and click 'Re-run failed jobs', it always passes. The problem is that it is very hard to find out what actually causes the different hashes, as we can't retrieve the output files produced during CI, and even when running on GitHub Codespaces this never happens.
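The floating-point suspicion is plausible: IEEE-754 addition is not associative, so a different accumulation order alone can change output bytes and therefore snapshot MD5s, with nothing logically wrong. A minimal stdlib-only illustration:

```python
import hashlib

# Same three numbers summed in two orders. In IEEE-754 double precision,
# 1e16 + 1.0 rounds back to 1e16 (the 1.0 is absorbed), so order matters.
a = (1e16 - 1e16) + 1.0   # large terms cancel first, the 1.0 survives
b = (1e16 + 1.0) - 1e16   # the 1.0 is absorbed before cancellation

# Identical inputs, different evaluation order, different serialized bytes,
# hence different file-level MD5 checksums in a snapshot test.
h_a = hashlib.md5(repr(a).encode()).hexdigest()
h_b = hashlib.md5(repr(b).encode()).hexdigest()
```

BLAS threading and vectorization differences across systems reorder reductions exactly like this, which would explain hash flakiness that disappears on re-run.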

But I was anyway planning to make the checks for certain modules, including this one, a bit more lenient, so I hope this will not happen anymore after those changes. For now, using the 'Re-run' button should do the job.


The 25 CI jobs are different shards produced by nf-test. Basically, if there are 100 tests in the pipeline, it splits them into 25 batches of 4 and runs each batch in parallel, which is far faster than running all 100 one after another. The set of executed tests stays the same.
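The sharding arithmetic can be sketched as follows (a simple round-robin split for illustration; nf-test's actual shard assignment strategy may differ):

```python
def shard(tests: list[str], num_shards: int) -> list[list[str]]:
    """Split tests into num_shards batches; each shard runs in its own CI job."""
    return [tests[i::num_shards] for i in range(num_shards)]

# 100 tests across 25 shards -> 25 parallel CI jobs of 4 tests each;
# the union of all shards is still the full test set
batches = shard([f"test_{i:03d}" for i in range(100)], 25)
```

The key invariant is that sharding changes only scheduling, never coverage: every test appears in exactly one shard.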


I think I will try to finish this off for you, so that we can close some more of the currently open PRs. Some of the issues are also due to the cross-system inconsistencies that Erik pointed out previously; I can't fix those before merging this PR, so it would be really hard for you to get this sorted anyway. I will try to make this easier to handle once the currently open PRs are dealt with. So far there have not been many contributors, so it wasn't really an issue, but this has changed in the last weeks.

@nictru
Collaborator

nictru commented Mar 17, 2026

Okay, so it seems like the new package versions intensified the numerical inconsistencies across systems, and we can't ignore them anymore. I will add a fix for this directly in this PR.

