scanpy-related local modules use harmonized Seqera image #271
kim-fehl wants to merge 15 commits into nf-core:dev
…eps" This reverts commit 0e47b8a.
…ategorical issue, exclude bbknn requiring py3.12
…tools 1.3.3 image
nictru left a comment
Looks good in general, but I think you still need to update many subworkflows and pipeline snapshots to accommodate the new versions. Let me know if I should run that on my HPC.
Also, before merging I will think a bit about whether it might be justifiable in this case to create a common label, e.g. `process_scanpy`, and then have a single environment definition that is applied to all processes with this label. It would remove a lot of redundant env definitions, but I am not sure if it is 100% in line with the guidelines.
I used your `refresh.sh` script, but it only updated these files on top of those already committed:
Some CI test failures are in `ADATA_ENTROPY`, though, which is strange. Could it be related to the caching issues you wrote about in Slack?
Regarding the HPC suggestion: do you mean to also update snapshots for the
Also, I'm curious what these 25 runs are. It's not obvious from the CI workflow configuration file, where the test matrix seems smaller...
The About But I was planning anyway to make the checks for certain modules, including this one, a bit more lenient, so I hope this will not happen anymore after these changes. For now, using the 're-run' button should do the job.

The 25 CI jobs are different shards produced by nf-test. Basically, if there are 100 tests in the pipeline, it splits them into 25 batches of 4 and runs each batch in parallel, which is much faster than running the 100 one after another. The set of executed tests stays the same.

I think I will try to finish this off for you, so that we can close some more of the currently open PRs. Some of the issues are also due to cross-system inconsistencies that Erik pointed out previously, but I can't fix them before merging this PR, so it would be really hard for you to get this sorted anyway. I will try to make this easier to fulfill once the currently open PRs are handled. So far there have not been many contributors, so it wasn't really an issue, but this has changed in the last weeks.
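To picture the sharding described above (100 tests split into 25 batches of 4), here is a minimal Python sketch. It is only an illustration: the function name `shard_tests` and the contiguous-chunk assignment are assumptions, and nf-test's real shard assignment may work differently.

```python
def shard_tests(tests, num_shards):
    """Split a list of tests into num_shards contiguous batches of near-equal size.

    Hypothetical helper for illustration; not nf-test's actual algorithm.
    """
    base, extra = divmod(len(tests), num_shards)
    shards = []
    start = 0
    for index in range(num_shards):
        # The first `extra` shards take one additional test when the split is uneven.
        size = base + (1 if index < extra else 0)
        shards.append(tests[start:start + size])
        start += size
    return shards


# Hypothetical test names, just to have 100 items to split.
tests = [f"test_{i:03d}" for i in range(100)]
shards = shard_tests(tests, 25)

# Every test ends up in exactly one shard, so the executed set is unchanged.
executed = [name for shard in shards for name in shard]
```

Each of the 25 shards can then run as its own CI job, which is why the job count is larger than the test matrix in the workflow file suggests.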
Okay, so it seems the new package version intensified the numerical inconsistencies across systems, and we can't ignore them anymore. I will add a fix for this directly to this issue.
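As a rough illustration of why exact checksums are fragile under such cross-system numerical noise, the sketch below compares two float vectors once by MD5 and once within a tolerance. The helper names and the tolerance values are hypothetical and are not taken from the pipeline's actual snapshot checks.

```python
import hashlib
import math
import struct


def md5_of_floats(values):
    """MD5 of the raw little-endian float64 bytes of a sequence of floats."""
    payload = struct.pack(f"<{len(values)}d", *values)
    return hashlib.md5(payload).hexdigest()


def all_close(a, b, rel_tol=1e-6, abs_tol=1e-9):
    """Element-wise tolerance comparison; tolerances are illustrative assumptions."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol) for x, y in zip(a, b)
    )


reference = [0.1, 0.2, 0.3]
# Simulate a tiny platform-dependent rounding difference.
other_system = [value + 1e-12 for value in reference]

md5_matches = md5_of_floats(reference) == md5_of_floats(other_system)
within_tolerance = all_close(reference, other_system)
```

Here the MD5 comparison fails while the tolerance comparison passes, which is the kind of leniency a more robust check would need.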
Harmonizes Python-heavy local module stack onto one Seqera image (`harmonypy_anndata_leidenalg_numpy_pruned:43066d5f86f18261`) to reduce duplicate image pulls and disk overhead.

- `modules/nf-core/**` unchanged; this PR only updates local modules and local subworkflow fallout.
- `read_pickle()` / `Categorical... not implemented` issue in `pandas`
- `numpy` was pinned to 2.3.5 since 2.4 introduces major changes breaking `upsetplot` for now
- `bbknn` into harmonized image, but reverted to the dedicated `bbknn_pyyaml_scanpy` image since `bbknn` requires Python 3.12
- `SCVITOOLS_SCVI` and `SCVITOOLS_SCANVI` to the same `scvi-tools:1.3.3` image family already used by nf-core `SOLO` and `SCAR`. This version can be further updated uniformly in nf-core and local modules.
- `yaml.dump(...)` in touched local Python modules.
- `ADATA_UPSETGENES` to read `.h5ad` via `anndata` directly (not via `scanpy`), since it only needs gene names (suggested by Codex). This lighter dependency can be relevant in the future if the module is converted to an nf-core module.
- `harmonypy` by handling both possible `Z_corr` orientations (see )

Containers disk usage reduced from 60 to 45 GB
- SEQERA: `anndata:0.10.9--1eab54e300e1e584`
- SEQERA: `anndata2ri_bioconductor-singlecellexperiment_anndata_r-seurat:5fae42aabf7a1c5f`
- SEQERA: `bbknn_pyyaml_scanpy:4cf2984722da607f`
- SEQERA: `bioconductor-anndatar_bioconductor-rhdf5_bioconductor-singlecellexperiment:b7b9571d025f377e`
- SEQERA: `bioconductor-celldex_bioconductor-hdf5array_bioconductor-singlecellexperiment_r-yaml:13bf33457e3e7490`
- SEQERA: `celltypist_scanpy:44b604b24dd4cf33`
- SEQERA: `harmonypy_anndata_leidenalg_numpy_pruned:43066d5f86f18261`
- SEQERA: `liana_pyyaml:776fdd7103df146d`
- SEQERA: `multiqc:1.33--ee7739d47738383b`
- SEQERA: `mygene_anndata_pyyaml:d9454f09fb1f98d5`
- SEQERA: `pip_hugo-unifier:bedd626d591c5003`
- SEQERA: `python_pyyaml_scanpy_scikit-image:750e7b74b6d036e4`
- SEQERA: `pyyaml_pip_doubletdetection:5af145ffec01d7da`
- SEQERA: `scvi-tools:1.3.3--df115aabdccb7d6b`
- `nicotru/celda:1d48a68e9d534b2b`
- `nicotru/scds:7788dbeb87bc7eec`
- `saditya88/singler:0.0.1`

Non-version MD5 changes
These are the files with real payload drift:
Why they changed (via Codex)
- `SCANPY_HARMONY` is a real behavioral change, not just a version bump. In `modules/local/scanpy/harmony/templates/harmony.py`, the implementation switched from `scanpy.external.pp.harmony_integrate(...)` to direct `harmonypy.run_harmony(...)` with shape handling for the `Z_corr` orientation change. That explains the changed `test.h5ad` and `X_test.pkl`, and the downstream changes in `subworkflows/local/integrate/tests/main.nf.test.snap`.
- `scanpy 1.11.x` / older images to the shared py313 image with `scanpy 1.12`, `anndata 0.12.10`, `pandas 2.3.3`, `numpy 2.3.5`, `pyyaml 6.0.3`, and `harmonypy 0.2.0`. That affects graph construction, PCA/UMAP embeddings, Leiden clustering, PAGA connectivities, Combat outputs, and sometimes H5AD serialization details. That explains the drift in `pca`, `neighbors`, `leiden`, `paga`, `umap`, `combat`, `entropy`, `rankgenesgroups`, `filter`, and the workflow-level `cluster` snapshots.
- `ADATA_UPSETGENES` has a direct code change in `modules/local/adata/upsetgenes/templates/upsetplot.py`: it now reads gene names through `anndata` instead of `scanpy`, and it also moved to a newer `anndata` / `upsetplot` / `matplotlib` stack. The changed `png` is expected. The changed `mqc.json` is also expected because that JSON embeds the image as base64, so any pixel-level plot change changes the JSON MD5 too.
- `modules/local/doublet_detection/doublet_removal/tests/main.nf.test.snap` only shows `test_mqc.json` drift because that JSON also embeds an upset plot image. The direct code change there was only YAML writing, but the module moved onto the newer `anndata` / `upsetplot` runtime, so the rendered image changed and the base64-wrapped JSON changed with it.
- `modules/local/scanpy/plotqc/tests/main.nf.test.snap` changed in both the PNG and the MultiQC JSON for the same reason: newer `scanpy` / `matplotlib` renders a slightly different QC scatter plot, and the JSON contains the image payload.
- `modules/local/scanpy/bbknn/tests/main.nf.test.snap` still has a real `h5ad` change even after being split back out. That module now pins `python=3.12.12` in `modules/local/scanpy/bbknn/environment.yml` and uses the existing shared `bbknn_pyyaml_scanpy` image. There was no code change, so this is almost certainly runtime/library/serialization drift from the tested image baseline rather than a logic regression.
- `subworkflows/local/quality_control/tests/main.nf.test.snap` changed only in `test_raw_mqc.json` / `test_preprocessed_mqc.json`. Those are inherited from changed image-based plot outputs in `SCANPY_PLOTQC`, and possibly from the changed doublet-removal upset plot when that path is enabled. They are downstream effects, not new workflow logic.

Testing
Affected modules and workflows were checked with `nf-test --update-snapshot`, as well as a full `nextflow run . -profile test,docker`.

I also visually inspected the MultiQC reports (left: before, right: after). It seems that even a single cell can make a difference...

Of course, all the other plots differed as well, but they were mostly similar up to cluster labels. I guess this caused most of the payloads' MD5 discrepancies...
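To make the MD5 point concrete, here is a small Python sketch of why a one-pixel plot change alters the checksum of a JSON file that embeds the image as base64. The JSON keys (`id`, `plot_type`, `data`) and the function name are hypothetical placeholders, not the actual MultiQC custom-content schema, and the byte strings stand in for real PNG files.

```python
import base64
import hashlib
import json


def embedded_image_json_md5(image_bytes):
    """MD5 of a JSON document that wraps an image as base64 text.

    The key names below are illustrative assumptions only.
    """
    payload = {
        "id": "example_plot",
        "plot_type": "image",
        "data": base64.b64encode(image_bytes).decode("ascii"),
    }
    text = json.dumps(payload, sort_keys=True)
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Stand-in for a rendered plot: 16 RGBA pixels.
before = bytes([0, 0, 0, 255] * 16)

# Same "plot" with a single channel of a single pixel changed.
after = bytearray(before)
after[0] = 1
after = bytes(after)

md5_before = embedded_image_json_md5(before)
md5_after = embedded_image_json_md5(after)
```

The two checksums differ even though the images are visually almost identical, so a snapshot that stores the JSON MD5 will drift whenever the renderer changes any pixel.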
PR checklist
- (`nf-core pipelines lint`).
- (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).