Slim container images for #161 #200
Open
werner291 wants to merge 2 commits into RIVM-bioinformatics:main



Hi! Saw #161 was open and took a look. Conservative on purpose: only changes that can't plausibly affect runtime behaviour. Sizes below come from fresh same-day builds of upstream `main` and this branch against the same `mambaorg/micromamba:latest` digest; build commands and environment are at the bottom under "Reproducing these numbers."

What's eating the space
- `/opt/conda/x86_64-conda-linux-gnu` (`python=3.12`, never used at runtime)
- `/opt/conda/pkgs`
- `/opt/conda/include`
- `apt-get install adduser` chain (`adduser`, `useradd`)
- `micromamba install git` (`git+...` pip)
- `share/doc` + `share/man` + `share/info`
- `lib/jvm/lib/src.zip`
- multiqc's opt-in extras: `kaleido` (static plot export), `tiktoken` (AI hints), `spectra` (advanced colour spaces)
- fastqc's bundled JDK
- `pip` + `setuptools` + `wheel` (`pkg_resources`)

A note on the `x86_64-conda-linux-gnu` row: it only shows up on Alignment because that's the only env yaml pinning `python=3.12`, which drags the toolchain in. The others use `python=3.10` and don't.

Footnote on `pkg_resources`: setuptools >=82 stopped shipping it as a top-level package, and the baseline images already don't expose it that way; if anything in the installed dep tree imported it at import time, the baseline would already be broken. So removing pip/setuptools/wheel is probably safe. Kept anyway as a precaution, since something could still lazy-import `pkg_resources` only on a code path I haven't walked.

What this PR removes (safe)
The marked rows above. None of these can affect what any tool does at runtime:
- `useradd` instead of `apt-get install adduser`. Same end state, no apt index round-trip.
- `micromamba install git` in the 5 envs that have no `git+...` pip URL.
- `rm -rf /opt/conda/{pkgs,x86_64-conda-linux-gnu,include}` and `share/{doc,man,info}`.
- `rm -f /opt/conda/lib/jvm/lib/src.zip` on the two containers that ship a JDK (Clean, core_scripts). Pure source archive for IDE step-through; FastQC has no code path that reads it.

A note on `rm -rf /opt/conda/pkgs`: `pkgs/` is the conda package cache. After install, the live env (`lib/`, `bin/`, ...) holds its own copies or hardlinks of the cached files, and conda's "what's installed" metadata lives in `/opt/conda/conda-meta/`, not in `pkgs/`. So removing `pkgs/` only breaks things if something in the live env still points back into it (a stray symlink, a hardcoded path in a wrapper script). Spot-checked across the Alignment / Clean / core_scripts baselines: both checks come back zero. Conda-forge's official miniforge Docker image does the equivalent (`conda clean --force-pkgs-dirs --all --yes`).

Sizes:
I `--version`-checked each slim image on its primary tool. Didn't find container-level tests in the repo beyond `tests/e2e/test_e2e.py`, which I haven't run.

5 of 6 containers. `containers/Consensus.dockerfile` doesn't build cleanly on `main` today (gcc missing while pip-compiling biopython during the TrueConsense install); will file separately.

Looked at, didn't do
Each of these would save more, but each could break a lazy-import or read-only-fs code path I haven't ruled out:
- `__pycache__` (~120 MB on python-heavy envs). Under singularity / read-only `.sif`, packages can't regenerate `.pyc` on first import; some packages that lazy-import a side-module hit a write-attempt and slow down or warn.
- `tests/` directories (~55 MB). A small number of python packages legitimately import from their own `tests/` at runtime.
- `share/locale` (~15 MB). Tools using `locale.gen` may behave subtly differently.
- `*.a` / `*.la` static archives. Low risk, but assumes nothing JIT-links at runtime.

Bigger changes (not in this PR)
The unmarked rows above are where the biggest remaining bulk lives. None of them can be trimmed from a Dockerfile alone because they live inside conda-managed packages; each needs a patched recipe or a different upstream choice:
- multiqc's opt-in extras `kaleido` / `tiktoken` / `spectra`. Worth it only if multiqc's HTML report is what you actually use; AWS / parquet / static-image users would lose features.
- Two polars runtimes (`_polars_runtime_compat` and `_polars_runtime_32`), each ~205 MB; deleting one is a one-liner. OpenBLAS and parasail need a recipe-level recompile.
- fastqc's bundled JDK (once `src.zip` is gone): either a stripped JRE built with `jlink` against the modules FastQC actually uses, shipped via a custom recipe, or a different conda recipe altogether. Note that the two containers ship different major JDK versions (openjdk 11 on Clean, openjdk 25 on core_scripts), driven by what each env yaml pulls in.
- `pip` + `setuptools` + `wheel`: `pkg_resources` is shipped by setuptools and is imported by various scientific Python packages. See footnote on the table.

Other findings while preparing this PR
Off-topic for the size question but turned up while poking at the container pipeline. Each could be something you already deal with; flagged in case any is news. Happy to file separately or skip:
- `Consensus.dockerfile` appears not to build cleanly on `main` today. In my reproduction, `pip install git+...TrueConsense` fails with `gcc: No such file or directory` while compiling biopython from source. The published `.sif` may keep working because `containers/build_containers.py:50-52` skips builds when the env hash is already in the upstream registry (`if VersionHash in tags: ... continue`).
- `tests/e2e/test_e2e.py` may not import cleanly on a fresh install. The chain `tests/e2e/test_e2e.py` → `ViroConstrictor.__main__` → `workflow_executor` → `workflow_config` ends in `from snakemake.resources import DefaultResources` at `workflow_config.py:23`. That class has been removed from `snakemake.resources` in newer snakemake (gone in v9.20, present in v9.5). `workflow.smk` declares `min_version("9.5")` and `env.yml` pins `snakemake-minimal==9.5.*`, so as long as you install from the pinned env you're fine; a `pip install ViroConstrictor` outside that env would resolve a newer snakemake and the import breaks.
- `.github/workflows/build_and_test.yml` is titled "build containers and run tests" but the steps build, zip, download apptainer artefacts, install ViroConstrictor, and end with the comment `## rest of the testing suite here`. There's no step that actually runs anything against the result. Possible I missed a chained workflow.
- `pkg_resources` is no longer importable as a top-level package in the current baseline images (setuptools >=82 dropped it). See footnote on the table above.

Going further: actual reproducibility
This PR is about size, not reproducibility, but they come up together. Absolute MB drifts between rebuilds (the appendix explains why); the saved-percent is what's stable. If you want to reduce that dep-version churn, the change is separate from this PR and roughly two steps:
- `FROM mambaorg/micromamba@sha256:...` instead of `:latest`.
- Replace `micromamba install -f /install.yml` with `micromamba create -f /lock.lock`, where `lock.lock` is the output of `micromamba env export --explicit`, checked into the repo and bumped on dep updates.

That stops bioconda from re-solving transitive deps on every rebuild and keeps the base image fixed, so two rebuilds a month apart install the same package versions. Image bytes still won't be bit-identical (`.pyc` mtimes, BuildKit layer hashes, gzip/tar timestamps, the pip-install-from-git in Consensus) but those are smaller, separately-tractable issues.

Happy to address feedback or split this PR into per-container commits.
Werner
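
Sketch: automating the `pkgs/` spot-check. The note on `rm -rf /opt/conda/pkgs` names two ways the live env could still point back into the cache: a stray symlink and a hardcoded path in a wrapper script. Below is a minimal Python sketch of both checks, not the commands actually used for this PR. `find_pkgs_backrefs` is a hypothetical helper name. Restricting the hardcoded-path scan to files that start with a shebang is an assumption made here, so that metadata files which may legitimately record cache paths (for example under `conda-meta/`) are not flagged.

```python
import os
from pathlib import Path


def find_pkgs_backrefs(prefix):
    """Return (symlinks, scripts) in the live env that still reference prefix/pkgs.

    Hypothetical helper, for illustration only.
    """
    prefix = Path(os.path.realpath(prefix))
    pkgs = prefix / "pkgs"
    needle = os.fsencode(str(pkgs) + os.sep)
    stale_links, stale_scripts = [], []

    for root, dirs, files in os.walk(prefix):
        # Do not descend into the cache itself; only the live env matters.
        if Path(root) == prefix and "pkgs" in dirs:
            dirs.remove("pkgs")
        for name in dirs + files:
            path = Path(root, name)
            if path.is_symlink():
                # realpath also resolves dangling links, which is what a
                # link into an already-deleted pkgs/ would be.
                target = Path(os.path.realpath(path))
                if pkgs in target.parents:
                    stale_links.append(path)
            elif path.is_file():
                with open(path, "rb") as fh:
                    if fh.read(2) != b"#!":
                        continue  # assumption: only shebang wrappers are scanned
                    if needle in fh.read():
                        stale_scripts.append(path)
    return sorted(stale_links), sorted(stale_scripts)
```

Run inside a baseline image as `find_pkgs_backrefs("/opt/conda")` before deleting the cache; two empty lists would correspond to "both checks come back zero". Binaries with an embedded cache path are outside what this sketch covers.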