Slim container images for #161 (#200)

Open

werner291 wants to merge 2 commits into RIVM-bioinformatics:main from werner291:experiment/161-container-slimming

Conversation

@werner291 werner291 commented May 3, 2026

Hi! Saw #161 was open and took a look. Conservative on purpose: only changes that can't plausibly affect runtime behaviour. Sizes below come from fresh same-day builds of upstream main and this branch against the same mambaorg/micromamba:latest digest; build commands and environment are at the bottom under "Reproducing these numbers."

What's eating the space

| Source of bulk | Size | What it is | This PR? |
| --- | --- | --- | --- |
| `/opt/conda/x86_64-conda-linux-gnu` | 482 MB on Alignment, absent on the others | cross-compile toolchain (gcc, binutils, sysroot); pulled in by `python=3.12`, never used at runtime | removed |
| `/opt/conda/pkgs` | 456 MB on Alignment, up to 2.1 GB on Clean | downloaded package tarballs, kept after install | removed |
| `/opt/conda/include` | 0.1 MB on Alignment, ~29 MB on Clean | C/C++ headers, compile-time only | removed |
| `apt-get install adduser` chain | ~50 MB | apt + perl deps, just to call `adduser` | swapped to `useradd` |
| `micromamba install git` | ~50 MB | only Consensus's env actually needs `git+...` pip installs | dropped from the 5 envs that don't |
| `share/doc` + `share/man` + `share/info` | ~33 MB on Clean (27+5+1), ~6 MB on Alignment | docs, never opened in a container | removed |
| `lib/jvm/lib/src.zip` | 56 MB on Clean (openjdk 11), 51 MB on core_scripts (openjdk 25) | bundled JDK source archive, only used by IDE/debugger source step-through into JDK internals | removed |
| multiqc's opt-in extras | ~200 MB on Clean | kaleido (static plot export), tiktoken (AI hints), spectra (advanced colour spaces) | would need a patched bioconda recipe or a private channel |
| Multi-SIMD libs (polars, OpenBLAS, parasail) | ~463 MB on Clean (410 polars + 40 OpenBLAS + 13 parasail) | runtime CPU dispatch built into upstream, which is what keeps the container portable across CPUs from x86-64-v1 to v3+ | kept on purpose (see below) |
| fastqc's bundled JDK | ~313 MB on Clean (openjdk 11), ~370 MB on core_scripts (openjdk 25), after `src.zip` is gone | full JDK; FastQC uses a small subset of modules | needs a custom JRE / recipe |
| pip + setuptools + wheel | tens of MB | safe-ish to drop, but see footnote on `pkg_resources` | not done |

A note on the x86_64-conda-linux-gnu row: it only shows up on Alignment because that's the only env yaml pinning python=3.12, which drags the toolchain in. The others use python=3.10 and don't.

Footnote on pkg_resources: setuptools >=82 stopped shipping it as a top-level package, and the baseline images already don't expose it that way; if anything in the installed dep tree imported it at import time, the baseline would already be broken. So removing pip/setuptools/wheel is probably safe. Kept anyway as a precaution, since something could still lazy-import pkg_resources only on a code path I haven't walked.
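The "already not importable" claim is easy to re-check from inside a running image with a small helper (hypothetical, not part of the repo):

```python
import importlib.util

def importable(name: str) -> bool:
    """True if `name` resolves as a top-level module in this environment."""
    return importlib.util.find_spec(name) is not None

# inside each baseline image, run: importable("pkg_resources")
# False on all of them would confirm the footnote above.
```

`find_spec` returns `None` for a missing top-level module instead of raising, so this is safe to run anywhere.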

What this PR removes (safe)

The rows marked "removed" (plus the useradd swap and the git drop) in the table above. None of these changes can affect what any tool does at runtime:

  • useradd instead of apt-get install adduser. Same end state, no apt index round-trip.
  • Drop micromamba install git in the 5 envs that have no git+... pip URL.
  • rm -rf /opt/conda/{pkgs,x86_64-conda-linux-gnu,include} and share/{doc,man,info}.
  • rm -f /opt/conda/lib/jvm/lib/src.zip on the two containers that ship a JDK (Clean, core_scripts). Pure source archive for IDE step-through; FastQC has no code path that reads it.
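Taken together, the removals amount to something like the following Dockerfile fragment (paths come from the table above; the user name is a placeholder, and the exact lines differ per container):

```dockerfile
# useradd instead of apt-get install adduser (user name is illustrative)
RUN useradd --create-home appuser

# strip the package cache, cross toolchain, headers, and docs
RUN rm -rf /opt/conda/pkgs \
           /opt/conda/x86_64-conda-linux-gnu \
           /opt/conda/include \
           /opt/conda/share/doc \
           /opt/conda/share/man \
           /opt/conda/share/info \
 # JDK source archive; only present on Clean and core_scripts
 && rm -f /opt/conda/lib/jvm/lib/src.zip
```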

A note on rm -rf /opt/conda/pkgs: pkgs/ is the conda package cache. After install, files are referenced from the live env (lib/, bin/, ...) via copies or hardlinks, and conda's "what's installed" metadata lives in /opt/conda/conda-meta/, not in pkgs/. So removing pkgs/ only breaks things if something in the live env still points back into it (a stray symlink, a hardcoded path in a wrapper script). Spot-checked across the Alignment / Clean / core_scripts baselines, both checks come back zero. Conda-forge's official miniforge Docker image does the equivalent (conda clean --force-pkgs-dirs --all --yes).
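The two spot-checks can be reproduced with something like the function below (a sketch; the `/opt/conda` prefix is the one these images use, adjust if yours differs). On a healthy env it prints nothing:

```shell
# check_pkgs_refs: print anything in a conda prefix that still references pkgs/
check_pkgs_refs() {
    prefix=$1
    # 1. stray symlinks in the live env whose target resolves into pkgs/
    find "$prefix" -path "$prefix/pkgs" -prune \
        -o -lname "$prefix/pkgs/*" -print 2>/dev/null
    # 2. wrapper scripts with a hardcoded pkgs/ path
    grep -rIl "$prefix/pkgs/" "$prefix/bin" 2>/dev/null || true
}

# inside an image: check_pkgs_refs /opt/conda    # expect no output
```

GNU find's `-lname` matches the symlink's target text (metacharacters match `/` too), and `-prune` skips `pkgs/` itself so the cache's internal links don't count as hits.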

Sizes:

| Container | Baseline | Slimmed | Saved |
| --- | --- | --- | --- |
| Alignment | 1.19 GB | 377 MB | -68% |
| Clean | 2.83 GB | 1.97 GB | -30% |
| ORF_analysis | 786 MB | 500 MB | -36% |
| core_scripts | 1.38 GB | 961 MB | -30% |
| mr_scripts | 802 MB | 517 MB | -36% |
| **Total** | **6.99 GB** | **4.32 GB** | **-38%** |

I --version-checked each slim image on its primary tool. Didn't find container-level tests in the repo beyond tests/e2e/test_e2e.py, which I haven't run.

This covers 5 of the 6 containers. containers/Consensus.dockerfile doesn't build cleanly on main today (gcc missing while pip-compiling biopython during the TrueConsense install); will file separately.

Looked at, didn't do

Each of these would save more, but each could break a lazy-import or read-only-fs code path I haven't ruled out:

  • Removing __pycache__ (~120 MB on python-heavy envs). Under singularity / read-only .sif, packages can't regenerate .pyc on first import; some packages that lazy-import a side-module hit a write-attempt and slow down or warn.
  • Removing per-package tests/ directories (~55 MB). A small number of python packages legitimately import from their own tests/ at runtime.
  • Removing share/locale (~15 MB). Tools using locale.gen may behave subtly differently.
  • Removing *.a / *.la static archives. Low risk, but assumes nothing JIT-links at runtime.

Bigger changes (not in this PR)

The unmarked rows above are where the biggest remaining bulk lives. None of them can be trimmed from a Dockerfile alone because they live inside conda-managed packages; each needs a patched recipe or a different upstream choice:

  • multiqc's opt-in extras (~200 MB on Clean): a patched bioconda recipe (or a private channel) that drops kaleido / tiktoken / spectra. Worth it only if multiqc's HTML report is what you actually use; AWS / parquet / static-image users would lose features.
  • Multi-SIMD libraries (polars, OpenBLAS, parasail) (~463 MB on Clean): each ships multiple CPU-target implementations and picks at runtime, which is what keeps the image runnable on roughly any x86-64 CPU. You could save a lot here if you drop support for CPUs you know nobody on your fleet runs; it depends on the lowest-spec machine you need to support. Polars ships two runtime variants (_polars_runtime_compat and _polars_runtime_32), each ~205 MB; deleting one is a one-liner. OpenBLAS and parasail need a recipe-level recompile.
  • fastqc's bundled JDK (~313 MB on Clean and ~370 MB on core_scripts, after src.zip is gone): either a stripped JRE built with jlink against the modules FastQC actually uses, shipped via a custom recipe, or a different conda recipe altogether. Note that the two containers ship different major JDK versions (openjdk 11 on Clean, openjdk 25 on core_scripts), driven by what each env yaml pulls in.
  • pip / setuptools / wheel removal: small win in MB, but pkg_resources is shipped by setuptools and is imported by various scientific Python packages. See footnote on the table.
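For the polars case, the one-liner is roughly the following (the site-packages path is assumed, and which of the two variants is safe to drop depends on your fleet's oldest CPU; verify the module names with `ls` in the image first):

```shell
# Assumed conda site-packages location; adjust per image.
SITE=/opt/conda/lib/python3.10/site-packages

# Drop one of the two polars runtime variants (here the compat build,
# leaving _polars_runtime_32 in place). Each is ~205 MB.
rm -rf "$SITE"/_polars_runtime_compat*
```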

Other findings while preparing this PR

Off-topic for the size question but turned up while poking at the container pipeline. Each could be something you already deal with; flagged in case any is news. Happy to file separately or skip:

  • Consensus.dockerfile appears not to build cleanly on main today. In my reproduction, pip install git+...TrueConsense fails with gcc: No such file or directory while compiling biopython from source. The published .sif may keep working because containers/build_containers.py:50-52 skips builds when the env hash is already in the upstream registry (if VersionHash in tags: ... continue).
  • tests/e2e/test_e2e.py may not import cleanly on a fresh install. The chain tests/e2e/test_e2e.py → ViroConstrictor.__main__ → workflow_executor → workflow_config ends in from snakemake.resources import DefaultResources at workflow_config.py:23. That class has been removed from snakemake.resources in newer snakemake (gone in v9.20, present in v9.5). workflow.smk declares min_version("9.5") and env.yml pins snakemake-minimal==9.5.*, so as long as you install from the pinned env you're fine; a pip install ViroConstrictor outside that env would resolve a newer snakemake and the import would break.
  • .github/workflows/build_and_test.yml is titled "build containers and run tests" but the steps build, zip, download apptainer artefacts, install ViroConstrictor, and end with the comment ## rest of the testing suite here. There's no step that actually runs anything against the result. Possible I missed a chained workflow.
  • pkg_resources is no longer importable as a top-level package in the current baseline images (setuptools >=82 dropped it). See footnote on the table above.

Going further: actual reproducibility

This PR is about size, not reproducibility, but they come up together. Absolute MB drifts between rebuilds (the appendix explains why); the saved-percent is what's stable. If you want to reduce that dep-version churn, the change is separate from this PR and roughly two steps:

  1. Pin FROM mambaorg/micromamba@sha256:... instead of :latest.
  2. Replace micromamba install -f /install.yml with micromamba create -f /lock.lock, where lock.lock is micromamba env export --explicit checked into the repo and bumped on dep updates.

That stops bioconda from re-solving transitive deps on every rebuild and keeps the base image fixed, so two rebuilds a month apart install the same package versions. Image bytes still won't be bit-identical (.pyc mtimes, BuildKit layer hashes, gzip/tar timestamps, the pip-install-from-git in Consensus) but those are smaller, separately-tractable issues.
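The two steps could look like this in a Dockerfile (the digest is elided and the lockfile name is a placeholder):

```dockerfile
# 1. pin the base image by digest, not by tag
FROM mambaorg/micromamba@sha256:...

# 2. install from an explicit lockfile checked into the repo,
#    generated once with: micromamba env export --explicit > env.lock
COPY env.lock /lock.lock
RUN micromamba create -y -f /lock.lock
```

An explicit lockfile lists exact package URLs, so no solver runs at build time.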

Happy to address feedback or split this PR into per-container commits.

Werner

