NERSC · brandongc · Nov 13, 2025
diff --git a/README.md b/README.md
@@ -3,6 +3,10 @@
 Podman-HPC (`podman-hpc`) is a wrapper script around the Pod Manager (`podman`) container engine,
 which provides HPC configuration and infrastructure for the Podman ecosystem at NERSC.
 
+## Diagrams
+
+See `docs/diagrams.md` for Mermaid diagrams illustrating the architecture, command paths, configuration precedence, hooks, shared-run, and migration flows.
+
 ## Configuration
 
 The wrapper can be configured through a configuration file and environment variables.

diff --git a/docs/diagrams.md b/docs/diagrams.md
@@ -0,0 +1,217 @@
+## Podman-HPC diagrams
+
+This document provides a high-level, visual overview of how Podman-HPC works and the major tasks and code paths. Diagrams are written in Mermaid and render in many markdown viewers.
+
+### 1) Architecture overview
+
+```mermaid
+flowchart LR
+  subgraph user_space[User Space]
+    U[User] --> CLI[podman-hpc CLI]
+  end
+
+  subgraph podhpc[Podman-HPC]
+    CLI --> SC[SiteConfig]
+    SC -->|read config YAML and templates and env| SC2[Config state]
+    SC2 -->|compute default args and module args| Ext[get_cmd_extensions]
+    CLI --> CP[call_podman or subcommands]
+    Ext -->|inject default flags and hooks and env| CP
+  end
+
+  CP --> P[Podman]
+
+  subgraph oci_hook[OCI Hook]
+    P -- prestart when annotation podman_hpc.hook_tool=true --> HT[hook_tool]
+    HT --> MD[modules.d YAML]
+    HT -->|copy/bind actions| FS[(Container FS)]
+    HT --> LDC[ldconfig]
+  end
+
+  P --> C[(Container lifecycle)]
+
+  %% Other flows
+  CLI --> MIG[migrate / rmsqi / pull]
+  MIG --> MU[MigrateUtils]
+  MU --> IS[(Image stores: overlay, layers, squash)]
+
+  CLI --> SR[shared-run]
+  SR --> MON[monitor process]
+  SR --> RPROC[run process]
+  RPROC --> P
+  P --> EXEC[exec processes]
+  EXEC --> C
+  MON -->|wait all tasks then remove container| P
+```
+
+### 2) CLI command structure
+
+```mermaid
+flowchart TB
+  A[podman-hpc] --> B[Default passthrough: call_podman]
+  A --> C[infohpc]
+  A --> D[migrate]
+  A --> E[rmsqi]
+  A --> F[pull]
+  A --> G[shared-run]
+  B -->|for any subcommand| P[Podman]
+```
+
+Key mappings:
+- `call_podman`: wraps any `podman` subcommand and injects SiteConfig-derived flags.
+- `infohpc`: prints version and resolved configuration.
+- `migrate`: squashes an image into the squash store.
+- `rmsqi`: removes a previously squashed image from the squash store.
+- `pull`: pulls image via `podman`, then migrates on success.
+- `shared-run`: starts one container per node and execs tasks into it.
+
+### 3) Configuration precedence and environment setup
+
+Source: `podman_hpc/siteconfig.py`
+
+```mermaid
+flowchart TB
+  start([Start]) --> def[Built-in defaults]
+  def --> cfchk{Config file exists?}
+  cfchk -- yes --> read[Read config yaml]
+  cfchk -- no --> envchk{Env overrides?}
+  read --> tmpl["Template expansion (template keys)"]
+  tmpl --> envchk
+  envchk -- yes --> envset["Apply PODMANHPC_* env vars"]
+  envchk -- no --> finalize[Finalize config attributes]
+  envset --> finalize
+  finalize --> mods["read_site_modules()"]
+  mods --> args["Compute default_args + default_*_args"]
+  args --> env["config_env(hpc=True): set XDG_CONFIG_HOME, drop XDG_RUNTIME_DIR"]
+  env --> done([Config ready])
+```
+
+Notable outputs:
+- Default flags for `run`, `build`, `pull`, `images`.
+- Hooks enabled: `--hooks-dir`, `--annotation podman_hpc.hook_tool=true`, and `--env PODMANHPC_MODULES_DIR=...`.
+- `additionalimagestore`: includes squash dir and optional stores.
+
+### 4) OCI hook execution sequence
+
+Source: `podman_hpc/configure_hooks.py`, `podman_hpc/hook_tool.py`
+
+```mermaid
+sequenceDiagram
+  participant Podman
+  participant Hook as hook_tool (prestart)
+  participant Mods as modules.d (YAML)
+  participant FS as Container FS
+
+  Podman->>Hook: Invoke prestart hook (annotation=true)
+  Hook->>Hook: read config.json, merge env, read modules.d
+  Hook->>Hook: setns(pid, mnt)
+  Hook->>Hook: chroot(/)
+  Hook->>Mods: load module defs (copy/bind rules, env keys)
+  loop for each module enabled via env
+    Hook->>FS: perform copy/bind per rule (resolve src/dest with globs)
+  end
+  Hook->>Hook: chroot(root_path)
+  Hook->>FS: ldconfig
+  Hook-->>Podman: return (continue container init)
+```
+
+Module YAML keys used by hook:
+- `name`, `env` (enable via env var)
+- `copy`: file/dir copy rules
+- `bind`: bind-mount rules
+
+### 5) shared-run workflow (per node)
+
+Source: `podman_hpc/podman_hpc.py::_shared_run`
+
+```mermaid
+sequenceDiagram
+  participant User
+  participant PH as podman-hpc
+  participant Mon as monitor(Process)
+  participant Run as shared_run_exec(Process)
+  participant Podman
+  participant Cont as Container
+
+  User->>PH: podman-hpc shared-run [options] IMAGE CMD...
+  PH->>PH: parse options and filter valid run/exec flags
+  PH->>Mon: start monitor(sock, ntasks, container_name)
+  PH->>Run: start run process (podman run --rm -d --name ...)
+  PH->>Podman: wait until container exists + running (poll with backoff)
+  Note over PH: compute wait_poll_interval / wait_timeout based on ntasks
+  PH->>Podman: podman exec ... CMD (PMI_FD handled if present)
+  Podman->>Cont: execute user command(s)
+  PH->>Mon: send_complete(socket, localid)
+  Mon->>Podman: kill container and remove container
+  PH-->>User: exit with exec return code
+```
+
+PMI handling:
+- If `PMI_FD` is set, dup to fd 3 and pass via `--preserve-fds 1`.
+
+### 6) Migrate-to-scratch workflow
+
+Source: `podman_hpc/migrate2scratch.py`
+
+```mermaid
+flowchart TB
+  start(["migrate image"]) --> init[_lazy_init - resolve src-dst stores]
+  init --> refresh[initialize dst storage then refresh src and dst]
+  refresh --> info{image found?}
+  info -- no --> abort[[return False]]
+  info -- yes --> layers[get image layers]
+  layers --> dup{dst has image id?}
+  dup -- yes --> done[[previously migrated return True]]
+  dup -- no --> dtag[drop image tags]
+  dtag --> copyi[copy image info]
+  copyi --> copyl[copy required layers]
+  copyl --> overlay[copy overlay data]
+  overlay --> squash[generate squash file]
+  squash -- fail --> abort2[[return False]]
+  squash -- ok --> record[add image record]
+  record --> done[[return True]]
+```
+
+### 7) Module processing during command extension
+
+Source: `podman_hpc/siteconfig.py::get_cmd_extensions`
+
+```mermaid
+flowchart TB
+  start([Start get_cmd_extensions]) --> base[cmds = default_args]
+  base --> subcmd{subcommand?}
+  subcmd -- run --> runA[+ default_run_args]
+  subcmd -- build --> buildA[+ default_build_args]
+  subcmd -- pull --> pullA[+ default_pull_args]
+  subcmd -- images --> imgsA[+ default_images_args]
+  subcmd -- other --> noop[no-op]
+  runA --> pick
+  buildA --> pick
+  pullA --> pick
+  imgsA --> pick
+  noop --> pick
+  pick[Identify enabled modules from parsed CLI flags]
+  pick --> deps[Warn if required deps not enabled]
+  deps --> conf[Warn on conflicts]
+  conf --> ext[Append module additional_args; set env; set shared_run flag]
+  ext --> loglvl{log_level set?}
+  loglvl -- yes --> plus[+ --log-level LEVEL]
+  loglvl -- no --> out
+  plus --> out([Return cmds])
+  out --> endNode([End])
+```
+
+Enabled module logic:
+- A module is enabled when its `cli_arg` flag is present for the subcommand.
+- Adds `additional_args`, sets `-e <ENV>=1`, and may set `shared_run=True`.
+- Warnings are printed for missing `depends_on` and conflicting modules.
+
+---
+
+References:
+- CLI and shared-run: `podman_hpc/podman_hpc.py`
+- Config: `podman_hpc/siteconfig.py`
+- Hook configuration: `podman_hpc/configure_hooks.py`
+- Hook runtime: `podman_hpc/hook_tool.py`
+- Migration utilities: `podman_hpc/migrate2scratch.py`
+
+
diff --git a/podman_hpc/__init__.py b/podman_hpc/__init__.py
@@ -0,0 +1,8 @@
+"""Podman-HPC Python package.
+
+This package provides CLI integration and utilities for running Podman in
+HPC environments, including site configuration, hook configuration, and
+image migration helpers.
+"""
+
+
diff --git a/podman_hpc/argparse_exit_on_error.py b/podman_hpc/argparse_exit_on_error.py
@@ -1,12 +1,35 @@
-from argparse import *
+"""Compatibility wrapper for argparse exit-on-error behavior.
+
+Python 3.9's argparse does not support the `exit_on_error` keyword. This module
+provides a drop-in replacement that can be imported in place of `argparse` when
+running on older Python versions. It exposes an `ArgumentParser` subclass that
+respects the `exit_on_error` flag and suppresses `error()` behavior when
+requested.
+"""
+
+from argparse import *  # noqa: F401,F403 - re-export argparse API for consumers
 from argparse import ArgumentParser as _ArgumentParser
 
 
 class ArgumentParser(_ArgumentParser):
+    """ArgumentParser that can opt-out of exiting on parse errors.
+
+    When constructed with `exit_on_error=False`, the parser will not call the
+    default `error()` behavior (which prints a message and exits the program).
+    This enables callers to handle parse errors programmatically using
+    `parse_known_args` or by catching exceptions.
+    """
+
     def __init__(self, *args, exit_on_error=True, **kwargs):
-        self._exit_on_error = exit_on_error
+        """Initialize the parser.
+
+        Parameters
+        - exit_on_error: if False, suppress `error()` calls to avoid exiting.
+        """
+        self._exit_on_error = bool(exit_on_error)
         super().__init__(*args, **kwargs)
 
     def error(self, *args, **kwargs):
+        """Override error to respect `exit_on_error` flag."""
         if self._exit_on_error:
             super().error(*args, **kwargs)