Audience: Operators, bioinformaticians running workflows, developers
Scope: Configuration management, operational guidelines, maintenance
- Configuration Management
- Environment Setup
- Execution Strategies
- Output Management
- Performance Tuning
- Monitoring and Observability
- Maintenance
- Troubleshooting
- Production Deployment
## Configuration Management

METAINFORMANT uses a layered configuration system:
```mermaid
flowchart TD
    A[Default Values<br/>in code] --> B[Config Template<br/>config/defaults.yaml]
    B --> C[Environment-Specific<br/>config/prod.yaml]
    C --> D[User Overrides<br/>~/.metainformant/config.yaml]
    D --> E[Environment Variables<br/>RNA_MAX_WORKERS=16]
    E --> F[Effective Config<br/>used by workflow]
```
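A sketch of what this layering implies: a later layer overrides an earlier one, with nested keys merged recursively. The `merge_layers` helper below is illustrative, not part of the package:

```python
from typing import Any

def merge_layers(*layers: dict[str, Any]) -> dict[str, Any]:
    """Merge config layers; later layers win, nested dicts merge recursively."""
    merged: dict[str, Any] = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge_layers(merged[key], value)
            else:
                merged[key] = value
    return merged

defaults = {"resources": {"max_workers": 8}, "logging": {"level": "INFO"}}
user_overrides = {"resources": {"max_workers": 16}}
effective = merge_layers(defaults, user_overrides)
assert effective == {"resources": {"max_workers": 16}, "logging": {"level": "INFO"}}
```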
```yaml
# config/pipeline.yaml
pipeline:
  name: "RNA-seq Analysis"
  version: "1.0.0"

resources:
  max_workers: 8
  memory_per_worker_gb: 4
  disk_space_threshold_gb: 100

input:
  sample_manifest: "data/samples.tsv"
  reference:
    genome: "references/GRCh38.fa"
    annotation: "references/GRCh38.gtf"

output:
  base_dir: "output/"
  formats: ["tsv", "parquet", "json"]

stages:
  download:
    enabled: true
    retry_attempts: 3
    timeout_seconds: 600
  quantify:
    tool: "kallisto"
    bootstrap_samples: 100

logging:
  level: "INFO"
  file: "output/logs/pipeline.log"
```

Environment variables take the highest precedence:

```bash
export METAINFORMANT_CONFIG_PATH="./config/custom.yaml"
export RNA_MAX_WORKERS=16             # Override worker count
export METAINFORMANT_LOG_LEVEL=DEBUG  # Verbose logging
```

Convention: `{MODULE}_{KEY}` for module-specific overrides, `METAINFORMANT_*` for global settings.
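A sketch of how this convention could be applied to a module's config; illustrative only, since the package's actual resolution logic may differ:

```python
import os

def apply_env_overrides(config: dict, module: str) -> dict:
    """Overlay {MODULE}_{KEY} environment variables onto a module config."""
    prefix = f"{module.upper()}_"
    for name, raw in os.environ.items():
        if name.startswith(prefix):
            key = name[len(prefix):].lower()  # RNA_MAX_WORKERS -> max_workers
            config[key] = int(raw) if raw.isdigit() else raw
    return config

# With RNA_MAX_WORKERS=16 exported, max_workers becomes 16
config = apply_env_overrides({"max_workers": 8}, module="rna")
```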
```python
from metainformant.core.utils.config import discover_config_files

configs = discover_config_files(repo_root=".", domain="rna")
# Returns a list of matching config files sorted by precedence
```

## Environment Setup

Always use `uv` — never `pip` directly:
```bash
# Create an isolated environment
uv venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

# Install the project with all optional dependencies
uv sync --all-extras

# Add a new package
uv add biopython
uv add --dev pytest-cov

# Update the lockfile after changes
uv lock
```

Package groups (defined in `pyproject.toml`):
| Group | Purpose | Install |
|---|---|---|
| `rna` | RNA-seq analysis (amalgkit) | `uv sync --group rna` |
| `gwas` | GWAS tools (PLINK, SAIGE) | `uv sync --group gwas` |
| `viz` | Visualization (matplotlib, seaborn, plotly) | `uv sync --group viz` |
| `dev` | Testing, linting, docs | `uv sync --group dev` |
| `all` | Everything (default) | `uv sync --all-extras` |
Requirement: Python 3.11+ (the codebase uses `from __future__ import annotations`).

Check:

```bash
python --version  # must be >= 3.11
```

Some modules require external tools:
| Tool | Package | Module |
|---|---|---|
| kallisto | `kallisto` (binary) | `rna` |
| SRA toolkit | `sratoolkit` (NCBI) | `rna.retrieval` |
| PLINK | `plink1.9` / `plink2` | `gwas` |
| BWA | `bwa` (binary) | `dna` |
| Docker (optional) | docker daemon | `cloud` |
Install system packages via conda/apt/brew:

```bash
conda install -c bioconda kallisto plink bwa
# or
apt-get install plink bwa
```

## Execution Strategies

```bash
# Run a specific workflow
python -m metainformant.rna.workflow --config config/rna_pipeline.yaml

# Or use the entry point script
rna-pipeline --samples samples.tsv --output output/
```

```python
from pathlib import Path

from metainformant.core.engine.workflow_manager import BasePipelineManager, PipelinePhase
from metainformant.rna.engine.workflow import RNAWorkflowManager

# Programmatic construction
workflow = RNAWorkflowManager(
    work_dir=Path("output/rna_run1"),
    config={"samples": "samples.tsv"},
    max_threads=8,
)
results = workflow.run()

# Inspect results
successful = [sid for sid, ok in results.items() if ok]
failed = [sid for sid, ok in results.items() if not ok]
```

```bash
# Launch the interactive workflow selector
python -m metainformant.menu
```

```python
# Or programmatically
from metainformant.menu import WorkflowMenu

menu = WorkflowMenu()
choice = menu.select_workflow()
menu.run_workflow(choice)
```

## Output Management

```text
output/
  {module}/
    {phase}/
      {item_id}.{ext}
    logs/
      pipeline_20250426.log
    reports/
      summary.tsv
      qc_plots/
    checkpoint.json
  shared/
    references/
    tmp/
  archive/
    2025-04-26/
```
Automate cleanup of stale runs:
```python
import logging
import shutil
from datetime import datetime, timedelta
from pathlib import Path

logger = logging.getLogger(__name__)

def cleanup_old_runs(output_dir: Path, retention_days: int = 30) -> None:
    """Remove run directories older than the retention window."""
    cutoff = datetime.now() - timedelta(days=retention_days)
    for run_dir in output_dir.glob("runs/*"):
        if run_dir.is_dir():
            mtime = datetime.fromtimestamp(run_dir.stat().st_mtime)
            if mtime < cutoff:
                shutil.rmtree(run_dir)
                logger.info("Removed old run: %s", run_dir)
```

Schedule via cron:
```cron
0 2 * * * /usr/bin/python /path/to/cleanup.py
```

Compress intermediate files not currently in use:
```python
import gzip
import shutil
from pathlib import Path

def compress_directory(dir_path: Path) -> None:
    """Gzip every JSON file under dir_path and delete the originals."""
    for file in dir_path.rglob("*.json"):
        with open(file, "rb") as f_in, gzip.open(f"{file}.gz", "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        file.unlink()  # delete the original
```

## Performance Tuning

Rule of thumb:
| Resource | Task Type | Formula |
|---|---|---|
| CPU | CPU-bound (alignment, quantification) | `max(1, os.cpu_count() - 1)` |
| I/O | Downloads, API calls | `min(os.cpu_count() * 4, 32)` |
| Memory | Memory-heavy (large matrix ops) | `min(memory_available_gb // per_worker_gb, 8)` |
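The same rules of thumb written out as code. This is a sketch; `psutil` is assumed available for the memory probe and is not necessarily a project dependency:

```python
import os

import psutil  # assumption: psutil is installed for the memory probe

def suggested_workers(task_type: str, per_worker_gb: float = 4.0) -> int:
    cpus = os.cpu_count() or 1
    if task_type == "cpu":
        return max(1, cpus - 1)   # leave one core for the OS
    if task_type == "io":
        return min(cpus * 4, 32)  # oversubscribe while workers wait on I/O
    # memory-heavy: bound by available RAM, capped at 8
    available_gb = psutil.virtual_memory().available / 1e9
    return max(1, min(int(available_gb // per_worker_gb), 8))
```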
Use the helper function:

```python
from metainformant.core.execution.parallel import resource_aware_workers

workers = resource_aware_workers(
    task_type="io",
    memory_per_worker_mb=2048,
)
```
- Stream large files: use iterators, not `read()` all at once:

  ```python
  with io.open_text_auto("large.jsonl.gz") as f:
      for line in f:  # line by line, not load_json()
          process(line)
  ```

- Delete large objects explicitly when no longer needed:

  ```python
  del large_dataframe
  import gc; gc.collect()
  ```

- Use chunking: process 1,000 records at a time instead of 1 million.
- SSD preferred: keep `output/` on a fast SSD (NVMe) for temp files.
- RAM disk for temp: if enough RAM is available:

  ```bash
  mount -t tmpfs -o size=50G tmpfs /mnt/ramdisk
  METAINFORMANT_TEMP_DIR=/mnt/ramdisk python -m ...
  ```

- Compress intermediates: gzip `.json` files after use; decompress on read.
- Parallel downloads: use `max_threads` > 8 for I/O-bound phases.
- Retry with backoff: avoid hammering flaky endpoints (see the sketch after this list).
- Cache DNS: pre-resolve hosts if possible.
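A minimal retry-with-backoff wrapper using only the standard library; the package may ship its own retry helper, so treat this as the pattern rather than the API:

```python
import random
import time

def retry_with_backoff(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(); on failure sleep base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the last error
            time.sleep(base_delay * 2**attempt + random.uniform(0, 1))

# Usage (download is a placeholder): retry_with_backoff(lambda: download(url))
```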
## Monitoring and Observability

```python
from metainformant.core.utils.logging import setup_logging

setup_logging(
    level="INFO",   # or "DEBUG" for verbose output
    format="json",  # structured JSON logs for ingestion
    log_file="output/logs/pipeline.log",
)
```

Log levels:

- `DEBUG`: detailed per-item state transitions (verbose)
- `INFO`: phase start/end, item completion
- `WARNING`: recoverable issues (missing optional data)
- `ERROR`: item failures, phase failures
- `CRITICAL`: pipeline abort, system-level errors
Default: `BasePipelineManager` auto-starts the TUI.

Disable it (for logging-only runs):

```python
manager = BasePipelineManager(phases, use_tui=False)
```

Non-interactive (CI/CD):

```bash
METAINFORMANT_NO_TUI=1 python -m metainformant.rna.workflow ...
```

If `prometheus_client` is installed:

```python
from metainformant.monitoring import start_metrics_server

start_metrics_server(port=8000)  # exposes /metrics
```

Metrics include:

- `pipeline_items_total`
- `phase_duration_seconds`
- `phase_failures_total`
- `bytes_downloaded_total`
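For custom metrics beyond the built-ins, `prometheus_client` primitives can be used directly. A sketch; the metric and label names below are illustrative, mirroring the built-in names listed above:

```python
from prometheus_client import Counter, Histogram

# Illustrative metrics; label names are assumptions
ITEMS_TOTAL = Counter("pipeline_items_total", "Items processed", ["phase", "status"])
PHASE_DURATION = Histogram("phase_duration_seconds", "Wall time per phase", ["phase"])

with PHASE_DURATION.labels(phase="quantify").time():
    ...  # run the phase here
ITEMS_TOTAL.labels(phase="quantify", status="ok").inc()
```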
## Maintenance

Keep dependencies current:

```bash
# Check for outdated packages
uv pip list --outdated

# Update a specific package
uv lock --upgrade-package biopython

# Update all packages
uv lock --upgrade
uv sync
```

Frequency: monthly, or before major analysis runs.
Configure `logging.handlers.RotatingFileHandler` in `setup_logging()`:

```python
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler(
    "output/logs/pipeline.log",
    maxBytes=100 * 1024 * 1024,  # 100 MB
    backupCount=10,
)
```

Cron cleanup:

```bash
# Find and compress old logs
find output/logs/ -name "*.log" -mtime +30 -exec gzip {} \;
```

- Keep the `config/` directory under version control
- Tag releases with a config hash: `git tag -a "run-20250426-$(sha256sum config.yaml | cut -d' ' -f1)"` (`cut` strips the filename from the `sha256sum` output)
- Document config changes in `docs/CHANGELOG.md`
If `metainformant.core.data.db` is active:

```python
# Vacuum to reclaim space
db.execute("VACUUM FULL;")

# Analyze to refresh query-planner statistics
db.execute("ANALYZE;")

# Backup
db.backup("backup.sql")
```

Schedule monthly for production deployments.
## Troubleshooting

| Issue | Symptoms | Diagnosis | Fix |
|---|---|---|---|
| Pipeline hangs | No TUI updates, CPU idle | Handler blocking the main thread | Move I/O to an executor, add timeouts |
| Out of memory | Process killed, system swap thrashing | Too many parallel workers | Reduce `max_threads`, use batch processing |
| Permission denied | `OSError: [Errno 13]` | Output dir not writable | `chmod -R u+w output/` or choose a different dir |
| Disk full | `OSError: [Errno 28]` | No space left | Clean old runs, increase storage |
| Slow downloads | 1 MB/s vs expected 100 MB/s | Network saturated, too few workers | Increase `max_threads` for the I/O phase |
| Checksum mismatch | `ValueError: Checksum mismatch` | Corrupted download or mirror issue | Delete the partial file, re-download, check the source |
| Module import error | `ModuleNotFoundError` | Dependencies not installed | `uv sync --all-extras` or install the specific group |
| TUI garbled output | Strange characters, broken layout | stdout is not a TTY (piped) | Disable the TUI: `use_tui=False` or `METAINFORMANT_NO_TUI=1` |
- Check `output/core/logs/*.log` for ERROR/CRITICAL lines
- Verify the `output/` directory structure exists and is writable
- Run with `METAINFORMANT_LOG_LEVEL=DEBUG` for a verbose trace
- Inspect `checkpoint.json` (if it exists) for partial state
- Test the handler in isolation with a single item
- Confirm external tools are installed (kallisto, plink): `which kallisto`
- Validate the config schema: `python -m metainformant.core.execution.workflow --validate config.yaml`
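Several of these checks can be scripted as a preflight step. A standard-library sketch; the paths and tool names come from the examples above:

```python
import os
import shutil
from pathlib import Path

def preflight(output_dir: Path = Path("output"),
              tools: tuple[str, ...] = ("kallisto", "plink")) -> list[str]:
    """Return human-readable problems; an empty list means ready to run."""
    problems: list[str] = []
    if not output_dir.exists():
        problems.append(f"{output_dir} does not exist")
    elif not os.access(output_dir, os.W_OK):
        problems.append(f"{output_dir} is not writable")
    for tool in tools:
        if shutil.which(tool) is None:  # same check as `which kallisto`
            problems.append(f"{tool} not found on PATH")
    return problems

for problem in preflight():
    print("PREFLIGHT:", problem)
```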
Document the exact environment:

```bash
# Capture the environment
uv export --all-extras > requirements.txt
git rev-parse HEAD > commit.txt
python -c "import metainformant; print(metainformant.__version__)" > version.txt

# Bundle
tar czf run_metadata.tar.gz requirements.txt commit.txt version.txt config/
```

## Production Deployment

```ini
# /etc/systemd/system/metainformant-rna.service
[Unit]
Description=METAINFORMANT RNA Pipeline
After=network.target

[Service]
Type=simple
User=bioinformatics
WorkingDirectory=/home/bioinformatics/projects/analysis
Environment="METAINFORMANT_LOG_LEVEL=INFO"
ExecStart=/home/bioinformatics/.venv/bin/python -m metainformant.rna.workflow --config /home/bioinformatics/configs/rna.yaml
Restart=on-failure
RestartSec=300

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl enable metainformant-rna
sudo systemctl start metainformant-rna
sudo journalctl -u metainformant-rna -f  # follow logs
```

```dockerfile
FROM python:3.11-slim

RUN apt-get update && apt-get install -y \
    kallisto \
    plink2 \
    && rm -rf /var/lib/apt/lists/*

# uv must be present in the image for the sync and run steps below
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

WORKDIR /app
COPY . .
RUN uv sync --all-extras

ENTRYPOINT ["uv", "run", "python", "-m", "metainformant.rna.workflow"]
CMD ["--config", "/config/pipeline.yaml"]
```

Build and run:
```bash
docker build -t metainformant/rna .
docker run -v /data:/data -v /config:/config metainformant/rna --config /config/pipeline.yaml
```

```bash
#!/bin/bash
#SBATCH --job-name=metainformant_rna
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=24:00:00
#SBATCH --output=slurm-%j.out

module load python/3.11
source /path/to/venv/bin/activate
python -m metainformant.rna.workflow --config "$1"
```

Submit:

```bash
sbatch run_rna.slurm configs/experiment1.yaml
```

Modify config at runtime based on data characteristics:
```python
from pathlib import Path

def adaptive_handler(manager, items):
    # Examine the first item to determine parameters
    sample = items[0]
    size = Path(sample.metadata["fastq"]).stat().st_size
    if size > 10e9:  # > 10 GB
        manager.config["threads"] = 16
    else:
        manager.config["threads"] = 8
    # Proceed with the adapted config
    process_batch(items, manager.config["threads"])
```

Isolate runs per user/project:
```python
import getpass
from pathlib import Path

user = getpass.getuser()
output_base = Path(f"/shared/runs/{user}/{run_id}/")  # run_id supplied by the caller
user_config = load_mapping_from_file(f"/home/{user}/.metainformant/config.yaml")
```

Never store credentials in config files. Use:
- Environment variables: `export NCBI_API_KEY=xxx`
- Secret managers: HashiCorp Vault, AWS Secrets Manager
- Protected files: `~/.netrc` with restrictive permissions (`chmod 600`)
Access:

```python
import os

api_key = os.environ.get("NCBI_API_KEY")
if not api_key:
    raise ValueError("Set the NCBI_API_KEY environment variable")
```

When adding a new domain module that participates in coordination (a skeleton handler follows the checklist):
- Define clear `PipelinePhase` handlers or a config-driven schema
- Use `BasePipelineManager` if more than two phases are needed
- Document the metadata keys produced and consumed
- Implement per-item try/except calling `mark_failed()`
- Write outputs atomically via `io.dump_json()`
- Add validation at the start of each handler (inputs and paths exist)
- Verify idempotency (re-running doesn't duplicate or corrupt)
- Integrate with the TUI (`manager.ui.update()`)
- Add a module rule in `docs/agents/rules/{module}.md`
- Cross-link from the `docs/index.md` matrix
- Provide an example configuration in the module README
- Safety — detailed error handling
- Communication Protocols — metadata sharing
- Orchestration — API reference
- Multi-Agent Workflows — patterns in action