METAINFORMANT

Comprehensive bioinformatics toolkit for multi-omic analysis

Overview

METAINFORMANT provides production-ready bioinformatics analysis across genomics, transcriptomics, proteomics, epigenomics, and systems biology. It is built with Python 3.11+ and uses uv for fast dependency management.

At a Glance

Metric	Value
Modules	28 specialized analysis modules
Python Files	603 implementation files
Plot Types	70+ visualization methods
Documentation	310+ README files

Core Capabilities

Domain	Features
DNA	Sequences, alignment, phylogenetics, population genetics, variant analysis
RNA	Amalgkit integration, ENA/SRA downloads, Kallisto quantification, industrial-scale pipelines (8,300+ samples across 28 species)
GWAS	Association testing, fine-mapping, visualization, complete GWAS pipelines
eQTL	Integration of GWAS variants and Amalgkit RNA-seq expression data
Multi-omics	Cross-omic integration, joint PCA, correlation analysis
ML	Classification, regression, feature selection, LLM integration
Visualization	Manhattan plots, heatmaps, networks, animations, publication-ready output

System Architecture

flowchart TB
    subgraph coreInfra["Core Infrastructure"]
        CORE["Core Utilities"]
    end

    subgraph molecular["Molecular Analysis"]
        DNA["DNA Analysis"]
        RNA["RNA Analysis"]
        PROT["Protein Analysis"]
        EPI["Epigenome Analysis"]
    end

    subgraph statsML["Statistical and ML"]
        GWAS["GWAS Analysis"]
        MATH["Mathematical Biology"]
        ML["Machine Learning"]
        INFO["Information Theory"]
    end

    subgraph systems["Systems Biology"]
        NET["Network Analysis"]
        MULTI["Multi-Omics Integration"]
        SC["Single-Cell Analysis"]
        SIM["Simulation"]
    end

    subgraph annotation["Annotation and Metadata"]
        ONT["Ontology"]
        PHEN["Phenotype Analysis"]
        ECO["Ecology"]
        LE["Life Events"]
    end

    subgraph utilities["Utilities"]
        QUAL["Quality Control"]
        VIZ["Visualization"]
    end

    subgraph specialized["Specialized Domains"]
        LR["Long-Read Sequencing"]
        METAG["Metagenomics"]
        SV["Structural Variants"]
        SPATIAL["Spatial Transcriptomics"]
        PHARMA["Pharmacogenomics"]
        METAB["Metabolomics"]
        MENU["Menu System"]
        CLOUD["Cloud Deployment"]
    end

    CORE --> DNA
    CORE --> RNA
    CORE --> PROT
    CORE --> EPI
    CORE --> GWAS
    CORE --> MATH
    CORE --> ML
    CORE --> INFO
    CORE --> NET
    CORE --> MULTI
    CORE --> SC
    CORE --> SIM
    CORE --> ONT
    CORE --> PHEN
    CORE --> ECO
    CORE --> LE
    CORE --> QUAL
    CORE --> VIZ
    CORE --> LR
    CORE --> METAG
    CORE --> SV
    CORE --> SPATIAL
    CORE --> PHARMA
    CORE --> METAB
    CORE --> MENU
    CORE --> CLOUD

Data Flow and Integration Architecture

graph TD
    A["Raw Biological Data"] --> B["Data Ingestion"]
    B --> C{Data Type}

    C -->|DNA| D["DNA Module"]
    C -->|RNA| E["RNA Module"]
    C -->|Protein| F["Protein Module"]
    C -->|Epigenome| G["Epigenome Module"]
    C -->|Phenotype| H["Phenotype Module"]
    C -->|Environmental| I["Ecology Module"]

    D --> J["Quality Control"]
    E --> J
    F --> J
    G --> J
    H --> J
    I --> J

    J --> K["Core Processing"]
    K --> L{Analysis Type}

    L -->|Statistical| M["GWAS Module"]
    L -->|ML| N["ML Module"]
    L -->|Information| O["Information Module"]
    L -->|Networks| P["Networks Module"]
    L -->|Systems| Q["Multi-Omics Module"]
    L -->|Singlecell| R["Single-Cell Module"]
    L -->|Simulation| S["Simulation Module"]

    M --> T["Results Integration"]
    N --> T
    O --> T
    P --> T
    Q --> T
    R --> T
    S --> T

    T --> U["Visualization"]
    U --> V["Publication Figures"]
    V --> W["Scientific Insights"]

    subgraph "Primary Data Types"
        X["Genomic"] -.-> D
        Y["Transcriptomic"] -.-> E
        Z["Proteomic"] -.-> F
        AA["Epigenetic"] -.-> G
    end

    subgraph "Analysis Workflows"
        BB["Population Genetics"] -.-> M
        CC["Feature Selection"] -.-> N
        DD["Mutual Information"] -.-> O
        EE["Community Detection"] -.-> P
        FF["Joint PCA"] -.-> Q
        GG["Trajectory Analysis"] -.-> R
    end

    subgraph "Output Formats"
        HH["Manhattan Plots"] -.-> V
        II["Heatmaps"] -.-> V
        JJ["Network Graphs"] -.-> V
        KK["Animations"] -.-> V
    end

Multi-Omic Integration Pipeline

graph TD
    A["Multi-Omic Datasets"] --> B["Sample Alignment"]
    B --> C["Batch Effect Correction"]

    C --> D{Integration Strategy}
    D -->|Early| E["Concatenated Matrix"]
    D -->|Late| F["Separate Models"]
    D -->|Intermediate| G["Meta-Analysis"]

    E --> H["Joint Dimensionality Reduction"]
    F --> I["Individual Analysis"]
    G --> J["Result Integration"]

    H --> K["Unified Clustering"]
    I --> L["Individual Clustering"]
    J --> M["Consensus Clustering"]

    K --> N["Functional Enrichment"]
    L --> N
    M --> N

    N --> O["Pathway Analysis"]
    O --> P["Network Construction"]

    P --> Q["Biological Interpretation"]
    Q --> R["Systems Biology Insights"]

    subgraph "Omic Layers"
        S["Genomics"] -.-> A
        T["Transcriptomics"] -.-> A
        U["Proteomics"] -.-> A
        V["Metabolomics"] -.-> A
        W["Epigenomics"] -.-> A
    end

    subgraph "Integration Methods"
        X["MOFA"] -.-> H
        Y["Joint PCA"] -.-> H
        Z["Similarity Networks"] -.-> H
    end

    subgraph "Biological Outputs"
        AA["Gene Modules"] -.-> Q
        BB["Regulatory Networks"] -.-> Q
        CC["Disease Pathways"] -.-> Q
        DD["Biomarkers"] -.-> Q
    end
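
The "Early" branch of the diagram above (sample alignment, then a concatenated matrix) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the `multiomics` module's implementation: `shared_samples`, `zscore_columns`, and `early_integration` are hypothetical names, and per-feature z-scoring stands in for whatever scaling or batch correction the real pipeline applies.

```python
from statistics import mean, pstdev

Layer = dict[str, list[float]]  # sample ID -> feature vector (hypothetical layout)


def shared_samples(layers: dict[str, Layer]) -> list[str]:
    """Sample alignment: keep only IDs present in every omic layer."""
    return sorted(set.intersection(*(set(layer) for layer in layers.values())))


def zscore_columns(rows: list[list[float]]) -> list[list[float]]:
    """Scale each feature to mean 0 and unit variance; constant features become 0."""
    scaled = []
    for column in zip(*rows):
        mu, sd = mean(column), pstdev(column)
        scaled.append([(v - mu) / sd if sd > 0 else 0.0 for v in column])
    return [list(row) for row in zip(*scaled)]


def early_integration(layers: dict[str, Layer]) -> tuple[list[str], list[list[float]]]:
    """Concatenate the scaled features of all layers into one samples x features matrix."""
    samples = shared_samples(layers)
    blocks = [zscore_columns([layer[s] for s in samples]) for layer in layers.values()]
    matrix = [[v for block in blocks for v in block[i]] for i in range(len(samples))]
    return samples, matrix


layers = {
    "transcriptomics": {"s1": [1.0, 10.0], "s2": [3.0, 10.0], "s3": [5.0, 7.0]},
    "proteomics": {"s2": [0.2], "s1": [0.4], "s4": [0.9]},
}
samples, matrix = early_integration(layers)
print(samples)         # only s1 and s2 occur in both layers
print(len(matrix[0]))  # 2 transcript features + 1 protein feature
```

Scaling each layer before concatenation keeps a layer with large raw values from dominating a downstream joint PCA or clustering step.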

Quality Assurance Framework

graph TD
    A["Data Processing Pipeline"] --> B["Input Validation"]
    B --> C["Type Checking"]
    C --> D["Schema Validation"]

    D --> E["Processing Logic"]
    E --> F["Error Handling"]
    F --> G["Recovery Mechanisms"]

    G --> H["Output Validation"]
    H --> I["Result Verification"]
    I --> J["Quality Metrics"]

    J --> K{Acceptable Quality?}
    K -->|Yes| L["Pipeline Success"]
    K -->|No| M["Quality Issues"]

    M --> N["Diagnostic Analysis"]
    N --> O["Error Classification"]

    O --> P{Recoverable?}
    P -->|Yes| Q["Data Correction"]
    P -->|No| R["Pipeline Failure"]

    Q --> E
    L --> S["Validated Results"]
    R --> T["Error Reporting"]

    subgraph "Validation Layers"
        U["Data Integrity"] -.-> B
        V["Business Logic"] -.-> E
        W["Statistical Validity"] -.-> H
    end

    subgraph "Quality Controls"
        X["Unit Tests"] -.-> F
        Y["Integration Tests"] -.-> I
        Z["Performance Benchmarks"] -.-> J
    end

    subgraph "Error Types"
        AA["Data Errors"] -.-> O
        BB["Logic Errors"] -.-> O
        CC["System Errors"] -.-> O
        DD["External Errors"] -.-> O
    end
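
The control flow in the diagram above (validate the input, process, score the result, then either accept, correct and retry, or fail) can be sketched as a small loop. This is an illustration only: `run_with_quality_gate`, `QualityError`, `PipelineResult`, and the 0.9 threshold are hypothetical and are not taken from the METAINFORMANT code base.

```python
from dataclasses import dataclass
from typing import Callable


class QualityError(Exception):
    """Raised when a run cannot reach acceptable quality (the 'Pipeline Failure' node)."""


@dataclass
class PipelineResult:
    values: list[float]
    quality: float  # fraction of input entries that were usable, in [0, 1]
    attempts: int


def run_with_quality_gate(
    raw: list[object],
    correct: Callable[[list[object]], list[object]],
    min_quality: float = 0.9,
    max_attempts: int = 2,
) -> PipelineResult:
    """Validate, process, score, and retry after a data correction."""
    if not isinstance(raw, list) or not raw:  # input validation
        raise QualityError("input must be a non-empty list")
    data = raw
    for attempt in range(1, max_attempts + 1):
        # Processing logic: keep only numeric entries.
        values = [float(x) for x in data if isinstance(x, (int, float))]
        quality = len(values) / len(data)  # quality metric
        if quality >= min_quality:
            return PipelineResult(values, quality, attempt)
        data = correct(data)  # recoverable case: correct the data and retry
        if not data:
            break
    raise QualityError(f"quality stayed below {min_quality} after {max_attempts} attempts")


def drop_missing(items: list[object]) -> list[object]:
    """Example correction step: remove missing entries."""
    return [x for x in items if x is not None]


result = run_with_quality_gate([1.0, None, 2.5, 3], drop_missing)
print(result.attempts, result.quality)  # second attempt succeeds after the correction
```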

Key Features

Multi-Omic Analysis: DNA, RNA, protein, and epigenome data integration
Statistical & ML Methods: GWAS, population genetics, machine learning pipelines
Single-Cell Genomics: Complete scRNA-seq analysis workflows
Network Analysis: Biological networks, pathways, community detection algorithms
Visualization Suite: 14 specialized plotting modules with 70+ plot types and publication-quality output
Modular Architecture: Individual modules or complete end-to-end workflows
Comprehensive Documentation: 310+ README files with technical specifications
Implementation Testing: Real methods in tests, no mocks or stubs
Quality Assurance: Rigorous validation and error handling throughout
Performance Optimization: Efficient algorithms for large-scale biological data

Quick Start

I Want To...

Analyze DNA sequences:

# Mean GC content across all sequences in a FASTA file
uv run python -c "
from metainformant.dna import sequences, composition
seqs = sequences.read_fasta('data/sequences.fasta')
gc = [composition.gc_content(s) for s in seqs.values()]
print(f'Avg GC: {sum(gc)/len(gc):.1f}%')
"

Run RNA-seq pipeline (amalgkit):

# 28-species workflow in ~6 hours on n1-standard-16
python3 scripts/rna/orchestrate_species.py \
  --species-list config/hymenoptera_28_species.txt \
  --output output/rna_complete/

Perform GWAS analysis:

# Association testing with population structure correction
python3 scripts/gwas/pipelines/run_analysis.py \
  --vcf data/genotypes.vcf.gz \
  --pheno data/phenotypes.tsv \
  --config config/gwas/amellifera.yaml

Visualize results:

from metainformant.visualization import plots
fig = plots.manhattan(gwas_results)  # or heatmap, network, tree...
fig.savefig('output/figures/manhattan.png', dpi=300)

Deploy to cloud (GCP):

# Spin up VM, run pipeline, collect results, tear down
python3 scripts/cloud/deploy_gcp.py --config config/cloud.yaml

Choosing the Right Module

Your Data Type	Use This Module	Start Here
DNA sequences (FASTA)	`dna`	docs/dna/
RNA-seq (FASTQ, BAM)	`rna` (amalgkit)	docs/rna/
VCF + phenotypes	`gwas`	docs/gwas/workflow.md
Protein (FASTA, PDB)	`protein`	docs/protein/
Single-cell (h5ad, mtx)	`singlecell`	docs/singlecell/
Methylation arrays/BAMs	`epigenome`	docs/epigenome/
Microbiome (16S, metagenome)	`metagenomics`	docs/metagenomics/
Multiple omics (joint analysis)	`multiomics`	docs/multiomics/
Gene lists + GO terms	`ontology`	docs/ontology/
Phenotype traits	`phenotype`	docs/phenotype/
Ecological communities	`ecology`	docs/ecology/
Long-read (PacBio/ONT)	`longread`	docs/longread/
Networks & pathways	`networks`	docs/networks/
Information theory analysis	`information`	docs/information/
Simulation/synthetic data	`simulation`	docs/simulation/
Visualizations only	`visualization`	docs/visualization/
GCP cloud deployment	`cloud`	src/metainformant/cloud/README.md

Not sure? Read the full module matrix.

First-Time Visitor Path

Install (10 min): Follow QUICKSTART.md
Run demo (2 min): python3 scripts/core/run_demo.py
Pick your domain: See table above → click module link
Read workflow guide: Each module's docs/<module>/workflow.md
Try on sample data: Each module has tests/data/<module>/ examples
Run on your data: Replace sample paths with your files

Prerequisites

Module Status Overview

Production-Ready Modules

Category	Module	Status	Key Features
Core	core/	[DONE] Complete	I/O, config, logging, parallel, cache, validation, workflow orchestration
DNA	dna/	[DONE] Complete	Sequences, alignment, phylogeny, population genetics, variant analysis
RNA	rna/	[DONE] Complete & Verified	AMALGKIT integration, workflow orchestration, expression quantification
Protein	protein/	[DONE] Complete	Sequences, structures, AlphaFold, UniProt, functional analysis
GWAS	gwas/	[DONE] Complete	Association testing, QC, population structure, visualization
Math	math/	[DONE] Complete	Population genetics, coalescent, selection, epidemiology
Visualization	visualization/	[DONE] Complete	70+ plot types, animations, publication-quality output
Ontology	ontology/	[DONE] Complete	GO analysis, semantic similarity, functional annotation
Quality	quality/	[DONE] Complete	FASTQ analysis, validation, contamination detection

Functional Modules (Partial Implementation)

Category	Module	Status	Key Features	Coverage
ML	ml/	[PARTIAL] Partial	Classification, regression, feature selection	75%
Networks	networks/	[PARTIAL] Partial	Graph algorithms, community detection	78%
Multi-Omics	multiomics/	[PARTIAL] Partial	Integration, joint PCA, correlation	72%
Single-Cell	singlecell/	[PARTIAL] Partial	Preprocessing, clustering, DE analysis	74%
Epigenome	epigenome/	[PARTIAL] Partial	Methylation, ChIP-seq, ATAC-seq	76%
Phenotype	phenotype/	[PARTIAL] Partial	AntWiki integration, trait analysis	79%
Ecology	ecology/	[PARTIAL] Partial	Community diversity, environmental	77%
Life Events	life_events/	[PARTIAL] Partial	Event sequences, embeddings	73%
Simulation	simulation/	[PARTIAL] Partial	Sequence simulation, ecosystems	71%
Information	information/	[PARTIAL] Partial	Entropy, mutual information	80%

Specialized Domain Modules

Category	Module	Status	Key Features	Coverage
Long-Read	longread/	[PARTIAL] Partial	PacBio/ONT sequencing, assembly, error correction	65%
Metagenomics	metagenomics/	[PARTIAL] Partial	Taxonomic profiling, functional annotation	60%
Structural Variants	structural_variants/	[PARTIAL] Partial	SV/CNV detection, breakpoint resolution	55%
Spatial	spatial/	[PARTIAL] Partial	Spatial transcriptomics, tissue mapping	50%
Pharmacogenomics	pharmacogenomics/	[PARTIAL] Partial	Drug-gene interactions, variant interpretation	55%
Metabolomics	metabolomics/	[PARTIAL] Partial	MS data processing, pathway mapping	50%
Menu	menu/	[PARTIAL] Partial	Interactive CLI menu, workflow navigation	70%
Cloud	cloud/	[DONE] Complete	GCP VM lifecycle, Docker pipelines, genome prep	90%
eQTL	gwas/finemapping/eqtl	[DONE] Complete	Expression-genotype association, cis-eQTL scanning	85%
MCP	mcp/	[PARTIAL] Partial	Model Context Protocol tool implementations	40%

Module Overview

Complete Module Reference

All modules live in src/metainformant/ with documentation in each module's README.md.

Module	Files	Description	Key Components	Docs
Core Infrastructure
`core/`	26	Shared utilities, I/O, logging, config, parallel processing, caching	`io/`, `data/`, `execution/`	README
Molecular Analysis
`dna/`	27	DNA sequences, alignment, phylogenetics, population genetics, variants	`sequence/`, `alignment/`, `population/`	README
`rna/`	29	RNA-seq workflows, amalgkit integration, expression quantification	`amalgkit/`, `engine/`, `analysis/`	README
`protein/`	17	Protein sequences, structure analysis, AlphaFold, UniProt integration	`sequence/`, `structure/`, `database/`	README
`epigenome/`	8	Methylation analysis, ChIP-seq, ATAC-seq, chromatin accessibility	`assays/`, `chromatin_state/`, `peak_calling/`	README
Statistical & ML
`gwas/`	39	GWAS, fine-mapping, eQTL analysis, colocalization, visualization	`finemapping/`, `visualization/`, `analysis/`	README
`math/`	20	Population genetics theory, coalescent, selection, epidemiology	`population_genetics/`, `epidemiology/`, `evolutionary_dynamics/`	README
`ml/`	12	Machine learning pipelines, classification, regression, features	`models/`, `features/`, `llm/`	README
`information/`	14	Information theory, Shannon entropy, mutual information, semantic similarity	`metrics/`, `integration/`	README
Systems Biology
`networks/`	9	Biological networks, graph algorithms, community detection, pathways	`analysis/`, `interaction/`	README
`multiomics/`	6	Multi-omic integration, joint PCA, cross-omic correlation	`analysis/`, `methods/`	README
`singlecell/`	9	scRNA-seq preprocessing, clustering, differential expression	`data/`, `analysis/`, `visualization/`	README
`simulation/`	7	Synthetic data, agent-based models, sequence simulation, ecosystems	`models/`, `workflow/`, `benchmark/`	README
Annotation & Metadata
`ontology/`	7	Gene Ontology, functional annotation, semantic similarity	`core/`, `query/`, `visualization/`	README
`phenotype/`	15	Phenotypic data curation, AntWiki integration, trait analysis	`analysis/`, `data/`, `behavior/`	README
`ecology/`	7	Community diversity, environmental correlations, species matrices	`analysis/`, `phylogenetic/`, `visualization/`	README
`life_events/`	9	Life course analysis, event sequences, temporal embeddings	`models/`, `workflow/`	README
Utilities
`quality/`	4	FASTQ quality assessment, validation, contamination detection	`io/`, `analysis/`, `reporting/`	README
`visualization/`	22	70+ plot types, heatmaps, networks, animations, publication-ready	`plots/`, `genomics/`, `analysis/`	README
Specialized Domains
`longread/`	19	Long-read sequencing (PacBio, ONT), assembly, error correction	`assembly/`, `quality/`	README
`metagenomics/`	11	Metagenomic analysis, taxonomic profiling, functional annotation	`taxonomy/`, `functional/`	README
`pharmacogenomics/`	12	Drug-gene interactions, pharmacokinetics, variant interpretation	`interactions/`	README
`spatial/`	11	Spatial transcriptomics, tissue mapping, spatial statistics	`analysis/`	README
`structural_variants/`	9	SV detection, CNV analysis, breakpoint resolution	`detection/`	README
`menu/`	4	Interactive CLI menu system, workflow navigation	`ui/`	README

Total: 26 modules, 603 Python files

Documentation

Quick Links

Documentation Guide - Complete navigation guide
Quick Start - Fast setup commands
Architecture - System design
Technical Specification - Design standards

Transcriptomics (RNA-seq)

Workflow Guide — ENA-first amalgkit streaming pipeline
Troubleshooting — IO contention & SRA setup fixes
Tissue Patching — Custom metadata correction
Ortholog Generation — Automated cross-species mapping
Step Documentation — The 11-step amalgkit process
Testing Guide - Comprehensive testing documentation
CLI Reference - Command-line interface
eQTL Integration - eQTL pipeline documentation

Module Documentation

Each module has documentation in src/metainformant/<module>/README.md and docs/<module>/.

Scripts & Workflows

The scripts/ directory contains production-ready workflow orchestrators:

Package Management: Setup, testing, quality control
RNA-seq (Amalgkit): Multi-species workflows, amalgkit integration
GWAS (Variants): Genome-scale association studies
eQTL Integration: RNA-seq + Variant cross-omics integration pipelines
Module Orchestrators: Complete workflow scripts for all domains (core, DNA, RNA, protein, networks, multiomics, single-cell, quality, simulation, visualization, epigenome, ecology, ontology, phenotype, ML, math, gwas, information, life_events)

See scripts/README.md for documentation.

CLI Interface

The metainformant command exposes a small CLI (docs/cli.md): --version, --modules, protein (taxon-ids, comp, rmsd-ca), quality batch-detect, rna info, gwas info. Full domain workflows use Python imports, scripts/*/run_*.py, or python -m metainformant.rna.amalgkit.

uv run metainformant --help
uv run metainformant --modules
uv run metainformant protein taxon-ids --file data/taxon_ids.txt
uv run metainformant protein comp --fasta data/proteins.fasta
uv run metainformant protein rmsd-ca --pdb-a data/structure1.pdb --pdb-b data/structure2.pdb
uv run metainformant quality batch-detect --data samples.csv --batches batches.txt

# RNA-seq (config-driven script; see docs/cli.md for Python API)
python3 scripts/rna/run_workflow.py --config config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml

See docs/cli.md for CLI documentation.

Usage Examples

DNA Analysis

from metainformant.dna import alignment, population

# Pairwise alignment
align_result = alignment.pairwise.global_align("ACGTACGT", "ACGTAGGT")
print(f"Score: {align_result.score}")

# Population genetics
sequences = ["ATCGATCG", "ATCGTTCG", "ATCGATCG"]
diversity = population.nucleotide_diversity(sequences)
print(f"π = {diversity:.4f}")
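
For readers unfamiliar with the statistic printed above, nucleotide diversity (π) is the average number of differences per site over all pairs of sequences. The sketch below computes it with the standard library only; `pairwise_nucleotide_diversity` is a hypothetical stand-in written for this example, and the package's `population.nucleotide_diversity` may treat gaps or ambiguous bases differently.

```python
from itertools import combinations


def pairwise_nucleotide_diversity(seqs: list[str]) -> float:
    """Mean per-site difference over all sequence pairs (aligned, equal-length input)."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(seqs, 2))
    # Count mismatching positions for every pair of sequences.
    total_diffs = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs)
    return total_diffs / (len(pairs) * length)


sequences = ["ATCGATCG", "ATCGTTCG", "ATCGATCG"]
pi = pairwise_nucleotide_diversity(sequences)
print(f"π = {pi:.4f}")  # π = 0.0833
```

With three sequences there are three pairs; two of them differ at one of eight sites, giving 2 / (3 × 8).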

RNA-seq Workflow

from pathlib import Path

from metainformant.rna.amalgkit import check_cli_available
from metainformant.rna.engine.workflow import AmalgkitWorkflowConfig, execute_workflow, plan_workflow

available, help_text = check_cli_available()
if not available:
    raise SystemExit(f"Amalgkit not available: {help_text}")  # stop before planning the workflow

config = AmalgkitWorkflowConfig(
    work_dir=Path("output/amalgkit/work"),
    threads=8,
    species_list=["Apis_mellifera"],
)

steps = plan_workflow(config)
print(f"Planned {len(steps)} workflow steps")

results = execute_workflow(config)
for step_result in results.steps_executed:
    print(f"{step_result.step_name}: exit code {step_result.return_code}")

# End-to-end workflow for a single species (recommended)
python3 scripts/rna/run_workflow.py --config config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml

# Check status
python3 scripts/rna/run_workflow.py --config config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml --status

# Alternative: Bash-based orchestrator
bash scripts/rna/amalgkit/run_amalgkit.sh --config config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml

GWAS Analysis

from metainformant.gwas import manhattan_plot, run_gwas

results = run_gwas(
    vcf_path="data/variants/cohort.vcf.gz",
    phenotype_path="data/phenotypes/traits.tsv",
    config={"association": {"model": "linear"}},
    output_dir="output/gwas"
)

# Visualize results
manhattan_plot(results["association_results"], output_path="output/gwas/manhattan.png")
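
A Manhattan plot shows each variant's p-value as -log10(p) against genomic position, usually with a multiple-testing threshold drawn across it. The sketch below prepares those values with a Bonferroni threshold. It is an illustration under stated assumptions: `bonferroni_threshold`, `manhattan_points`, and the `(chrom, pos, p)` tuple layout are hypothetical and do not describe the structure of `results["association_results"]`.

```python
import math


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def chrom_key(chrom: str) -> tuple[float, str]:
    """Sort numeric chromosomes numerically and named ones (X, Y, MT) afterwards."""
    return (int(chrom), "") if chrom.isdigit() else (math.inf, chrom)


def manhattan_points(results: list[tuple[str, int, float]], alpha: float = 0.05) -> list[dict]:
    """Return plot-ready points ordered by chromosome and position."""
    threshold = bonferroni_threshold(len(results), alpha)
    points = []
    for chrom, pos, p in sorted(results, key=lambda r: (chrom_key(r[0]), r[1])):
        if not 0.0 < p <= 1.0:
            raise ValueError(f"invalid p-value {p!r} at {chrom}:{pos}")
        points.append({
            "chrom": chrom,
            "pos": pos,
            "neg_log10_p": -math.log10(p),
            "significant": p <= threshold,
        })
    return points


toy_results = [("2", 4100, 0.01), ("1", 5300, 1e-9), ("1", 1200, 0.04), ("2", 800, 0.5)]
for point in manhattan_points(toy_results):
    print(point)
```

Bonferroni correction is conservative when variants are correlated through linkage disequilibrium; it is used here only because it is the simplest threshold to state.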

eQTL Integration Pipeline

The eQTL pipeline bridges the genomic variants from the GWAS pipeline with the gene expression matrices provided by the Amalgkit (RNA) pipeline.

# Run the pipeline leveraging real Amalgkit RNA-seq quantification data
uv run python scripts/eqtl/run_eqtl_real.py

# Or explore the logic with synthetic data
uv run python scripts/eqtl/run_eqtl_demo.py

# Call SNP variants directly from transcriptome RNA-seq data
uv run python scripts/eqtl/rna_snp_pipeline.py --species amellifera --n-samples 3

Visualization

from metainformant.visualization import heatmap, animate_time_series

# Heatmap
heatmap(correlation_matrix, cmap="viridis", annot=True)

# Animation
fig, anim = animate_time_series(time_series_data)
anim.save("output/animation.gif")

Network Analysis

from metainformant.networks import create_network, detect_communities, centrality_measures

# Create network from interactions
network = create_network(edges, directed=False)

# Detect communities
communities = detect_communities(network)

# Calculate centrality
centrality = centrality_measures(network)
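
To make the three steps above concrete, here is a standard-library sketch of an undirected network with degree centrality. Connected components are used as a deliberately simple stand-in for community detection; real community detection algorithms (for example modularity-based methods) are more elaborate, and which algorithm `detect_communities` uses is not stated here. All function names in the block are hypothetical.

```python
from collections import defaultdict, deque


def build_adjacency(edges: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Build an undirected adjacency map from an edge list."""
    adjacency: dict[str, set[str]] = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


def degree_centrality(adjacency: dict[str, set[str]]) -> dict[str, float]:
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(neighbors) / (n - 1) for node, neighbors in adjacency.items()}


def connected_components(adjacency: dict[str, set[str]]) -> list[set[str]]:
    """Group nodes that can reach each other, using breadth-first search."""
    seen: set[str] = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("D", "E")]
graph = build_adjacency(toy_edges)
print(degree_centrality(graph))
print(connected_components(graph))
```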

Multi-Omics Integration

from metainformant.multiomics import integrate_omics_data, joint_pca

# Integrate multiple omics datasets
multiomics = integrate_omics_data(
    genomics=genomics_data,
    transcriptomics=rna_data,
    proteomics=protein_data
)

# Joint dimensionality reduction
pca_result = joint_pca(multiomics)

Information Theory

from metainformant.information import shannon_entropy, mutual_information, information_content

# Calculate Shannon entropy
probs = [0.5, 0.3, 0.2]
entropy = shannon_entropy(probs)

# Mutual information between sequences
mi = mutual_information(sequence_x, sequence_y)

# Information content for hierarchical terms
ic = information_content(term_frequencies, "GO:0008150")
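
The two quantities used above have short definitions: Shannon entropy is H = -Σ p·log p, and mutual information is I(X;Y) = Σ p(x,y)·log[p(x,y) / (p(x)·p(y))]. The sketch below implements both from empirical frequencies. The names `entropy_bits` and `mutual_information_bits` are hypothetical, and the package's own functions may use a different log base or estimator.

```python
import math
from collections import Counter


def entropy_bits(probs: list[float]) -> float:
    """Shannon entropy in bits of a discrete probability distribution."""
    if any(p < 0 for p in probs) or not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must be non-negative and sum to 1")
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information_bits(x: str, y: str) -> float:
    """Plug-in estimate of mutual information between two aligned symbol sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(x)
    count_x, count_y, count_xy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2((c / n) / ((count_x[a] / n) * (count_y[b] / n)))
        for (a, b), c in count_xy.items()
    )


print(f"{entropy_bits([0.5, 0.3, 0.2]):.4f}")  # 1.4855
print(mutual_information_bits("AACC", "GGTT"))  # positions fully determine each other
print(mutual_information_bits("AACC", "GTGT"))  # positions are independent
```

The plug-in estimator is biased upward for short sequences, so values from small samples should be read with care.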

Life Events Analysis

from metainformant.life_events import EventSequence, Event, analyze_life_course
from datetime import datetime

# Create event sequences
events = [
    Event("degree", datetime(2010, 6, 1), "education"),
    Event("job_change", datetime(2015, 3, 1), "occupation"),
]
sequence = EventSequence(person_id="person_001", events=events)

# Analyze life course
results = analyze_life_course([sequence], outcomes=None)

Protein Analysis

from metainformant.protein import sequences, alignment, structure

# Read protein sequences
proteins = sequences.read_fasta("data/proteins.fasta")

# Pairwise alignment
align_result = alignment.global_align(proteins["seq1"], proteins["seq2"])

# Structure analysis
structure_data = structure.load_pdb("data/structure.pdb")
contacts = structure.analyze_contacts(structure_data)

Epigenome Analysis

from metainformant.epigenome import methylation, chipseq

# Methylation analysis
meth_data = methylation.load_bedgraph("data/methylation.bedgraph")
regions = methylation.find_dmr(meth_data, threshold=0.3)

# ChIP-seq peak calling
peaks = chipseq.call_peaks("data/chipseq.bam", "data/control.bam")

Ontology Analysis

from metainformant.ontology.core import go
from metainformant.ontology.query import query

# Load Gene Ontology
go_graph = go.load_obo("data/go.obo")

# Query ontology
terms = query.get_ancestors(go_graph, "GO:0008150")
similarity = query.semantic_similarity(go_graph, "GO:0008150", "GO:0008151")
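
An ancestor query on an ontology is a traversal of the `is_a` graph from a term up to the root. The sketch below shows that traversal on a toy graph. The term IDs (`T:root`, `T:a`, and so on) are invented for the example and are not Gene Ontology terms, and `ancestors_of` and `shared_ancestors` are hypothetical names rather than the `query` module's API.

```python
def ancestors_of(parents: dict[str, list[str]], term: str) -> set[str]:
    """All terms reachable from `term` by following parent links (term itself excluded)."""
    found: set[str] = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def shared_ancestors(parents: dict[str, list[str]], term_a: str, term_b: str) -> set[str]:
    """Ancestors common to both terms, counting each term as its own ancestor."""
    closure_a = ancestors_of(parents, term_a) | {term_a}
    closure_b = ancestors_of(parents, term_b) | {term_b}
    return closure_a & closure_b


# Toy is_a graph: child -> list of parents.
toy_parents = {
    "T:root": [],
    "T:a": ["T:root"],
    "T:b": ["T:root"],
    "T:c": ["T:a", "T:b"],
    "T:d": ["T:a"],
}
print(sorted(ancestors_of(toy_parents, "T:c")))
print(sorted(shared_ancestors(toy_parents, "T:c", "T:d")))
```

Semantic similarity measures such as Resnik's are built on this step: they score a pair of terms by the information content of their most informative shared ancestor.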

Phenotype Analysis

from metainformant.phenotype import life_course, antwiki

# Life course analysis
traits = life_course.load_traits("data/traits.csv")
curated = life_course.curate_traits(traits)

# AntWiki integration
species_data = antwiki.fetch_species("Pogonomyrmex_barbatus")

Ecology Analysis

from metainformant.ecology import community, environmental

# Community analysis
species_matrix = community.load_matrix("data/species.csv")
diversity = community.calculate_diversity(species_matrix)

# Environmental data
env_data = environmental.load_data("data/environment.csv")
correlations = environmental.analyze_correlations(species_matrix, env_data)
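
Two common community statistics illustrate what a diversity calculation on a species matrix involves: the Gini-Simpson index (within-site diversity) and Bray-Curtis dissimilarity (between-site difference). The sketch below is a plain-Python illustration with hypothetical function names; which indices `community.calculate_diversity` returns is not specified here.

```python
def gini_simpson(counts: list[float]) -> float:
    """1 - sum(p_i^2): chance that two random individuals belong to different species."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("site has no individuals")
    return 1.0 - sum((c / total) ** 2 for c in counts)


def bray_curtis(site_a: list[float], site_b: list[float]) -> float:
    """Abundance-based dissimilarity between two sites, from 0 (identical) to 1."""
    if len(site_a) != len(site_b):
        raise ValueError("sites must list the same species in the same order")
    denominator = sum(site_a) + sum(site_b)
    if denominator == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(site_a, site_b)) / denominator


# Rows are sites, columns are species counts (toy data).
site_1 = [10, 10, 0]
site_2 = [5, 5, 10]
print(gini_simpson(site_1))         # 0.5
print(gini_simpson(site_2))         # 0.625
print(bray_curtis(site_1, site_2))  # 0.5
```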

Mathematical Biology

from metainformant.math import popgen, coalescent

# Population genetics
sequences = ["ATCGATCG", "ATCGTTCG", "ATCGATCG"]
fst = popgen.fst(sequences, populations=[0, 0, 1])

# Coalescent simulation
tree = coalescent.simulate_coalescent(n_samples=10, Ne=1000)
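
As background for the coalescent call above: under the standard Kingman coalescent for a diploid population of effective size Ne, the waiting time while k lineages remain is exponential with rate k(k-1)/(4·Ne) per generation, so the expected time to the most recent common ancestor of n samples is 4·Ne·(1 - 1/n). The sketch below simulates only those waiting times, not a tree, and its function names are hypothetical; it makes no claim about how `coalescent.simulate_coalescent` is implemented or what it returns.

```python
import random


def expected_tmrca(n_samples: int, ne: float) -> float:
    """Expected TMRCA in generations for a diploid population: 4*Ne*(1 - 1/n)."""
    if n_samples < 2:
        raise ValueError("need at least two samples")
    return 4.0 * ne * (1.0 - 1.0 / n_samples)


def simulate_tmrca(n_samples: int, ne: float, rng: random.Random) -> float:
    """Draw one TMRCA by summing exponential waiting times as lineages merge."""
    time = 0.0
    for k in range(n_samples, 1, -1):
        rate = k * (k - 1) / (4.0 * ne)  # coalescence rate with k lineages
        time += rng.expovariate(rate)
    return time


rng = random.Random(42)
draws = [simulate_tmrca(10, 1000, rng) for _ in range(20_000)]
print(f"theory:    {expected_tmrca(10, 1000):.0f} generations")
print(f"simulated: {sum(draws) / len(draws):.0f} generations")
```

Most of the variance comes from the final step, when only two lineages remain, which is why individual draws scatter widely around the mean.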

Single-Cell Analysis

from metainformant.singlecell import preprocessing, clustering

# Load single-cell data
adata = preprocessing.load_h5ad("data/counts.h5ad")

# Preprocessing
adata = preprocessing.filter_cells(adata, min_genes=200)
adata = preprocessing.normalize(adata)

# Clustering
clusters = clustering.leiden(adata, resolution=0.5)
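
The preprocessing steps above follow a common pattern: drop cells in which too few genes were detected, then scale each cell to a fixed total count and log-transform. The sketch below shows that pattern on a plain dictionary of counts. It is an assumption-laden illustration: the function names, the dictionary layout, and the `target_sum` default are hypothetical, and the module's own `filter_cells` and `normalize` operate on AnnData objects.

```python
import math

Counts = dict[str, list[float]]  # cell ID -> raw counts per gene (hypothetical layout)


def filter_cells_by_genes(counts: Counts, min_genes: int) -> Counts:
    """Keep cells in which at least `min_genes` genes have a non-zero count."""
    return {
        cell: row
        for cell, row in counts.items()
        if sum(1 for c in row if c > 0) >= min_genes
    }


def normalize_log1p(counts: Counts, target_sum: float = 1e4) -> Counts:
    """Scale each cell to `target_sum` total counts, then apply log(1 + x)."""
    normalized = {}
    for cell, row in counts.items():
        total = sum(row)
        scale = target_sum / total if total > 0 else 0.0
        normalized[cell] = [math.log1p(c * scale) for c in row]
    return normalized


raw = {"cell1": [5, 0, 5], "cell2": [0, 0, 1], "cell3": [2, 2, 6]}
kept = filter_cells_by_genes(raw, min_genes=2)
print(sorted(kept))  # cell2 is dropped: only one gene detected
print(normalize_log1p(kept, target_sum=10))
```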

Quality Control

from metainformant.quality import fastq, metrics

# FASTQ quality assessment
qc_report = fastq.assess_quality("data/reads.fastq")
print(f"Mean quality: {qc_report['mean_quality']}")

# General metrics
quality_score = metrics.calculate_quality(data_matrix)

Machine Learning

from metainformant.ml import classification, features

# Feature extraction (use a new name so the imported `features` module is not shadowed)
feature_matrix = features.extract_features(data, method="pca", n_components=50)

# Classification
model = classification.train_classifier(
    X_train, y_train, method="random_forest"
)
predictions = model.predict(X_test)

Simulation

from metainformant.simulation import sequences, ecosystems

# Sequence simulation
sim_seqs = sequences.simulate_sequences(
    n_sequences=100, length=1000, mutation_rate=0.01
)

# Ecosystem simulation
ecosystem = ecosystems.simulate_community(
    n_species=50, interactions="random"
)

Core Utilities

from metainformant.core import io, paths, logging

# I/O operations
data = io.load_json("config/example.json")
io.dump_json(results, "output/results.json")

# Path handling
resolved = paths.expand_and_resolve("~/data/input.txt")
is_safe = paths.is_within(resolved, base_path="/safe/directory")

# Logging
logger = logging.get_logger(__name__)
logger.info("Processing data")
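
The path check above guards against inputs that escape a permitted directory, for example through `..` segments. A minimal `pathlib` sketch of the idea follows. The two functions reuse the names from the example for readability, but they are stand-ins written for this illustration; the actual `metainformant.core.paths` functions may differ in signature and behaviour.

```python
from pathlib import Path


def expand_and_resolve(path: str) -> Path:
    """Expand '~' and collapse '..' segments and symlinks into an absolute path."""
    return Path(path).expanduser().resolve()


def is_within(path: str, base_path: str) -> bool:
    """True when the resolved path is the base directory or lies below it."""
    # Path.is_relative_to requires Python 3.9 or newer.
    return expand_and_resolve(path).is_relative_to(expand_and_resolve(base_path))


print(is_within("/safe/directory/run1/../results.json", "/safe/directory"))  # stays inside
print(is_within("/safe/directory/../etc/passwd", "/safe/directory"))         # escapes
```

Resolving both paths before comparing is the important step: a plain string prefix check would accept `/safe/directory/../etc/passwd`.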

Development

Getting started? Read SETUP.md first.

Running Tests

# All tests
bash scripts/package/test.sh

# Fast tests only
bash scripts/package/test.sh --mode fast

# Specific module
pytest tests/dna/ -v

Code Quality

# Check code quality
bash scripts/package/uv_quality.sh

# Run linting
ruff check src/

# Type checking
mypy src/metainformant

Project Structure

MetaInformAnt/
 src/metainformant/ # Main package
 core/ # Core utilities
 dna/ # DNA analysis
 rna/ # RNA analysis
 protein/ # Protein analysis
 gwas/ # GWAS analysis
 ... # Additional modules
 scripts/ # Workflow scripts
 package/ # Package management
 rna/ # RNA workflows
 gwas/ # GWAS workflows
 ... # Module scripts
 docs/ # Documentation
 tests/ # Test suite
 config/ # Configuration files
 output/ # Analysis outputs
 data/ # Input data

AI-Assisted Development

This project was developed with AI assistance (grok-code-fast-1 via Cursor) to enhance:

Code generation and algorithm implementation
Comprehensive documentation
Test case generation
Architecture design

All AI-generated content undergoes human review. See AGENTS.md for details.

Known Limitations

Module Completeness

Some modules have partial implementations or optional dependencies:

Machine Learning: Framework exists; some methods may need completion (see ML Documentation)
Multi-omics: Integration methods implemented; additional dependencies may be required
Single-cell: Requires scipy, scanpy, anndata (see Single-Cell Documentation)
Network Analysis: Algorithms implemented; regulatory network features may need enhancement

GWAS Module

Variant Download: Database download (dbSNP, 1000 Genomes) is a placeholder; use SRA-based workflow or provide VCF files
Functional Annotation: Requires external tools (ANNOVAR, VEP, SnpEff) for variant annotation
Mixed Models: Relatedness adjustment implemented; MLM methods may require GCTA/EMMAX integration

Test Coverage

Some modules have lower test success rates due to optional dependencies:

Single-cell: Requires scientific dependencies (scanpy, anndata)
Multi-omics: Framework exists, tests may skip without dependencies
Network Analysis: Tests pass; features may need additional setup

See Testing Guide for detailed testing documentation and coverage information.

Best Practices

File Naming

Use informative names: sample_pca_biplot_colored_by_treatment.png
Avoid generic names: plot1.png, output.png

Output Organization

All outputs in output/ directory
Configuration saved with results
Visualizations in subdirectories with metadata

No Mocking Policy

All tests exercise real implementations
No fake, mocked, or stubbed methods
Tests make real API calls or skip gracefully when a dependency is unavailable
This ensures tests verify actual functionality
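
A graceful skip, as opposed to a mock, can be sketched as a decorator that skips a test when an external tool is missing from `PATH`. The example uses the standard library's `unittest` so that it runs without extra packages; in a pytest suite, `pytest.mark.skipif` plays the same role. The helper `require_tool` and the test class are hypothetical and are not taken from the project's test suite.

```python
import shutil
import subprocess
import unittest


def require_tool(name: str):
    """Skip the decorated test unless the executable `name` is found on PATH."""
    return unittest.skipUnless(shutil.which(name) is not None, f"{name} not found on PATH")


class TestSamtoolsIntegration(unittest.TestCase):
    @require_tool("samtools")
    def test_samtools_runs(self) -> None:
        # Calls the real binary instead of a mock; skipped when samtools is absent.
        proc = subprocess.run(
            ["samtools", "--version"], capture_output=True, text=True, check=False
        )
        self.assertEqual(proc.returncode, 0)
```

The test either exercises the real tool or is reported as skipped with a reason, so a passing run never depends on simulated behaviour.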

Requirements

Python 3.11+
Optional: SRA Toolkit, kallisto (for RNA workflows)
Optional: samtools, bcftools, bwa (for GWAS)

Contributing

See CONTRIBUTING.md for full contribution guidelines.

Contributions are welcome! Please:

Follow the existing code style
Add tests for new features
Update documentation
Use informative commit messages

Recent Improvements

Performance Enhancements

Intelligent Caching: Automatic caching for expensive computations (Tajima's constants, entropy calculations)
NumPy Vectorization: Optimized mathematical operations for 10-100x performance improvements
Progress Tracking: Real-time progress bars for long-running analyses
Memory Optimization: Efficient algorithms for large datasets
Resilient Orchestration: Automatic recovery flows and VM-level hard resets that recover from complete Docker overlay lockups caused by hidden fasterq-dump caches.
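
As an illustration of the caching item above, the constants used in Tajima's D and Watterson's estimator depend only on the sample size n: a1 = Σ 1/i and a2 = Σ 1/i² for i from 1 to n-1. They are natural candidates for memoization. The sketch below uses `functools.lru_cache`; it is an assumption about how such caching could look, and `tajima_constants` is a hypothetical name, not the project's implementation.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def tajima_constants(n: int) -> tuple[float, float]:
    """Return (a1, a2) for sample size n; repeated calls are served from the cache."""
    if n < 2:
        raise ValueError("sample size must be at least 2")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    return a1, a2


a1, a2 = tajima_constants(10)
print(f"a1 = {a1:.4f}, a2 = {a2:.4f}")  # a1 = 2.8290, a2 = 1.5398
tajima_constants(10)                    # second call is a cache hit
print(tajima_constants.cache_info())
```

Watterson's estimator is then S / a1, where S is the number of segregating sites, so a genome scan over many windows with the same sample size reuses the cached value for every window.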

Enhanced Documentation

Comprehensive Tutorials: End-to-end guides for DNA, RNA, GWAS, and information theory workflows
Method Comparison Guides: Decision-making guides for choosing analysis algorithms
Extended FAQ: Troubleshooting and usage guidance for common scenarios
Standardized Docstrings: Consistent formatting with examples and DOI citations

Testing & Reliability

Expanded Test Coverage: 37+ new comprehensive tests with real implementations
Validation Enhancements: Improved parameter validation and error handling
Cross-Platform Compatibility: Python 3.14 support and external drive optimization
Integration Testing: Verified cross-module functionality

New Features

Enhanced GWAS Visualization: Complete visualization suite for population structure, effects, and comparisons
Information Theory Workflows: Batch processing with progress tracking
Protein Proteome Analysis: Taxonomy ID processing and proteome utilities
Advanced Error Handling: Structured error reporting with actionable guidance

Citation

If you use METAINFORMANT in your research, please cite this repository:

@software{metainformant2025,
  author = {MetaInformAnt Development Team},
  title = {MetaInformAnt: Comprehensive Bioinformatics Toolkit},
  year = {2025},
  url = {https://github.com/docxology/MetaInformAnt},
  version = {0.2.6}
}

License

This project is licensed under the Apache License, Version 2.0 - see LICENSE for details.

Contact

Repository: https://github.com/docxology/MetaInformAnt
Issues: https://github.com/docxology/MetaInformAnt/issues
Documentation: https://github.com/docxology/MetaInformAnt/blob/main/docs/

Acknowledgments

Developed with AI assistance from Cursor's Code Assistant (grok-code-fast-1)
Built on established bioinformatics tools and libraries
Community contributions and feedback

Status: Active Development | Version: 0.2.6 | Python: 3.11+ | License: Apache 2.0
