Comprehensive bioinformatics toolkit for multi-omic analysis
METAINFORMANT provides production-ready bioinformatics analysis across genomics, transcriptomics, proteomics, epigenomics, and systems biology. Built with Python 3.11+ and uv for fast dependency management.
| Metric | Value |
|---|---|
| Modules | 28 specialized analysis modules |
| Python Files | 603 implementation files |
| Plot Types | 70+ visualization methods |
| Documentation | 310+ README files |
| Domain | Features |
|---|---|
| DNA | Sequences, alignment, phylogenetics, population genetics, variant analysis |
| RNA | Amalgkit integration, ENA/SRA downloads, Kallisto quantification, industrial-scale pipelines (8,300+ samples across 28 species) |
| GWAS | Association testing, fine-mapping, visualization, complete GWAS pipelines |
| eQTL | Integration of GWAS variants and Amalgkit RNA-seq expression data |
| Multi-omics | Cross-omic integration, joint PCA, correlation analysis |
| ML | Classification, regression, feature selection, LLM integration |
| Visualization | Manhattan plots, heatmaps, networks, animations, publication-ready output |
flowchart TB
subgraph coreInfra["Core Infrastructure"]
CORE["Core Utilities"]
end
subgraph molecular["Molecular Analysis"]
DNA["DNA Analysis"]
RNA["RNA Analysis"]
PROT["Protein Analysis"]
EPI["Epigenome Analysis"]
end
subgraph statsML["Statistical and ML"]
GWAS["GWAS Analysis"]
MATH["Mathematical Biology"]
ML["Machine Learning"]
INFO["Information Theory"]
end
subgraph systems["Systems Biology"]
NET["Network Analysis"]
MULTI["Multi-Omics Integration"]
SC["Single-Cell Analysis"]
SIM["Simulation"]
end
subgraph annotation["Annotation and Metadata"]
ONT["Ontology"]
PHEN["Phenotype Analysis"]
ECO["Ecology"]
LE["Life Events"]
end
subgraph utilities["Utilities"]
QUAL["Quality Control"]
VIZ["Visualization"]
end
subgraph specialized["Specialized Domains"]
LR["Long-Read Sequencing"]
METAG["Metagenomics"]
SV["Structural Variants"]
SPATIAL["Spatial Transcriptomics"]
PHARMA["Pharmacogenomics"]
METAB["Metabolomics"]
MENU["Menu System"]
CLOUD["Cloud Deployment"]
end
CORE --> DNA
CORE --> RNA
CORE --> PROT
CORE --> EPI
CORE --> GWAS
CORE --> MATH
CORE --> ML
CORE --> INFO
CORE --> NET
CORE --> MULTI
CORE --> SC
CORE --> SIM
CORE --> ONT
CORE --> PHEN
CORE --> ECO
CORE --> LE
CORE --> QUAL
CORE --> VIZ
CORE --> LR
CORE --> METAG
CORE --> SV
CORE --> SPATIAL
CORE --> PHARMA
CORE --> METAB
CORE --> MENU
CORE --> CLOUD
graph TD
A["Raw Biological Data"] --> B["Data Ingestion"]
B --> C{Data Type}
C -->|DNA| D["DNA Module"]
C -->|RNA| E["RNA Module"]
C -->|Protein| F["Protein Module"]
C -->|Epigenome| G["Epigenome Module"]
C -->|Phenotype| H["Phenotype Module"]
C -->|Environmental| I["Ecology Module"]
D --> J["Quality Control"]
E --> J
F --> J
G --> J
H --> J
I --> J
J --> K["Core Processing"]
K --> L{Analysis Type}
L -->|Statistical| M["GWAS Module"]
L -->|ML| N["ML Module"]
L -->|Information| O["Information Module"]
L -->|Networks| P["Networks Module"]
L -->|Systems| Q["Multi-Omics Module"]
L -->|Singlecell| R["Single-Cell Module"]
L -->|Simulation| S["Simulation Module"]
M --> T["Results Integration"]
N --> T
O --> T
P --> T
Q --> T
R --> T
S --> T
T --> U["Visualization"]
U --> V["Publication Figures"]
V --> W["Scientific Insights"]
subgraph "Primary Data Types"
X["Genomic"] -.-> D
Y["Transcriptomic"] -.-> E
Z["Proteomic"] -.-> F
AA["Epigenetic"] -.-> G
end
subgraph "Analysis Workflows"
BB["Population Genetics"] -.-> M
CC["Feature Selection"] -.-> N
DD["Mutual Information"] -.-> O
EE["Community Detection"] -.-> P
FF["Joint PCA"] -.-> Q
GG["Trajectory Analysis"] -.-> R
end
subgraph "Output Formats"
HH["Manhattan Plots"] -.-> V
II["Heatmaps"] -.-> V
JJ["Network Graphs"] -.-> V
KK["Animations"] -.-> V
end
graph TD
A["Multi-Omic Datasets"] --> B["Sample Alignment"]
B --> C["Batch Effect Correction"]
C --> D{Integration Strategy}
D -->|Early| E["Concatenated Matrix"]
D -->|Late| F["Separate Models"]
D -->|Intermediate| G["Meta-Analysis"]
E --> H["Joint Dimensionality Reduction"]
F --> I["Individual Analysis"]
G --> J["Result Integration"]
H --> K["Unified Clustering"]
I --> L["Individual Clustering"]
J --> M["Consensus Clustering"]
K --> N["Functional Enrichment"]
L --> N
M --> N
N --> O["Pathway Analysis"]
O --> P["Network Construction"]
P --> Q["Biological Interpretation"]
Q --> R["Systems Biology Insights"]
subgraph "Omic Layers"
S["Genomics"] -.-> A
T["Transcriptomics"] -.-> A
U["Proteomics"] -.-> A
V["Metabolomics"] -.-> A
W["Epigenomics"] -.-> A
end
subgraph "Integration Methods"
X["MOFA"] -.-> H
Y["Joint PCA"] -.-> H
Z["Similarity Networks"] -.-> H
end
subgraph "Biological Outputs"
AA["Gene Modules"] -.-> Q
BB["Regulatory Networks"] -.-> Q
CC["Disease Pathways"] -.-> Q
DD["Biomarkers"] -.-> Q
end
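As a rough illustration of the "Early" branch in the diagram above (a concatenated matrix feeding joint dimensionality reduction), the sketch below standardizes each omic layer per feature and concatenates the layers over the samples they share. It uses only the standard library, and the names `zscore_columns` and `early_integration` are hypothetical; this is not METAINFORMANT's `integrate_omics_data`, whose behavior may differ.

```python
from statistics import mean, pstdev


def zscore_columns(matrix):
    """Standardize each feature (column) to mean 0 and unit variance."""
    scaled_columns = []
    for column in zip(*matrix):
        mu, sigma = mean(column), pstdev(column)
        # Constant features carry no signal; map them to 0.0 rather than divide by zero.
        scaled_columns.append([(v - mu) / sigma if sigma > 0 else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]


def early_integration(layers):
    """Concatenate omic layers feature-wise over their shared samples.

    `layers` maps a layer name to {sample_id: [feature values]}.
    Returns (sample_ids, matrix) with one row per shared sample.
    """
    if not layers:
        raise ValueError("at least one omic layer is required")
    # Sample alignment: keep only samples present in every layer.
    shared = sorted(set.intersection(*(set(layer) for layer in layers.values())))
    # Scale each layer separately so no layer dominates through its units.
    blocks = [
        zscore_columns([layers[name][sample] for sample in shared])
        for name in sorted(layers)
    ]
    matrix = [
        [value for block in blocks for value in block[i]]
        for i in range(len(shared))
    ]
    return shared, matrix


layers = {
    "transcriptomics": {"s1": [1.0, 2.0], "s2": [3.0, 2.0], "s3": [5.0, 2.0]},
    "proteomics": {"s1": [10.0], "s2": [20.0], "s4": [30.0]},
}
samples, joint_matrix = early_integration(layers)
print(samples)
print(joint_matrix)
```

The resulting matrix is what a joint PCA or clustering step would consume. Per-layer scaling is one common choice among several and is an assumption here.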
graph TD
A["Data Processing Pipeline"] --> B["Input Validation"]
B --> C["Type Checking"]
C --> D["Schema Validation"]
D --> E["Processing Logic"]
E --> F["Error Handling"]
F --> G["Recovery Mechanisms"]
G --> H["Output Validation"]
H --> I["Result Verification"]
I --> J["Quality Metrics"]
J --> K{Acceptable Quality?}
K -->|Yes| L["Pipeline Success"]
K -->|No| M["Quality Issues"]
M --> N["Diagnostic Analysis"]
N --> O["Error Classification"]
O --> P{Recoverable?}
P -->|Yes| Q["Data Correction"]
P -->|No| R["Pipeline Failure"]
Q --> E
L --> S["Validated Results"]
R --> T["Error Reporting"]
subgraph "Validation Layers"
U["Data Integrity"] -.-> B
V["Business Logic"] -.-> E
W["Statistical Validity"] -.-> H
end
subgraph "Quality Controls"
X["Unit Tests"] -.-> F
Y["Integration Tests"] -.-> I
Z["Performance Benchmarks"] -.-> J
end
subgraph "Error Types"
AA["Data Errors"] -.-> O
BB["Logic Errors"] -.-> O
CC["System Errors"] -.-> O
DD["External Errors"] -.-> O
end
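The control flow in the diagram above (validate, process, classify errors, correct recoverable ones, retry) can be sketched in a few lines. Everything below is illustrative: the exception classes, `PipelineResult`, and the toy "counts to proportions" processing step are assumptions made for this example, not METAINFORMANT's actual validation API.

```python
from dataclasses import dataclass, field


class RecoverableError(Exception):
    """A data problem that a correction step may be able to fix."""


class FatalError(Exception):
    """A problem that cannot be corrected automatically."""


@dataclass
class PipelineResult:
    ok: bool
    values: list = field(default_factory=list)
    log: list = field(default_factory=list)


def process(values):
    """Processing logic: convert counts to proportions, classifying errors."""
    if any(v is None for v in values):
        raise RecoverableError("missing values present")
    if any(not isinstance(v, (int, float)) for v in values):
        raise FatalError("non-numeric value present")
    total = sum(values)
    if total <= 0:
        raise FatalError("total count must be positive")
    return [v / total for v in values]


def correct(values):
    """Data correction: drop missing entries."""
    return [v for v in values if v is not None]


def run_pipeline(values, max_corrections=1):
    log = []
    # Input validation
    if not isinstance(values, list) or not values:
        log.append("failure: input must be a non-empty list")
        return PipelineResult(False, [], log)
    corrections = 0
    while True:
        try:
            result = process(values)
        except RecoverableError as exc:
            if corrections >= max_corrections:
                log.append(f"failure: {exc} (correction budget exhausted)")
                return PipelineResult(False, [], log)
            log.append(f"recoverable: {exc}; applying correction")
            values = correct(values)
            corrections += 1
            continue  # loop back to the processing step
        except FatalError as exc:
            log.append(f"failure: {exc}")
            return PipelineResult(False, [], log)
        # Output validation: proportions must sum to 1 within tolerance.
        if abs(sum(result) - 1.0) > 1e-9:
            log.append("failure: output does not sum to 1")
            return PipelineResult(False, [], log)
        log.append("success")
        return PipelineResult(True, result, log)


print(run_pipeline([2, None, 6]))
print(run_pipeline([1, "x"]))
```

Bounding the number of corrections keeps the "recoverable" loop from cycling forever when a correction does not actually resolve the problem.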
- Multi-Omic Analysis: DNA, RNA, protein, and epigenome data integration
- Statistical & ML Methods: GWAS, population genetics, machine learning pipelines
- Single-Cell Genomics: Complete scRNA-seq analysis workflows
- Network Analysis: Biological networks, pathways, community detection algorithms
- Visualization Suite: 14 specialized plotting modules with 70+ plot types and publication-quality output
- Modular Architecture: Individual modules or complete end-to-end workflows
- Comprehensive Documentation: 310+ README files with technical specifications
- Implementation Testing: Real methods in tests, no mocks or stubs
- Quality Assurance: Rigorous validation and error handling throughout
- Performance Optimization: Efficient algorithms for large-scale biological data
Analyze DNA sequences:
# One-liner: GC content, k-mer analysis, phylogeny
uv run python -c "
from metainformant.dna import sequences, composition, phylogeny
seqs = sequences.read_fasta('data/sequences.fasta')
gc = [composition.gc_content(s) for s in seqs.values()]
print(f'Avg GC: {sum(gc)/len(gc):.1f}%')
"
Run RNA-seq pipeline (amalgkit):
# 28-species workflow in ~6 hours on n1-standard-16
python3 scripts/rna/orchestrate_species.py \
    --species-list config/hymenoptera_28_species.txt \
    --output output/rna_complete/
Perform GWAS analysis:
# Association testing with population structure correction
python3 scripts/gwas/pipelines/run_analysis.py \
    --vcf data/genotypes.vcf.gz \
    --pheno data/phenotypes.tsv \
    --config config/gwas/amellifera.yaml
Visualize results:
from metainformant.visualization import plots
fig = plots.manhattan(gwas_results) # or heatmap, network, tree...
fig.savefig('output/figures/manhattan.png', dpi=300)
Deploy to cloud (GCP):
# Spin up VM, run pipeline, collect results, tear down
python3 scripts/cloud/deploy_gcp.py --config config/cloud.yaml
| Your Data Type | Use This Module | Start Here |
|---|---|---|
| DNA sequences (FASTA) | dna | docs/dna/ |
| RNA-seq (FASTQ, BAM) | rna (amalgkit) | docs/rna/ |
| VCF + phenotypes | gwas | docs/gwas/workflow.md |
| Protein (FASTA, PDB) | protein | docs/protein/ |
| Single-cell (h5ad, mtx) | singlecell | docs/singlecell/ |
| Methylation arrays/BAMs | epigenome | docs/epigenome/ |
| Microbiome (16S, metagenome) | metagenomics | docs/metagenomics/ |
| Multiple omics (joint analysis) | multiomics | docs/multiomics/ |
| Gene lists + GO terms | ontology | docs/ontology/ |
| Phenotype traits | phenotype | docs/phenotype/ |
| Ecological communities | ecology | docs/ecology/ |
| Long-read (PacBio/ONT) | longread | docs/longread/ |
| Networks & pathways | networks | docs/networks/ |
| Information theory analysis | information | docs/information/ |
| Simulation/synthetic data | simulation | docs/simulation/ |
| Visualizations only | visualization | docs/visualization/ |
| GCP cloud deployment | cloud | src/metainformant/cloud/README.md |
Not sure? Read the full module matrix.
- Install (10 min): Follow QUICKSTART.md
- Run demo (2 min): python3 scripts/core/run_demo.py
- Pick your domain: See table above → click module link
- Read workflow guide: Each module's docs/<module>/workflow.md
- Try on sample data: Each module has tests/data/<module>/ examples
- Run on your data: Replace sample paths with your files
| Category | Module | Status | Key Features |
|---|---|---|---|
| Core | core/ | [DONE] Complete | I/O, config, logging, parallel, cache, validation, workflow orchestration |
| DNA | dna/ | [DONE] Complete | Sequences, alignment, phylogeny, population genetics, variant analysis |
| RNA | rna/ | [DONE] Complete & Verified | AMALGKIT integration, workflow orchestration, expression quantification |
| Protein | protein/ | [DONE] Complete | Sequences, structures, AlphaFold, UniProt, functional analysis |
| GWAS | gwas/ | [DONE] Complete | Association testing, QC, population structure, visualization |
| Math | math/ | [DONE] Complete | Population genetics, coalescent, selection, epidemiology |
| Visualization | visualization/ | [DONE] Complete | 70+ plot types, animations, publication-quality output |
| Ontology | ontology/ | [DONE] Complete | GO analysis, semantic similarity, functional annotation |
| Quality | quality/ | [DONE] Complete | FASTQ analysis, validation, contamination detection |
| Category | Module | Status | Key Features | Coverage |
|---|---|---|---|---|
| ML | ml/ | [PARTIAL] Partial | Classification, regression, feature selection | 75% |
| Networks | networks/ | [PARTIAL] Partial | Graph algorithms, community detection | 78% |
| Multi-Omics | multiomics/ | [PARTIAL] Partial | Integration, joint PCA, correlation | 72% |
| Single-Cell | singlecell/ | [PARTIAL] Partial | Preprocessing, clustering, DE analysis | 74% |
| Epigenome | epigenome/ | [PARTIAL] Partial | Methylation, ChIP-seq, ATAC-seq | 76% |
| Phenotype | phenotype/ | [PARTIAL] Partial | AntWiki integration, trait analysis | 79% |
| Ecology | ecology/ | [PARTIAL] Partial | Community diversity, environmental | 77% |
| Life Events | life_events/ | [PARTIAL] Partial | Event sequences, embeddings | 73% |
| Simulation | simulation/ | [PARTIAL] Partial | Sequence simulation, ecosystems | 71% |
| Information | information/ | [PARTIAL] Partial | Entropy, mutual information | 80% |
| Category | Module | Status | Key Features | Coverage |
|---|---|---|---|---|
| Long-Read | longread/ | [PARTIAL] Partial | PacBio/ONT sequencing, assembly, error correction | 65% |
| Metagenomics | metagenomics/ | [PARTIAL] Partial | Taxonomic profiling, functional annotation | 60% |
| Structural Variants | structural_variants/ | [PARTIAL] Partial | SV/CNV detection, breakpoint resolution | 55% |
| Spatial | spatial/ | [PARTIAL] Partial | Spatial transcriptomics, tissue mapping | 50% |
| Pharmacogenomics | pharmacogenomics/ | [PARTIAL] Partial | Drug-gene interactions, variant interpretation | 55% |
| Metabolomics | metabolomics/ | [PARTIAL] Partial | MS data processing, pathway mapping | 50% |
| Menu | menu/ | [PARTIAL] Partial | Interactive CLI menu, workflow navigation | 70% |
| Cloud | cloud/ | [DONE] Complete | GCP VM lifecycle, Docker pipelines, genome prep | 90% |
| eQTL | gwas/finemapping/eqtl | [DONE] Complete | Expression-genotype association, cis-eQTL scanning | 85% |
| MCP | mcp/ | [PARTIAL] Partial | Model Context Protocol tool implementations | 40% |
All modules live in src/metainformant/ with documentation in each module's README.md.
| Module | Files | Description | Key Components | Docs |
|---|---|---|---|---|
| Core Infrastructure | | | | |
| core/ | 26 | Shared utilities, I/O, logging, config, parallel processing, caching | io/, data/, execution/ | README |
| Molecular Analysis | | | | |
| dna/ | 27 | DNA sequences, alignment, phylogenetics, population genetics, variants | sequence/, alignment/, population/ | README |
| rna/ | 29 | RNA-seq workflows, amalgkit integration, expression quantification | amalgkit/, engine/, analysis/ | README |
| protein/ | 17 | Protein sequences, structure analysis, AlphaFold, UniProt integration | sequence/, structure/, database/ | README |
| epigenome/ | 8 | Methylation analysis, ChIP-seq, ATAC-seq, chromatin accessibility | assays/, chromatin_state/, peak_calling/ | README |
| Statistical & ML | | | | |
| gwas/ | 39 | GWAS, fine-mapping, eQTL analysis, colocalization, visualization | finemapping/, visualization/, analysis/ | README |
| math/ | 20 | Population genetics theory, coalescent, selection, epidemiology | population_genetics/, epidemiology/, evolutionary_dynamics/ | README |
| ml/ | 12 | Machine learning pipelines, classification, regression, features | models/, features/, llm/ | README |
| information/ | 14 | Information theory, Shannon entropy, mutual information, semantic similarity | metrics/, integration/ | README |
| Systems Biology | | | | |
| networks/ | 9 | Biological networks, graph algorithms, community detection, pathways | analysis/, interaction/ | README |
| multiomics/ | 6 | Multi-omic integration, joint PCA, cross-omic correlation | analysis/, methods/ | README |
| singlecell/ | 9 | scRNA-seq preprocessing, clustering, differential expression | data/, analysis/, visualization/ | README |
| simulation/ | 7 | Synthetic data, agent-based models, sequence simulation, ecosystems | models/, workflow/, benchmark/ | README |
| Annotation & Metadata | | | | |
| ontology/ | 7 | Gene Ontology, functional annotation, semantic similarity | core/, query/, visualization/ | README |
| phenotype/ | 15 | Phenotypic data curation, AntWiki integration, trait analysis | analysis/, data/, behavior/ | README |
| ecology/ | 7 | Community diversity, environmental correlations, species matrices | analysis/, phylogenetic/, visualization/ | README |
| life_events/ | 9 | Life course analysis, event sequences, temporal embeddings | models/, workflow/ | README |
| Utilities | | | | |
| quality/ | 4 | FASTQ quality assessment, validation, contamination detection | io/, analysis/, reporting/ | README |
| visualization/ | 22 | 70+ plot types, heatmaps, networks, animations, publication-ready | plots/, genomics/, analysis/ | README |
| Specialized Domains | | | | |
| longread/ | 19 | Long-read sequencing (PacBio, ONT), assembly, error correction | assembly/, quality/ | README |
| metagenomics/ | 11 | Metagenomic analysis, taxonomic profiling, functional annotation | taxonomy/, functional/ | README |
| pharmacogenomics/ | 12 | Drug-gene interactions, pharmacokinetics, variant interpretation | interactions/ | README |
| spatial/ | 11 | Spatial transcriptomics, tissue mapping, spatial statistics | analysis/ | README |
| structural_variants/ | 9 | SV detection, CNV analysis, breakpoint resolution | detection/ | README |
| menu/ | 4 | Interactive CLI menu system, workflow navigation | ui/ | README |
Total: 26 modules, 603 Python files
- Documentation Guide - Complete navigation guide
- Quick Start - Fast setup commands
- Architecture - System design
- Technical Specification - Design standards
- Workflow Guide - ENA-first amalgkit streaming pipeline
- Troubleshooting - IO contention & SRA setup fixes
- Tissue Patching - Custom metadata correction
- Ortholog Generation - Automated cross-species mapping
- Step Documentation - The 11-step amalgkit process
- Testing Guide - Comprehensive testing documentation
- CLI Reference - Command-line interface
- eQTL Integration - eQTL pipeline documentation
Each module has documentation in src/metainformant/<module>/README.md and docs/<module>/.
The scripts/ directory contains production-ready workflow orchestrators:
- Package Management: Setup, testing, quality control
- RNA-seq (Amalgkit): Multi-species workflows, amalgkit integration
- GWAS (Variants): Genome-scale association studies
- eQTL Integration: RNA-seq + Variant cross-omics integration pipelines
- Module Orchestrators: Complete workflow scripts for all domains (core, DNA, RNA, protein, networks, multiomics, single-cell, quality, simulation, visualization, epigenome, ecology, ontology, phenotype, ML, math, gwas, information, life_events)
See scripts/README.md for documentation.
The metainformant command exposes a small CLI (docs/cli.md): --version, --modules, protein (taxon-ids, comp, rmsd-ca), quality batch-detect, rna info, gwas info. Full domain workflows use Python imports, scripts/*/run_*.py, or python -m metainformant.rna.amalgkit.
uv run metainformant --help
uv run metainformant --modules
uv run metainformant protein taxon-ids --file data/taxon_ids.txt
uv run metainformant protein comp --fasta data/proteins.fasta
uv run metainformant protein rmsd-ca --pdb-a data/structure1.pdb --pdb-b data/structure2.pdb
uv run metainformant quality batch-detect --data samples.csv --batches batches.txt
# RNA-seq (config-driven script; see docs/cli.md for Python API)
python3 scripts/rna/run_workflow.py --config config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml
See docs/cli.md for CLI documentation.
from metainformant.dna import alignment, population
# Pairwise alignment
align_result = alignment.pairwise.global_align("ACGTACGT", "ACGTAGGT")
print(f"Score: {align_result.score}")
# Population genetics
sequences = ["ATCGATCG", "ATCGTTCG", "ATCGATCG"]
diversity = population.nucleotide_diversity(sequences)
print(f"π = {diversity:.4f}")
from pathlib import Path
from metainformant.rna.amalgkit import check_cli_available
from metainformant.rna.engine.workflow import AmalgkitWorkflowConfig, execute_workflow, plan_workflow
available, help_text = check_cli_available()
if not available:
    print(f"Amalgkit not available: {help_text}")
config = AmalgkitWorkflowConfig(
    work_dir=Path("output/amalgkit/work"),
    threads=8,
    species_list=["Apis_mellifera"],
)
steps = plan_workflow(config)
print(f"Planned {len(steps)} workflow steps")
results = execute_workflow(config)
for step_result in results.steps_executed:
    print(f"{step_result.step_name}: exit code {step_result.return_code}")
# End-to-end workflow for a single species (recommended)
python3 scripts/rna/run_workflow.py --config config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml
# Check status
python3 scripts/rna/run_workflow.py --config config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml --status
# Alternative: Bash-based orchestrator
bash scripts/rna/amalgkit/run_amalgkit.sh --config config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml
from metainformant.gwas import manhattan_plot, run_gwas
results = run_gwas(
    vcf_path="data/variants/cohort.vcf.gz",
    phenotype_path="data/phenotypes/traits.tsv",
    config={"association": {"model": "linear"}},
    output_dir="output/gwas",
)
# Visualize results
manhattan_plot(results["association_results"], output_path="output/gwas/manhattan.png")
The eQTL pipeline bridges the genomic variants from the GWAS pipeline with the gene expression matrices provided by the Amalgkit (RNA) pipeline.
# Run the pipeline leveraging real Amalgkit RNA-seq quantification data
uv run python scripts/eqtl/run_eqtl_real.py
# Or explore the logic with synthetic data
uv run python scripts/eqtl/run_eqtl_demo.py
# Call SNP variants directly from transcriptome RNA-seq data
uv run python scripts/eqtl/rna_snp_pipeline.py --species amellifera --n-samples 3
from metainformant.visualization import heatmap, animate_time_series
# Heatmap
heatmap(correlation_matrix, cmap="viridis", annot=True)
# Animation
fig, anim = animate_time_series(time_series_data)
anim.save("output/animation.gif")
from metainformant.networks import create_network, detect_communities, centrality_measures
# Create network from interactions
network = create_network(edges, directed=False)
# Detect communities
communities = detect_communities(network)
# Calculate centrality
centrality = centrality_measures(network)
from metainformant.multiomics import integrate_omics_data, joint_pca
# Integrate multiple omics datasets
multiomics = integrate_omics_data(
    genomics=genomics_data,
    transcriptomics=rna_data,
    proteomics=protein_data,
)
# Joint dimensionality reduction
pca_result = joint_pca(multiomics)
from metainformant.information import shannon_entropy, mutual_information, information_content
# Calculate Shannon entropy
probs = [0.5, 0.3, 0.2]
entropy = shannon_entropy(probs)
# Mutual information between sequences
mi = mutual_information(sequence_x, sequence_y)
# Information content for hierarchical terms
ic = information_content(term_frequencies, "GO:0008150")
from metainformant.life_events import EventSequence, Event, analyze_life_course
from datetime import datetime
# Create event sequences
events = [
    Event("degree", datetime(2010, 6, 1), "education"),
    Event("job_change", datetime(2015, 3, 1), "occupation"),
]
sequence = EventSequence(person_id="person_001", events=events)
# Analyze life course
results = analyze_life_course([sequence], outcomes=None)
from metainformant.protein import sequences, alignment, structure
# Read protein sequences
proteins = sequences.read_fasta("data/proteins.fasta")
# Pairwise alignment
align_result = alignment.global_align(proteins["seq1"], proteins["seq2"])
# Structure analysis
structure_data = structure.load_pdb("data/structure.pdb")
contacts = structure.analyze_contacts(structure_data)
from metainformant.epigenome import methylation, chipseq
# Methylation analysis
meth_data = methylation.load_bedgraph("data/methylation.bedgraph")
regions = methylation.find_dmr(meth_data, threshold=0.3)
# ChIP-seq peak calling
peaks = chipseq.call_peaks("data/chipseq.bam", "data/control.bam")
from metainformant.ontology.core import go
from metainformant.ontology.query import query
# Load Gene Ontology
go_graph = go.load_obo("data/go.obo")
# Query ontology
terms = query.get_ancestors(go_graph, "GO:0008150")
similarity = query.semantic_similarity(go_graph, "GO:0008150", "GO:0008151")
from metainformant.phenotype import life_course, antwiki
# Life course analysis
traits = life_course.load_traits("data/traits.csv")
curated = life_course.curate_traits(traits)
# AntWiki integration
species_data = antwiki.fetch_species("Pogonomyrmex_barbatus")
from metainformant.ecology import community, environmental
# Community analysis
species_matrix = community.load_matrix("data/species.csv")
diversity = community.calculate_diversity(species_matrix)
# Environmental data
env_data = environmental.load_data("data/environment.csv")
correlations = environmental.analyze_correlations(species_matrix, env_data)
from metainformant.math import popgen, coalescent
# Population genetics
sequences = ["ATCGATCG", "ATCGTTCG", "ATCGATCG"]
fst = popgen.fst(sequences, populations=[0, 0, 1])
# Coalescent simulation
tree = coalescent.simulate_coalescent(n_samples=10, Ne=1000)
from metainformant.singlecell import preprocessing, clustering
# Load single-cell data
adata = preprocessing.load_h5ad("data/counts.h5ad")
# Preprocessing
adata = preprocessing.filter_cells(adata, min_genes=200)
adata = preprocessing.normalize(adata)
# Clustering
clusters = clustering.leiden(adata, resolution=0.5)
from metainformant.quality import fastq, metrics
# FASTQ quality assessment
qc_report = fastq.assess_quality("data/reads.fastq")
print(f"Mean quality: {qc_report['mean_quality']}")
# General metrics
quality_score = metrics.calculate_quality(data_matrix)
from metainformant.ml import classification, features
# Feature extraction (stored as feature_matrix so the features module is not shadowed)
feature_matrix = features.extract_features(data, method="pca", n_components=50)
# Classification
model = classification.train_classifier(
    X_train, y_train, method="random_forest"
)
predictions = model.predict(X_test)
from metainformant.simulation import sequences, ecosystems
# Sequence simulation
sim_seqs = sequences.simulate_sequences(
    n_sequences=100, length=1000, mutation_rate=0.01
)
# Ecosystem simulation
ecosystem = ecosystems.simulate_community(
    n_species=50, interactions="random"
)
from metainformant.core import io, paths, logging
# I/O operations
data = io.load_json("config/example.yaml")
io.dump_json(results, "output/results.json")
# Path handling
resolved = paths.expand_and_resolve("~/data/input.txt")
is_safe = paths.is_within(resolved, base_path="/safe/directory")
# Logging
logger = logging.get_logger(__name__)
logger.info("Processing data")
Getting started? Read SETUP.md first.
# All tests
bash scripts/package/test.sh
# Fast tests only
bash scripts/package/test.sh --mode fast
# Specific module
pytest tests/dna/ -v
# Check code quality
bash scripts/package/uv_quality.sh
# Run linting
ruff check src/
# Type checking
mypy src/metainformant
MetaInformAnt/
  src/metainformant/    # Main package
    core/               # Core utilities
    dna/                # DNA analysis
    rna/                # RNA analysis
    protein/            # Protein analysis
    gwas/               # GWAS analysis
    ...                 # Additional modules
  scripts/              # Workflow scripts
    package/            # Package management
    rna/                # RNA workflows
    gwas/               # GWAS workflows
    ...                 # Module scripts
  docs/                 # Documentation
  tests/                # Test suite
  config/               # Configuration files
  output/               # Analysis outputs
  data/                 # Input data
This project was developed with AI assistance (grok-code-fast-1 via Cursor) to enhance:
- Code generation and algorithm implementation
- Comprehensive documentation
- Test case generation
- Architecture design
All AI-generated content undergoes human review. See AGENTS.md for details.
Some modules have partial implementations or optional dependencies:
- Machine Learning: Framework exists; some methods may need completion (see ML Documentation)
- Multi-omics: Integration methods implemented; additional dependencies may be required
- Single-cell: Requires scipy, scanpy, anndata (see Single-Cell Documentation)
- Network Analysis: Algorithms implemented; regulatory network features may need enhancement
- Variant Download: Database download (dbSNP, 1000 Genomes) is a placeholder; use SRA-based workflow or provide VCF files
- Functional Annotation: Requires external tools (ANNOVAR, VEP, SnpEff) for variant annotation
- Mixed Models: Relatedness adjustment implemented; MLM methods may require GCTA/EMMAX integration
Some modules have lower test success rates due to optional dependencies:
- Single-cell: Requires scientific dependencies (scanpy, anndata)
- Multi-omics: Framework exists, tests may skip without dependencies
- Network Analysis: Tests pass; features may need additional setup
See Testing Guide for detailed testing documentation and coverage information.
- Use informative names: sample_pca_biplot_colored_by_treatment.png
- Avoid generic names: plot1.png, output.png
- All outputs in output/ directory
- Configuration saved with results
- Visualizations in subdirectories with metadata
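One way to follow these conventions is to write every result next to the configuration that produced it, under an analysis-specific subdirectory of `output/`. The helper below is a hypothetical sketch (`save_run` is not part of METAINFORMANT) and uses only the standard library.

```python
import json
import tempfile
from pathlib import Path


def save_run(output_root, analysis, label, results, config):
    """Write results and the config that produced them side by side.

    Files land in <output_root>/<analysis>/ and share an informative label.
    """
    run_dir = Path(output_root) / analysis
    run_dir.mkdir(parents=True, exist_ok=True)
    results_path = run_dir / f"{label}_results.json"
    config_path = run_dir / f"{label}_config.json"
    results_path.write_text(json.dumps(results, indent=2, sort_keys=True))
    config_path.write_text(json.dumps(config, indent=2, sort_keys=True))
    return results_path, config_path


def demo():
    # A temporary directory stands in for the repository's output/ folder.
    with tempfile.TemporaryDirectory() as tmp:
        results_path, config_path = save_run(
            Path(tmp) / "output",
            "pca",
            "sample_pca_colored_by_treatment",
            results={"explained_variance": [0.6, 0.2]},
            config={"n_components": 2},
        )
        return results_path.name, json.loads(config_path.read_text())


print(demo())
```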
- All tests use real implementations
- No fake/mocked/stubbed methods
- Real API calls or graceful skips
- Ensures actual functionality
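A "graceful skip" means a test that depends on an external tool is skipped, not mocked, when that tool is absent. The sketch below shows the pattern with the standard library's `unittest`; the test class and the choice of `kallisto` as the probed tool are illustrative assumptions. With pytest, `pytest.mark.skipif` expresses the same idea.

```python
import io
import shutil
import unittest


def tool_available(name):
    """Return True if an executable called `name` is on PATH."""
    return shutil.which(name) is not None


class TestExternalTool(unittest.TestCase):
    @unittest.skipUnless(tool_available("kallisto"), "kallisto not installed")
    def test_kallisto_on_path(self):
        # Runs against the real tool when present; no mock stands in for it.
        self.assertTrue(tool_available("kallisto"))


def run_suite():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestExternalTool)
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)


result = run_suite()
print(f"ran={result.testsRun} skipped={len(result.skipped)} ok={result.wasSuccessful()}")
```

A skipped test is reported as skipped and does not count as a failure, so the suite stays green on machines without the optional dependency.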
- Python 3.11+
- Optional: SRA Toolkit, kallisto (for RNA workflows)
- Optional: samtools, bcftools, bwa (for GWAS)
See CONTRIBUTING.md for full contribution guidelines.
Contributions are welcome! Please:
- Follow the existing code style
- Add tests for new features
- Update documentation
- Use informative commit messages
- Intelligent Caching: Automatic caching for expensive computations (Tajima's constants, entropy calculations)
- NumPy Vectorization: Optimized mathematical operations for 10-100x performance improvements
- Progress Tracking: Real-time progress bars for long-running analyses
- Memory Optimization: Efficient algorithms for large datasets
- Resilient Orchestration: Automatic recovery flows and VM-level hard resets that recover from complete Docker overlay lockups caused by hidden fasterq-dump caches.
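To illustrate the caching item above: the constants used in Tajima's D and Watterson's estimator depend only on the sample size n, so they are natural candidates for memoization. The sketch below uses `functools.lru_cache` and the standard definitions a1 = Σ 1/i and a2 = Σ 1/i² for i = 1..n-1. The function names are hypothetical and this is not necessarily how METAINFORMANT implements its cache.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def tajima_constants(n):
    """Return (a1, a2) for a sample of n sequences; cached per n."""
    if n < 2:
        raise ValueError("need at least two sequences")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    return a1, a2


def watterson_theta(segregating_sites, n):
    """Watterson's estimator: theta_W = S / a1."""
    a1, _a2 = tajima_constants(n)
    return segregating_sites / a1


print(tajima_constants(4))
print(watterson_theta(11, 4))
# Repeated calls with the same n are served from the cache.
print(tajima_constants.cache_info())
```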
- Comprehensive Tutorials: End-to-end guides for DNA, RNA, GWAS, and information theory workflows
- Method Comparison Guides: Decision-making guides for choosing analysis algorithms
- Extended FAQ: Troubleshooting and usage guidance for common scenarios
- Standardized Docstrings: Consistent formatting with examples and DOI citations
- Expanded Test Coverage: 37+ new comprehensive tests with real implementations
- Validation Enhancements: Improved parameter validation and error handling
- Cross-Platform Compatibility: Python 3.14 support and external drive optimization
- Integration Testing: Verified cross-module functionality
- Enhanced GWAS Visualization: Complete visualization suite for population structure, effects, and comparisons
- Information Theory Workflows: Batch processing with progress tracking
- Protein Proteome Analysis: Taxonomy ID processing and proteome utilities
- Advanced Error Handling: Structured error reporting with actionable guidance
If you use METAINFORMANT in your research, please cite this repository:
@software{metainformant2025,
  author = {MetaInformAnt Development Team},
  title = {MetaInformAnt: Comprehensive Bioinformatics Toolkit},
  year = {2025},
  url = {https://github.com/docxology/MetaInformAnt},
  version = {0.2.6}
}
This project is licensed under the Apache License, Version 2.0 - see LICENSE for details.
- Repository: https://github.com/docxology/MetaInformAnt
- Issues: https://github.com/docxology/MetaInformAnt/issues
- Documentation: https://github.com/docxology/MetaInformAnt/blob/main/docs/
- Developed with AI assistance from Cursor's Code Assistant (grok-code-fast-1)
- Built on established bioinformatics tools and libraries
- Community contributions and feedback
Status: Active Development | Version: 0.2.6 | Python: 3.11+ | License: Apache 2.0