METAINFORMANT Documentation

:start-after: "## Overview"
:end-before: "## Installation"

Documentation Navigation

graph TD
    A[METAINFORMANT Documentation] --> B[Getting Started]
    A --> C[User Guides]
    A --> D[Module Documentation]
    A --> E[Developer Resources]
    A --> F[Reference]

    B --> B1[Installation]
    B --> B2[Quick Start]
    B --> B3[Tutorials]

    C --> C1[Workflow Guides]
    C --> C2[Best Practices]
    C --> C3[Troubleshooting]
    C3 --> C3a[RNA Troubleshooting]

    D --> D1[Core Modules]
    D --> D2[Molecular Analysis]
    D --> D3[Statistical Methods]
    D --> D4[Systems Biology]
    D --> D5["Annotation & Metadata"]
    D --> D6[Utilities]

    E --> E1[Architecture]
    E --> E2[Contributing]
    E --> E3[Testing]
    E --> E4[API Reference]

    F --> F1[CLI Reference]
    F --> F2[Configuration]
    F --> F3[Error Codes]


    subgraph "Primary Entry Points"
        G1[README.md] -.-> B
        G2[QUICKSTART.md] -.-> B
        G3[TUTORIALS.md] -.-> C
    end

    subgraph "Module Categories"
        H1["core/"] -.-> D1
        H2["dna/, rna/, protein/, epigenome/"] -.-> D2
        H3["gwas/, math/, ml/, information/"] -.-> D3
        H4["networks/, multiomics/, singlecell/, simulation/"] -.-> D4
        H5["ontology/, phenotype/, ecology/, life_events/"] -.-> D5
        H6["quality/, visualization/"] -.-> D6
        H7["longread/, metagenomics/, structural_variants/, spatial/, pharmacogenomics/, metabolomics/, menu/"] -.-> D4
        H8["cloud/"] -.-> D6
    end

    subgraph "Key Documents"
        I1[architecture.md] -.-> E1
        I2[testing.md] -.-> E3
        I3[cli.md] -.-> F1
        I4[UV_SETUP.md] -.-> B1
    end

Module Overview Matrix

| Category | Module | Description | Key Features |
|---|---|---|---|
| Core | core | Shared utilities and infrastructure | Configuration, I/O, logging, parallel processing, caching |
| DNA | dna | Genomic sequence analysis | Sequences, alignment, phylogeny, population genetics |
| RNA | rna | Transcriptomic analysis | RNA-seq workflows, amalgkit integration |
| Protein | protein | Protein structure and function | Sequences, AlphaFold integration, proteomics |
| Epigenome | epigenome | Epigenetic modifications | Methylation, ChIP-seq, chromatin accessibility |
| GWAS | gwas | Genome-wide association studies | Association testing, quality control, visualization |
| Math | math | Mathematical biology | Population genetics theory, coalescent models |
| ML | ml | Machine learning pipelines | Classification, regression, feature selection |
| Information | information | Information theory | Entropy, mutual information, semantic similarity |
| Networks | networks | Biological networks | PPI, pathways, community detection |
| Multi-omics | multiomics | Multi-omic integration | Joint analysis, data harmonization |
| Single-cell | singlecell | Single-cell genomics | Preprocessing, clustering, trajectory analysis |
| Simulation | simulation | Synthetic data generation | Sequence simulation, agent-based models |
| Ontology | ontology | Functional annotation | Gene Ontology, semantic similarity |
| Phenotype | phenotype | Phenotypic data | Trait analysis, AntWiki integration |
| Ecology | ecology | Ecological analysis | Community diversity, environmental data |
| Life Events | life_events | Temporal event analysis | Life course modeling, embeddings |
| Quality | quality | Data quality assessment | FASTQ analysis, assembly validation |
| Visualization | visualization | Plotting and graphics | 70+ specialized plotting modules |
| Cloud | cloud | Cloud deployment | GCP VM lifecycle, Docker pipelines, genome prep |
| Long-Read | longread | Long-read sequencing | PacBio/ONT, assembly, error correction |
| Metagenomics | metagenomics | Metagenomic analysis | Taxonomic profiling, functional annotation |
| Structural Variants | structural_variants | SV/CNV analysis | Detection, breakpoint resolution |
| Spatial | spatial | Spatial transcriptomics | Tissue mapping, spatial statistics |
| Pharmacogenomics | pharmacogenomics | Clinical genomics | Drug-gene interactions, variant interpretation |
| Metabolomics | metabolomics | Metabolomic analysis | MS data processing, pathway mapping |
| eQTL | eqtl | eQTL integration (cross-cutting) | RNA×GWAS integration (logic in gwas and multiomics) |
| MCP | mcp | Model Context Protocol | LLM tool integrations |
| Menu | menu | Interactive navigation | CLI menu system, workflow discovery |

Module Selection Decision Tree

flowchart TD
    A[What are you analyzing?] --> B{Data Type}
    
    B -->|Sequences<br/>FASTA/FASTQ| C[DNA? RNA? Proteins?]
    B -->|Variants<br/>VCF| D[GWAS / Population Genetics]
    B -->|Annotations<br/>GFF/GTF| E[Functional / Ontology]
    B -->|Counts<br/>Matrix| F[Expression / Abundance]
    B -->|Networks<br/>Edges| G[Pathways / Interactions]
    
    C -->|DNA| H["dna<br/>Sequences, alignment, trees"]
    C -->|RNA| I["rna<br/>Amalgkit, ENA/SRA"]
    C -->|Proteins| J["protein<br/>Structure + function"]
    C -->|Epigenetic| K["epigenome<br/>Bisulfite, ChIP-seq"]

    D --> L["gwas<br/>Association testing"]
    D --> M["multiomics<br/>eQTL, integration"]

    E --> N["ontology<br/>GO, KEGG, enrichment"]
    E --> O["phenotype<br/>Trait analysis"]

    F --> P["singlecell<br/>scRNA-seq, clustering"]
    F --> Q["metagenomics<br/>16S, metagenome"]

    G --> R["networks<br/>Graph + pathway analysis"]
    
    style H fill:#e1f5ff
    style I fill:#e1f5ff
    style J fill:#e1f5ff
    style K fill:#e1f5ff
    style L fill:#e1f5ff
    style M fill:#e1f5ff
    style N fill:#e1f5ff
    style O fill:#e1f5ff
    style P fill:#e1f5ff
    style Q fill:#e1f5ff
    style R fill:#e1f5ff

Click any module name for detailed documentation.
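
The decision tree above can also be read as a simple lookup from input data type to candidate modules. The sketch below is illustrative only: `MODULES_BY_DATA_TYPE`, `EXTENSION_TO_DATA_TYPE`, and `suggest_modules` are hypothetical names that are not part of METAINFORMANT, and the file-extension mapping is an assumption based on the formats named in the tree.

```python
# Hypothetical helper mirroring the decision tree; not a METAINFORMANT API.
MODULES_BY_DATA_TYPE = {
    "sequences": ["dna", "rna", "protein", "epigenome"],
    "variants": ["gwas", "multiomics"],
    "annotations": ["ontology", "phenotype"],
    "counts": ["singlecell", "metagenomics"],
    "networks": ["networks"],
}

# Assumed mapping from common file extensions to the tree's data types.
EXTENSION_TO_DATA_TYPE = {
    ".fasta": "sequences",
    ".fa": "sequences",
    ".fastq": "sequences",
    ".fq": "sequences",
    ".vcf": "variants",
    ".gff": "annotations",
    ".gtf": "annotations",
}


def suggest_modules(filename: str) -> list[str]:
    """Return candidate module names for a file, or [] if the type is unknown."""
    name = filename.lower()
    if name.endswith(".gz"):  # look through gzip compression
        name = name[:-3]
    for ext, data_type in EXTENSION_TO_DATA_TYPE.items():
        if name.endswith(ext):
            return MODULES_BY_DATA_TYPE[data_type]
    return []
```

Count matrices and edge lists have no standard extension, so a real helper would need the caller to state the data type for those branches.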


Data Flow Architecture

graph LR
    A[Raw Data] --> B[Ingestion]
    B --> C[Validation]
    C --> D{Data Type}

    D -->|Genomic| E[DNA Pipeline]
    D -->|Transcriptomic| F[RNA Pipeline]
    D -->|Proteomic| G[Protein Pipeline]
    D -->|Phenotypic| H[Phenotype Pipeline]

    E --> I[Quality Control]
    F --> I
    G --> I
    H --> I

    I --> J[Analysis]
    J --> K{Integration Level}

    K -->|Single-omic| L[Individual Results]
    K -->|Multi-omic| M[Integrated Results]

    L --> N[Visualization]
    M --> N

    N --> O[Publication]
    O --> P[Scientific Insights]


    subgraph "Data Sources"
        Q[NCBI] -.-> E
        R[SRA] -.-> F
        S[PDB] -.-> G
        T[AntWiki] -.-> H
    end

    subgraph "Analysis Types"
        U[Sequence Analysis] -.-> E
        V[Expression Analysis] -.-> F
        W[Structure Analysis] -.-> G
        X[Trait Analysis] -.-> H
    end

    subgraph "Integration Methods"
        Y[GWAS] -.-> K
        Z[Networks] -.-> K
        AA[ML] -.-> K
        BB[Systems Biology] -.-> K
    end
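
The first half of the data flow (ingestion, validation, routing by data type, quality control, analysis) can be sketched as a small dispatcher. This is a toy illustration of the diagram's ordering, not METAINFORMANT code: `Record`, `PIPELINES`, `validate`, and `run` are hypothetical names, and the stages only record that they ran.

```python
from dataclasses import dataclass, field

# Data type -> pipeline stage, taken from the diagram's routing step.
PIPELINES = {
    "genomic": "DNA Pipeline",
    "transcriptomic": "RNA Pipeline",
    "proteomic": "Protein Pipeline",
    "phenotypic": "Phenotype Pipeline",
}


@dataclass
class Record:
    """Hypothetical unit of work flowing through the stages."""
    data_type: str
    payload: str
    steps: list[str] = field(default_factory=list)


def validate(record: Record) -> Record:
    """Reject unknown data types and empty payloads before routing."""
    if record.data_type not in PIPELINES:
        raise ValueError(f"unknown data type: {record.data_type!r}")
    if not record.payload:
        raise ValueError("empty payload")
    record.steps.append("Validation")
    return record


def run(record: Record) -> Record:
    """Apply the stages in the order shown in the diagram."""
    record.steps.append("Ingestion")
    validate(record)
    record.steps.append(PIPELINES[record.data_type])  # route by data type
    record.steps.extend(["Quality Control", "Analysis"])
    return record
```

Every pipeline converges on the same quality-control and analysis stages, which is why the routing step is the only place the data type is consulted.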

Quick Start

Installation

METAINFORMANT uses uv for Python package management. Install the released package with uv pip install metainformant, or install from a source checkout with uv pip install -e .:

# Install into the active environment
uv pip install metainformant

# From source
git clone https://github.com/docxology/MetaInformAnt.git
cd MetaInformAnt
uv pip install -e .

Basic Usage

from metainformant.dna.sequence import composition
from metainformant.rna.engine.workflow import AmalgkitWorkflowConfig, execute_workflow

# DNA: compute the GC content of a short sequence
seq = "ATCGATCGATCG"
gc = composition.gc_content(seq)
print(f"GC content: {gc:.2f}")

# RNA: configure and run an amalgkit workflow for one species
config = AmalgkitWorkflowConfig(
    work_dir="output/rna_analysis",
    species_list=["Apis_mellifera"],
)
results = execute_workflow(config)
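
For readers who want to see what the GC-content step computes, here is a standalone sketch in plain Python. It assumes GC content is reported as the fraction of G and C bases over the sequence length; `gc_fraction` is a hypothetical name, and the library's `composition.gc_content` may differ in scale or in how it treats ambiguous bases.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq; 0.0 for an empty sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# The example sequence has 3 G and 3 C out of 12 bases.
print(f"GC content: {gc_fraction('ATCGATCGATCG'):.2f}")  # prints "GC content: 0.50"
```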

Complete Pipeline Example (DNA → GWAS → Visualization)

For a production-grade end-to-end workflow covering genomic association studies, see the Integration Guide. It demonstrates how to:

  • Parse & filter VCF files (dna.variants)
  • Run mixed-model association (gwas.analysis)
  • Construct fine-mapping credible sets (gwas.finemapping)
  • Generate Manhattan, QQ, and LocusZoom plots (visualization)
  • Orchestrate distributed processing with caching

The guide provides both minimal and full-featured implementations with performance benchmarks and troubleshooting tips.
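
To give a sense of the first step in that list, the sketch below filters VCF records by the QUAL and FILTER columns using only the standard library. It is an illustration of the idea, not the `dna.variants` API: `filter_vcf_lines` and the `min_qual` threshold of 30 are assumptions made for this example.

```python
from typing import Iterable, Iterator


def filter_vcf_lines(lines: Iterable[str], min_qual: float = 30.0) -> Iterator[str]:
    """Yield header lines plus records that pass the QUAL and FILTER checks."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#"):  # keep meta-information and the column header
            yield line
            continue
        fields = line.split("\t")
        if len(fields) < 8:  # skip blank or malformed records
            continue
        qual, filt = fields[5], fields[6]  # VCF columns 6 (QUAL) and 7 (FILTER)
        if qual == "." or float(qual) < min_qual:
            continue
        if filt not in ("PASS", "."):
            continue
        yield line
```

The function works on any iterable of lines, so it can be fed an open file handle and streamed without loading the whole VCF into memory.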


Command Line Interface

The metainformant command exposes a small CLI: --version, --modules, protein, quality batch-detect, rna info, and gwas info. Full RNA and GWAS pipelines are run through the Python APIs or the scripts/*/run_*.py scripts rather than the CLI. See cli.md for details.

uv run metainformant --help
uv run metainformant protein comp --fasta data/example.faa
python3 scripts/rna/run_workflow.py --config config/amalgkit/amalgkit_pogonomyrmex_barbatus.yaml

Documentation Contents

User Guides

:maxdepth: 2
:caption: User Guides

setup
UV_SETUP
DOCUMENTATION_GUIDE
TUTORIALS
INTEGRATION
FAQ
DISK_SPACE_MANAGEMENT
LINUX_TRANSFER
rna/amalgkit/TROUBLESHOOTING
ERROR_HANDLING

API Reference

:maxdepth: 2
:caption: Reference

architecture
cli
SPEC
ORCHESTRATION
COMPARISON_GUIDES
NO_MOCKING_POLICY

Module Documentation

:maxdepth: 2
:caption: Modules

core/index
dna/index
rna/index
protein/index
gwas/index
epigenome/index
eqtl/README
ontology/index
phenotype/index
ecology/index
math/index
information/index
ml/index
networks/index
multiomics/index
singlecell/index
quality/index
visualization/index
simulation/index
life_events/index
longread/index
cloud/index
mcp/index
metagenomics/index
structural_variants/index
spatial/index
pharmacogenomics/index
metabolomics/index
menu/index
cloud/README
agents/rules/index

Development

:maxdepth: 2
:caption: Development

testing

Key Features

Comprehensive Domain Coverage

  • DNA Analysis: Sequence composition, alignments, phylogenetics, population genetics
  • RNA Analysis: Transcriptome quantification, differential expression, cross-species analysis
  • Protein Analysis: Structure prediction, domain analysis, functional annotation
  • GWAS: Genome-wide association studies with population structure correction
  • Epigenomics: DNA methylation analysis, chromatin accessibility
  • Systems Biology: Network analysis, pathway enrichment, multi-omics integration

Production Ready

  • Real Implementations: No mocks or fakes - actual external API calls and tool integration
  • Scalable: Parallel processing, memory-efficient algorithms for large datasets
  • Robust: Comprehensive error handling and validation
  • Tested: Extensive test suite with real-world validation

Developer Friendly

  • Type Hints: Full type annotation throughout codebase
  • Documentation: Comprehensive docstrings and API documentation
  • CLI: Intuitive command-line interface with subcommands
  • Modular: Clean separation of concerns, easy to extend

Research Grade

  • Scientific Rigor: Algorithms validated against established methods
  • Reproducible: Version-controlled configurations and deterministic workflows
  • Standards Compliant: Follows bioinformatics best practices and data formats

Architecture

graph TB
    subgraph "User Interfaces"
        CLI[Command Line Interface]
        API[Python API]
        Scripts[Workflow Scripts]
    end

    subgraph Core["Core Framework"]
        Workflow[Workflow Orchestration]
        Config[Configuration Management]
        IO["Input/Output Utilities"]
        Logging[Structured Logging]
    end

    subgraph "Domain Modules"
        DNA[DNA Analysis]
        RNA[RNA Analysis]
        PROT[Protein Analysis]
        GWAS[GWAS Analysis]
        EPI[Epigenomics Analysis]
        ONT[Ontology Analysis]
        PHENO[Phenotype Analysis]
        ECOL[Ecology Analysis]
        MATH[Math Biology]
        ML[Machine Learning]
        NET[Networks Analysis]
        SC[Single-Cell Analysis]
        QUAL[Quality Control]
        VIZ[Visualization Module]
        SIM[Simulation Module]
        LE[Life Events Analysis]
    end

    CLI --> Workflow
    API --> Workflow
    Scripts --> Workflow

    Workflow --> Config
    Workflow --> IO
    Workflow --> Logging

    DNA --> Core
    RNA --> Core
    PROT --> Core
    GWAS --> Core
    EPI --> Core
    ONT --> Core
    PHENO --> Core
    ECOL --> Core
    MATH --> Core
    ML --> Core
    NET --> Core
    SC --> Core
    QUAL --> Core
    VIZ --> Core
    SIM --> Core
    LE --> Core
    LR[Long-Read Analysis] --> Core
    METAG[Metagenomics Analysis] --> Core
    SVA[Structural Variants] --> Core
    SPAT[Spatial Transcriptomics] --> Core
    PHARM[Pharmacogenomics] --> Core
    METAB[Metabolomics] --> Core
    MENUX[Menu System] --> Core
    CLOUD[Cloud Deployment] --> Core

Common Tasks & Quick Commands

:maxdepth: 1
:caption: Task Reference

tasks/analyze_dna
tasks/run_rna_pipeline
tasks/run_gwas
tasks/deploy_cloud
tasks/visualize_results
tasks/mcp_integration
tasks/performance_tuning
tasks/data_conversion

Getting Help

If you'd like to contribute, see CONTRIBUTING.md.

Community Support

  • GitHub Issues: Report bugs and request features
  • Discussions: Ask questions and share ideas
  • Documentation: Comprehensive guides and API reference

Development

  • Contributing Guide: How to contribute to METAINFORMANT
  • Development Setup: Setting up development environment
  • Testing Guide: Running and writing tests
  • API Design: Understanding the codebase architecture

METAINFORMANT is a comprehensive bioinformatics toolkit designed for modern multi-omics research. Whether you're analyzing genomes, transcriptomes, proteomes, or integrating multi-omics datasets, METAINFORMANT provides the tools and workflows you need.