RNA Module API Reference

Complete function and method reference for the METAINFORMANT RNA analysis module.

Quick Navigation

Amalgkit Step Functions - High-level step wrappers
Step Runner Functions - Low-level step execution
Workflow Functions - Workflow planning and execution
Genome Preparation Functions - Genome setup and indexing
Orchestration Functions - Multi-species workflow management
Utility Functions - CLI helpers and checks
Processing Functions - Sample processing pipelines
Monitoring Functions - Workflow progress and sample status tracking
Environment Functions - Tool availability and environment validation
Cleanup Functions - Partial download cleanup and file naming fixes
Analysis Functions - RNA-protein integration and translation analysis
Validation Functions - Sample and pipeline validation
Discovery Functions - Species discovery and configuration generation

Amalgkit Step Functions

High-level wrapper functions for each amalgkit subcommand. These functions provide a clean Python interface to the amalgkit CLI.

`metadata`

def metadata(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]

Purpose: Retrieve RNA-seq metadata from NCBI SRA/ENA databases.

Parameters:

params: Dictionary of amalgkit parameters (e.g., {"out_dir": "work", "search_string": "..."})
**kwargs: Additional arguments passed to run_amalgkit (e.g., work_dir, log_dir, check)

Returns: subprocess.CompletedProcess[str] with execution results

See Also: Step Documentation: metadata

`integrate`

def integrate(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]

Purpose: Integrate FASTQ file paths into metadata tables.

See Also: Step Documentation: integrate

`config`

def config(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]

Purpose: Generate configuration files for metadata selection.

See Also: Step Documentation: config

`select`

def select(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]

Purpose: Select and filter SRA entries based on quality criteria.

See Also: Step Documentation: select

`getfastq`

def getfastq(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]

Purpose: Download and convert SRA files to FASTQ format.

See Also: Step Documentation: getfastq

`quant`

def quant(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]

Purpose: Quantify transcript abundances using kallisto or salmon.

See Also: Step Documentation: quant

`merge`

def merge(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]

Purpose: Merge per-sample quantification results into expression matrices.

See Also: Step Documentation: merge

`cstmm`

def cstmm(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]

Purpose: Cross-species TMM (Trimmed Mean of M-values) normalization.

See Also: Step Documentation: cstmm

`curate`

def curate(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]

Purpose: Quality control, outlier detection, and batch effect correction.

See Also: Step Documentation: curate

`csca`

def csca(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]

Purpose: Cross-species correlation analysis and visualization.

See Also: Step Documentation: csca

`sanity`

def sanity(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]

Purpose: Validate workflow outputs and check data integrity.

See Also: Step Documentation: sanity

Step Runner Functions

Low-level step execution functions that provide more control over step execution. These are used internally by workflow orchestration but can also be called directly.

`run_metadata`

def run_metadata(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]

Module: metainformant.rna.engine.workflow_steps

Purpose: Execute metadata retrieval step with explicit directory control.

`run_integrate`

def run_integrate(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]

Module: metainformant.rna.engine.workflow_steps

`run_config`

def run_config(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]

Module: metainformant.rna.engine.workflow_steps

`run_select`

def run_select(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]

Module: metainformant.rna.engine.workflow_steps

`run_getfastq`

def run_getfastq(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]

Module: metainformant.rna.engine.sra_extraction

Note: Includes robust retry logic and fallback mechanisms for failed downloads.

`run_quant`

def run_quant(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]

Module: metainformant.rna.engine.workflow_steps

`run_merge`

def run_merge(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]

Module: metainformant.rna.engine.workflow_steps

`run_cstmm`

def run_cstmm(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]

Module: metainformant.rna.engine.workflow_steps

`run_curate`

def run_curate(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]

Module: metainformant.rna.engine.workflow_steps

`run_csca`

def run_csca(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]

Module: metainformant.rna.engine.workflow_steps

`run_sanity`

def run_sanity(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]

Module: metainformant.rna.engine.workflow_steps

Workflow Functions

Functions for planning and executing complete RNA-seq workflows.

`load_workflow_config`

def load_workflow_config(config_file: str | Path) -> AmalgkitWorkflowConfig

Module: metainformant.rna.engine.workflow

Purpose: Load workflow configuration from YAML file.

Parameters:

config_file: Path to YAML configuration file

Returns: AmalgkitWorkflowConfig dataclass instance

See Also: Configuration Guide

`plan_workflow`

def plan_workflow(config: AmalgkitWorkflowConfig) -> list[tuple[str, AmalgkitParams]]

Module: metainformant.rna.engine.workflow

Purpose: Generate ordered list of workflow steps with parameters (dry-run planning).

Parameters:

config: Workflow configuration

Returns: List of (step_name, params) tuples in execution order

See Also: Workflow Guide

`plan_workflow_with_params`

def plan_workflow_with_params(
    config: AmalgkitWorkflowConfig,
    step_params: dict[str, AmalgkitParams],
) -> list[tuple[str, AmalgkitParams]]

Module: metainformant.rna.engine.workflow

Purpose: Plan workflow with explicit per-step parameter overrides.

`execute_workflow`

def execute_workflow(
    config: AmalgkitWorkflowConfig,
    *,
    check: bool = False,
    walk: bool = False,
    progress: bool = True,
    show_commands: bool = False,
) -> list[int]

Module: metainformant.rna.engine.workflow

Purpose: Execute complete workflow from configuration.

Parameters:

config: Workflow configuration
check: If True, raise exception on step failure
walk: If True (and running in a TTY), pause before each stage and wait for Enter
progress: If True, show a stage-level progress bar (uses tqdm if available)
show_commands: If True, log the exact amalgkit ... command for each stage before running it

Returns: List of return codes (one per step)

See Also: Workflow Guide

`AmalgkitWorkflowConfig`

@dataclass
class AmalgkitWorkflowConfig:
    work_dir: Path
    threads: int = 6
    species_list: list[str] = field(default_factory=list)
    log_dir: Path | None = None
    manifest_path: Path | None = None
    per_step: dict[str, AmalgkitParams] = field(default_factory=dict)
    auto_install_amalgkit: bool = False
    genome: dict[str, Any] | None = None
    filters: dict[str, Any] = field(default_factory=dict)

Module: metainformant.rna.engine.workflow

Purpose: Configuration dataclass for workflow execution.

See Also: Configuration Guide

Genome Preparation Functions

Functions for downloading genomes, preparing transcriptomes, and building kallisto indexes.

`prepare_genome_for_quantification`

def prepare_genome_for_quantification(
    genome_dir: Path,
    species_name: str,
    work_dir: Path,
    *,
    accession: str | None = None,
    build_index: bool = True,
    kmer_size: int = 31,
) -> dict[str, Any]

Module: metainformant.rna.genome_prep

Purpose: Complete genome preparation pipeline (download → extract → index).

Returns: Dictionary with success, fasta_path, index_path, error keys

See Also: Genome Preparation

`prepare_transcriptome_for_kallisto`

def prepare_transcriptome_for_kallisto(
    genome_dir: Path,
    species_name: str,
    work_dir: Path,
    *,
    accession: str | None = None,
) -> Path | None

Module: metainformant.rna.genome_prep

Purpose: Extract and prepare RNA FASTA file from genome package.

Returns: Path to prepared FASTA file or None if failed

`build_kallisto_index`

def build_kallisto_index(
    fasta_path: Path,
    index_path: Path,
    *,
    kmer_size: int = 31,
    check_existing: bool = True,
) -> bool

Module: metainformant.rna.genome_prep

Purpose: Build kallisto index from transcriptome FASTA.

Parameters:

fasta_path: Path to transcriptome FASTA file
index_path: Output path for kallisto index
kmer_size: K-mer size (31 for standard reads, 23 for short reads)
check_existing: Skip if index already exists

Returns: True if index was built successfully or already exists

`find_rna_fasta_in_genome_dir`

def find_rna_fasta_in_genome_dir(
    genome_dir: Path,
    accession: str
) -> Path | None

Module: metainformant.rna.genome_prep

Purpose: Locate RNA FASTA file in extracted genome directory.

`download_rna_fasta_from_ftp`

def download_rna_fasta_from_ftp(
    ftp_url: str,
    genome_dir: Path,
    accession: str,
    assembly_name: str | None = None,
    config: dict[str, Any] | None = None,
) -> Path | None

Module: metainformant.rna.genome_prep

Purpose: Download RNA FASTA directly from NCBI FTP.

`download_cds_fasta_from_ftp`

def download_cds_fasta_from_ftp(
    ftp_url: str,
    genome_dir: Path,
    accession: str,
    assembly_name: str | None = None,
) -> Path | None

Module: metainformant.rna.genome_prep

Purpose: Download CDS FASTA directly from NCBI FTP.

`extract_transcripts_from_gff`

def extract_transcripts_from_gff(
    gff_path: Path,
    genome_fasta: Path,
    output_fasta: Path,
) -> bool

Module: metainformant.rna.genome_prep

Purpose: Extract transcript sequences from GFF annotation using gffread.

`get_expected_index_path`

def get_expected_index_path(work_dir: Path, species_name: str) -> Path

Module: metainformant.rna.genome_prep

Purpose: Get expected kallisto index path for a species.

`verify_genome_status`

def verify_genome_status(
    genome_dir: Path,
    work_dir: Path,
    species_name: str,
    accession: str | None = None,
) -> dict[str, Any]

Module: metainformant.rna.genome_prep

Purpose: Check genome download, transcriptome preparation, and index status.

Returns: Dictionary with status flags and paths

`orchestrate_genome_setup`

def orchestrate_genome_setup(
    config_dir: Path = Path("config/amalgkit"),
    *,
    species: str | None = None,
    skip_download: bool = False,
    skip_prepare: bool = False,
    skip_build: bool = False,
) -> dict[str, Any]

Module: metainformant.rna.genome_prep

Purpose: Run complete genome setup pipeline for all or specific species.

See Also: Genome Setup Guide

Orchestration Functions

Functions for managing multi-species workflows and monitoring progress.

`discover_species_configs`

def discover_species_configs(
    config_dir: Path = Path("config/amalgkit")
) -> dict[str, dict[str, Any]]

Module: metainformant.rna.orchestration

Purpose: Discover all species configuration files in config directory.

Returns: Dictionary mapping species names to config dictionaries

`run_workflow_for_species`

def run_workflow_for_species(
    config_path: Path,
    steps: Sequence[str] | None = None,
    *,
    check: bool = False,
) -> dict[str, Any]

Module: metainformant.rna.orchestration

Purpose: Execute workflow steps for a single species.

Parameters:

config_path: Path to species workflow config file
steps: List of steps to run (default: all steps)
check: If True, stop on first failure

Returns: Dictionary with success, completed, failed, return_codes keys

`check_workflow_status`

def check_workflow_status(
    config_path: Path,
    *,
    detailed: bool = False,
) -> dict[str, Any]

Module: metainformant.rna.orchestration

Purpose: Check workflow status for a species. Unified interface that delegates to monitoring functions.

Parameters:

config_path: Path to species workflow config file
detailed: If True, return detailed analysis via analyze_species_status(); if False, return progress summary via check_workflow_progress()

Returns: Dictionary with status information (format depends on detailed parameter)

Note: This is a convenience wrapper. For direct access:

Use check_workflow_progress() for progress summary
Use analyze_species_status() for detailed analysis

`cleanup_unquantified_samples`

def cleanup_unquantified_samples(
    config_path: Path,
    *,
    log_dir: Path | None = None,
) -> tuple[int, int]

Module: metainformant.rna.orchestration

Purpose: Quantify downloaded samples and cleanup FASTQs. Finds all samples with FASTQ files but no quantification results, quantifies each sample, and deletes FASTQ files after successful quantification.

Parameters:

config_path: Path to species workflow config file
log_dir: Optional log directory

Returns: Tuple of (quantified_count, failed_count)

`monitor_workflows`

def monitor_workflows(
    species_configs: dict[str, Path],
    watch_interval: int = 60,
) -> None

Module: metainformant.rna.orchestration

Purpose: Real-time monitoring of multiple species workflows.

Parameters:

species_configs: Dictionary mapping species_id -> config_path
watch_interval: Update interval in seconds (default: 60)

Note: This function runs continuously until interrupted (Ctrl+C). It displays a real-time dashboard showing progress for all monitored species.

Utility Functions

Core utilities for CLI interaction and parameter handling.

`check_cli_available`

def check_cli_available() -> tuple[bool, str]

Module: metainformant.rna.amalgkit

Purpose: Check if amalgkit CLI is available on PATH.

Returns: Tuple of (is_available: bool, help_text_or_error: str)

`ensure_cli_available`

def ensure_cli_available(
    *,
    auto_install: bool = False
) -> tuple[bool, str, dict | None]

Module: metainformant.rna.amalgkit

Purpose: Ensure amalgkit CLI is available, optionally attempting auto-install.

Returns: Tuple of (ok: bool, message: str, install_record: dict | None)

`build_cli_args`

def build_cli_args(
    params: AmalgkitParams | None,
    *,
    for_cli: bool = False
) -> list[str]

Module: metainformant.rna.amalgkit

Purpose: Convert parameter dictionary to CLI argument list.

Parameters:

params: Parameter mapping
for_cli: If True, use snake_case flags (for actual CLI); if False, use kebab-case (for display)

Returns: List of CLI argument strings

`build_amalgkit_command`

def build_amalgkit_command(
    subcommand: str,
    params: AmalgkitParams | None = None
) -> list[str]

Module: metainformant.rna.amalgkit

Purpose: Build complete amalgkit command as token list.

Example: build_amalgkit_command("metadata", {"threads": 8}) → ["amalgkit", "metadata", "--threads", "8"]

`run_amalgkit`

def run_amalgkit(
    subcommand: str,
    params: AmalgkitParams | None = None,
    *,
    work_dir: str | Path | None = None,
    env: Mapping[str, str] | None = None,
    check: bool = False,
    capture_output: bool = True,
    log_dir: str | Path | None = None,
    step_name: str | None = None,
) -> subprocess.CompletedProcess[str]

Module: metainformant.rna.amalgkit

Purpose: Execute amalgkit subcommand with optional logging and directory management.

Parameters:

subcommand: Amalgkit subcommand name (e.g., "metadata", "quant")
params: Parameter dictionary
work_dir: Working directory (created if missing)
env: Additional environment variables
check: Raise exception on non-zero exit
capture_output: Capture stdout/stderr
log_dir: Directory for timestamped log files
step_name: Optional label for log filenames

Returns: subprocess.CompletedProcess[str] with execution results

Processing Functions

Functions for sample-level processing pipelines.

`quantify_sample`

def quantify_sample(
    sample_id: str,
    metadata_rows: list[dict[str, Any]],
    quant_params: Mapping[str, Any],
    *,
    log_dir: Path | None = None,
    step_name: str | None = None,
) -> tuple[bool, str, Path | None]

Module: metainformant.rna.engine.workflow_steps

Purpose: Quantify a single sample using amalgkit quant.

Returns: Tuple of (success: bool, message: str, abundance_file: Path | None)

`convert_sra_to_fastq`

def convert_sra_to_fastq(
    sample_id: str,
    sra_file: Path,
    output_dir: Path,
    *,
    threads: int = 4,
    log_dir: Path | None = None,
) -> tuple[bool, str, list[Path]]

Module: metainformant.rna.engine.sra_extraction

Purpose: Convert a local SRA file to FASTQ format. Prefers parallel-fastq-dump (works better with local files) and falls back to fasterq-dump if needed. Automatically compresses output FASTQ files.

Args:

sample_id: SRA accession ID (e.g., "SRR1234567")
sra_file: Path to the SRA file
output_dir: Directory where FASTQ files should be written
threads: Number of threads for conversion (default: 4)
log_dir: Optional directory for log files

Returns: Tuple of (success: bool, message: str, fastq_files: list[Path]). fastq_files contains paths to created FASTQ files (may be empty if failed).

Notes:

Automatically detects and uses the real fasterq-dump binary (not wrapper scripts)
Passes --size-check off to fasterq-dump to prevent "disk-limit exceeded" errors
Automatically compresses output using pigz or gzip
Checks for existing FASTQ files before conversion

`delete_sample_fastqs`

def delete_sample_fastqs(
    sample_id: str,
    fastq_dir: Path
) -> None

Module: metainformant.rna.engine.sra_extraction

Purpose: Delete FASTQ files for a specific sample.

`run_download_quant_workflow`

def run_download_quant_workflow(
    metadata_path: str | Path,
    getfastq_params: Mapping[str, Any] | None = None,
    quant_params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    num_workers: int = 1,
    max_samples: int | None = None,
    skip_completed: bool = True,
    progress_monitor: DownloadProgressMonitor | None = None,
) -> dict[str, Any]

Module: metainformant.rna.engine.pipeline

Purpose: Unified function for download-quantify-delete workflows. Supports both sequential and parallel processing modes.

Parameters:

metadata_path: Path to metadata TSV with sample list
getfastq_params: Parameters for amalgkit getfastq step
quant_params: Parameters for amalgkit quant step
work_dir: Working directory for amalgkit commands
log_dir: Directory for step logs
num_workers: Number of parallel download workers (default: 1 for sequential mode)
- num_workers=1: Sequential mode (one sample at a time, maximum disk efficiency)
- num_workers>1: Parallel mode (N downloads in parallel, sequential quantification)
max_samples: Optional limit on number of samples to process
skip_completed: If True, skip samples that are already quantified (sequential mode only)
progress_monitor: Optional progress monitor for tracking downloads

Returns: Dictionary with processing statistics:

total_samples: Total number of samples
processed: Number of samples successfully processed
skipped: Number of samples skipped (already done)
failed: Number of samples that failed
failed_runs: List of run IDs that failed

Processing Modes:

Sequential (num_workers=1): Process one sample at a time. Maximum disk efficiency: only one sample's FASTQs exist at a time.
Parallel (num_workers>1): Multiple download workers fetch FASTQ files in parallel, then a single quantification worker processes them sequentially. Maximizes throughput while preventing disk exhaustion.

See Also: Architecture Documentation

Type Definitions

`AmalgkitParams`

AmalgkitParams = Mapping[str, Any]

Module: metainformant.rna.amalgkit

Purpose: Type alias for amalgkit parameter dictionaries.

Monitoring Functions

Functions for tracking workflow progress and sample status.

`count_quantified_samples`

def count_quantified_samples(config_path: Path) -> tuple[int, int]

Module: metainformant.rna.monitoring

Purpose: Count quantified and total samples for a species.

Returns: Tuple of (quantified_count, total_count)

`get_sample_status`

def get_sample_status(config_path: Path, sample_id: str) -> dict[str, Any]

Module: metainformant.rna.monitoring

Purpose: Get detailed status for a single sample.

Returns: Dictionary with status information:

quantified: bool
has_fastq: bool
has_sra: bool
is_downloading: bool
status: str ("quantified", "downloading", "has_fastq", "has_sra", "undownloaded")

`analyze_species_status`

def analyze_species_status(config_path: Path) -> dict[str, Any]

Module: metainformant.rna.monitoring

Purpose: Comprehensive analysis of species workflow status.

Returns: Dictionary with comprehensive status information:

total_in_metadata: int
quantified: int
quantified_and_deleted: int
quantified_not_deleted: int
downloading: int
failed_download: int
undownloaded: int
categories: dict mapping category -> list of sample_ids

`find_unquantified_samples`

def find_unquantified_samples(config_path: Path) -> list[str]

Module: metainformant.rna.monitoring

Purpose: Find all unquantified samples.

Returns: List of sample IDs that are not quantified

`check_active_downloads`

def check_active_downloads() -> set[str]

Module: metainformant.rna.monitoring

Purpose: Check for samples currently being downloaded.

Returns: Set of sample IDs that are actively downloading

`check_workflow_progress`

def check_workflow_progress(config_path: Path) -> dict[str, Any]

Module: metainformant.rna.monitoring

Purpose: Get workflow progress summary for a species.

Returns: Dictionary with progress information:

quantified: int
total: int
percentage: float
remaining: int
downloading: int (number of samples currently downloading)
has_files: int (number of samples with downloaded files but not quantified)

Note: This is called by check_workflow_status() when detailed=False. Use check_workflow_status() for the unified interface.

`assess_all_species_progress`

def assess_all_species_progress(
    config_dir: Path,
    *,
    repo_root: Path | None = None,
) -> dict[str, dict[str, Any]]

Module: metainformant.rna.monitoring

Purpose: Assess progress for all species in config directory.

Returns: Dictionary mapping species_id -> progress information

`initialize_progress_tracking`

def initialize_progress_tracking(
    config_path: Path,
    *,
    tracker=None,
) -> dict[str, Any]

Module: metainformant.rna.monitoring

Purpose: Initialize progress tracking for a species.

Returns: Dictionary with initialization results

Environment Functions

Functions for checking tool availability and environment validation.

`check_amalgkit`

def check_amalgkit() -> tuple[bool, str]

Module: metainformant.rna.environment

Purpose: Check if amalgkit is available and get version.

Returns: Tuple of (is_available: bool, message: str)

`check_sra_toolkit`

def check_sra_toolkit() -> tuple[bool, str]

Module: metainformant.rna.environment

Purpose: Check if SRA Toolkit is installed.

Returns: Tuple of (is_available: bool, message: str)

`check_kallisto`

def check_kallisto() -> tuple[bool, str]

Module: metainformant.rna.environment

Purpose: Check if kallisto is installed.

Returns: Tuple of (is_available: bool, message: str)

`check_metainformant`

def check_metainformant() -> tuple[bool, str]

Module: metainformant.rna.environment

Purpose: Check if metainformant package is installed.

Returns: Tuple of (is_available: bool, message: str)

`check_virtual_env`

def check_virtual_env() -> tuple[bool, str]

Module: metainformant.rna.environment

Purpose: Check if running inside a virtual environment.

Returns: Tuple of (is_in_venv: bool, message: str)

`check_rscript`

def check_rscript() -> tuple[bool, str]

Module: metainformant.rna.environment

Purpose: Check if Rscript is available.

Returns: Tuple of (is_available: bool, message: str)

`check_dependencies`

def check_dependencies() -> dict[str, tuple[bool, str]]

Module: metainformant.rna.environment

Purpose: Check all required dependencies for RNA-seq workflows.

Returns: Dictionary mapping dependency name -> (is_available: bool, message: str)

`validate_environment`

def validate_environment() -> dict[str, Any]

Module: metainformant.rna.environment

Purpose: Comprehensive environment validation.

Returns: Dictionary with validation results:

all_passed: bool
dependencies: dict mapping name -> (is_available, message)
recommendations: list of strings with recommendations

Cleanup Functions

Functions for cleaning up partial downloads and fixing file naming issues.

`cleanup_partial_downloads`

def cleanup_partial_downloads(
    config_path: Path,
    *,
    dry_run: bool = False,
) -> dict[str, Any]

Module: metainformant.rna.cleanup

Purpose: Clean up partial downloads for a species.

Parameters:

config_path: Path to species workflow config file
dry_run: If True, only report what would be deleted

Returns: Dictionary with deleted, freed_mb, errors keys

`fix_abundance_naming`

def fix_abundance_naming(quant_dir: Path, sample_id: str) -> bool

Module: metainformant.rna.cleanup

Purpose: Create symlink from abundance.tsv to {SRR}_abundance.tsv for amalgkit merge.

Parameters:

quant_dir: Directory containing quantification results
sample_id: Sample ID (e.g., 'SRR1234567')

Returns: True if symlink was created or already exists

`fix_abundance_naming_for_species`

def fix_abundance_naming_for_species(
    config_path: Path,
) -> tuple[int, int]

Module: metainformant.rna.cleanup

Purpose: Fix abundance naming for all samples in a species.

Returns: Tuple of (created_count, already_exists_count)

Analysis Functions

Functions for RNA-protein integration, translation efficiency analysis, and ribosome profiling.

`calculate_translation_efficiency`

def calculate_translation_efficiency(
    rna_df: pd.DataFrame,
    protein_df: pd.DataFrame,
    method: str = "ratio",
) -> pd.DataFrame

Module: metainformant.rna.analysis.protein_integration

Purpose: Calculate translation efficiency from RNA and protein levels.

Parameters:

rna_df: RNA expression data (samples × genes)
protein_df: Protein abundance data (samples × genes)
method: Calculation method ("ratio" or "correlation")

Returns: DataFrame with columns gene_id, efficiency, method

`predict_protein_abundance_from_rna`

def predict_protein_abundance_from_rna(
    rna_df: pd.DataFrame,
    training_rna: pd.DataFrame | None = None,
    training_protein: pd.DataFrame | None = None,
    method: str = "linear",
) -> pd.DataFrame

Module: metainformant.rna.analysis.protein_integration

Purpose: Predict protein abundance levels from RNA expression data.

Parameters:

rna_df: RNA expression data for prediction
training_rna: Optional RNA data from training set
training_protein: Optional protein data from training set
method: Prediction method ("linear" or "lognormal")

Returns: DataFrame with predicted protein abundance (same shape as rna_df)

`ribosome_profiling_integration`

def ribosome_profiling_integration(
    rna_df: pd.DataFrame,
    ribo_df: pd.DataFrame,
) -> pd.DataFrame

Module: metainformant.rna.analysis.protein_integration

Purpose: Integrate ribosome profiling data with RNA-seq to identify translationally regulated genes.

Parameters:

rna_df: RNA-seq expression data (samples × genes)
ribo_df: Ribosome profiling data (samples × genes)

Returns: DataFrame with columns:

gene_id: Gene identifier
rna_level: Mean RNA expression
ribo_level: Mean ribosome occupancy
translation_rate: Ribo/RNA ratio
translationally_regulated: Boolean indicating significant regulation

Validation Functions

Functions for validating RNA-seq pipeline outputs and sample quality.

`validate_all_samples`

def validate_all_samples(config: AmalgkitWorkflowConfig) -> dict[str, Any]

Module: metainformant.rna.analysis.validation

Purpose: Validate all samples in a workflow for pipeline completion.

Parameters:

config: Workflow configuration

Returns: Dictionary with validation results including per-sample status and summary statistics

Discovery Functions

Functions for discovering species with RNA-seq data and generating configurations.

`search_species_with_rnaseq`

def search_species_with_rnaseq(
    search_query: str,
    *,
    max_records: int = 10000,
) -> dict[str, dict[str, Any]]

Module: metainformant.rna.discovery

Purpose: Search NCBI SRA for species with RNA-seq data.

Parameters:

search_query: NCBI Entrez search query
max_records: Maximum number of records to retrieve

Returns: Dictionary mapping species names to metadata

Raises: ImportError if Biopython is not available

`get_genome_info`

def get_genome_info(taxonomy_id: str, species_name: str) -> dict[str, Any] | None

Module: metainformant.rna.discovery

Purpose: Get genome assembly information for a species.

Parameters:

taxonomy_id: NCBI taxonomy ID
species_name: Scientific name

Returns: Genome information dictionary or None

`generate_config_yaml`

def generate_config_yaml(
    species_name: str,
    species_data: dict[str, Any],
    genome_info: dict[str, Any] | None = None,
    *,
    repo_root: Path | None = None,
) -> str

Module: metainformant.rna.discovery

Purpose: Generate amalgkit YAML configuration for a species.

Parameters:

species_name: Scientific name
species_data: RNA-seq metadata
genome_info: Genome assembly metadata (optional)
repo_root: Repository root directory for paths (optional)

Returns: YAML configuration string

FilesExpand file tree

API.md

Latest commit

History

API.md

File metadata and controls

RNA Module API Reference

Quick Navigation

Amalgkit Step Functions

metadata

integrate

config

select

getfastq

quant

merge

cstmm

curate

csca

sanity

Step Runner Functions

run_metadata

run_integrate

run_config

run_select

run_getfastq

run_quant

run_merge

run_cstmm

run_curate

run_csca

run_sanity

Workflow Functions

load_workflow_config

plan_workflow

plan_workflow_with_params

execute_workflow

AmalgkitWorkflowConfig

Genome Preparation Functions

prepare_genome_for_quantification

prepare_transcriptome_for_kallisto

build_kallisto_index

find_rna_fasta_in_genome_dir

download_rna_fasta_from_ftp

download_cds_fasta_from_ftp

extract_transcripts_from_gff

get_expected_index_path

verify_genome_status

orchestrate_genome_setup

Orchestration Functions

discover_species_configs

run_workflow_for_species

check_workflow_status

cleanup_unquantified_samples

monitor_workflows

Utility Functions

check_cli_available

ensure_cli_available

build_cli_args

build_amalgkit_command

run_amalgkit

Processing Functions

quantify_sample

convert_sra_to_fastq

delete_sample_fastqs

run_download_quant_workflow

Type Definitions

AmalgkitParams

Monitoring Functions

count_quantified_samples

get_sample_status

analyze_species_status

find_unquantified_samples

check_active_downloads

check_workflow_progress

assess_all_species_progress

initialize_progress_tracking

Environment Functions

check_amalgkit

check_sra_toolkit

`metadata`

`integrate`

`config`

`select`

`getfastq`

`quant`

`merge`

`cstmm`

`curate`

`csca`

`sanity`

`run_metadata`

`run_integrate`

`run_config`

`run_select`

`run_getfastq`

`run_quant`

`run_merge`

`run_cstmm`

`run_curate`

`run_csca`

`run_sanity`

`load_workflow_config`

`plan_workflow`

`plan_workflow_with_params`

`execute_workflow`

`AmalgkitWorkflowConfig`

`prepare_genome_for_quantification`

`prepare_transcriptome_for_kallisto`

`build_kallisto_index`

`find_rna_fasta_in_genome_dir`

`download_rna_fasta_from_ftp`

`download_cds_fasta_from_ftp`

`extract_transcripts_from_gff`

`get_expected_index_path`

`verify_genome_status`

`orchestrate_genome_setup`

`discover_species_configs`

`run_workflow_for_species`

`check_workflow_status`

`cleanup_unquantified_samples`

`monitor_workflows`

`check_cli_available`

`ensure_cli_available`

`build_cli_args`

`build_amalgkit_command`

`run_amalgkit`

`quantify_sample`

`convert_sra_to_fastq`

`delete_sample_fastqs`

`run_download_quant_workflow`

`AmalgkitParams`

`count_quantified_samples`

`get_sample_status`

`analyze_species_status`

`find_unquantified_samples`

`check_active_downloads`

`check_workflow_progress`

`assess_all_species_progress`

`initialize_progress_tracking`

`check_amalgkit`

`check_sra_toolkit`

`check_kallisto`

`check_metainformant`

`check_virtual_env`

`check_rscript`

`check_dependencies`

`validate_environment`

`cleanup_partial_downloads`

`fix_abundance_naming`

`fix_abundance_naming_for_species`

`calculate_translation_efficiency`

`predict_protein_abundance_from_rna`

`ribosome_profiling_integration`

`validate_all_samples`

`search_species_with_rnaseq`

`get_genome_info`

`generate_config_yaml`