Complete function and method reference for the METAINFORMANT RNA analysis module.
- Amalgkit Step Functions - High-level step wrappers
- Step Runner Functions - Low-level step execution
- Workflow Functions - Workflow planning and execution
- Genome Preparation Functions - Genome setup and indexing
- Orchestration Functions - Multi-species workflow management
- Utility Functions - CLI helpers and checks
- Processing Functions - Sample processing pipelines
- Monitoring Functions - Workflow progress and sample status tracking
- Environment Functions - Tool availability and environment validation
- Cleanup Functions - Partial download cleanup and file naming fixes
- Analysis Functions - RNA-protein integration and translation analysis
- Validation Functions - Sample and pipeline validation
- Discovery Functions - Species discovery and configuration generation
High-level wrapper functions for each amalgkit subcommand. These functions provide a clean Python interface to the amalgkit CLI.
def metadata(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]
Purpose: Retrieve RNA-seq metadata from NCBI SRA/ENA databases.
Parameters:
- params: Dictionary of amalgkit parameters (e.g., {"out_dir": "work", "search_string": "..."})
- **kwargs: Additional arguments passed to run_amalgkit (e.g., work_dir, log_dir, check)
Returns: subprocess.CompletedProcess[str] with execution results
See Also: Step Documentation: metadata
def integrate(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]
Purpose: Integrate FASTQ file paths into metadata tables.
See Also: Step Documentation: integrate
def config(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]
Purpose: Generate configuration files for metadata selection.
See Also: Step Documentation: config
def select(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]
Purpose: Select and filter SRA entries based on quality criteria.
See Also: Step Documentation: select
def getfastq(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]
Purpose: Download and convert SRA files to FASTQ format.
See Also: Step Documentation: getfastq
def quant(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]
Purpose: Quantify transcript abundances using kallisto or salmon.
See Also: Step Documentation: quant
def merge(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]
Purpose: Merge per-sample quantification results into expression matrices.
See Also: Step Documentation: merge
def cstmm(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]
Purpose: Cross-species TMM (Trimmed Mean of M-values) normalization.
See Also: Step Documentation: cstmm
def curate(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]
Purpose: Quality control, outlier detection, and batch effect correction.
See Also: Step Documentation: curate
def csca(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]
Purpose: Cross-species correlation analysis and visualization.
See Also: Step Documentation: csca
def sanity(
    params: AmalgkitParams | None = None,
    **kwargs: Any
) -> subprocess.CompletedProcess[str]
Purpose: Validate workflow outputs and check data integrity.
See Also: Step Documentation: sanity
Low-level step execution functions that provide more control over step execution. These are used internally by workflow orchestration but can also be called directly.
def run_metadata(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]
Module: metainformant.rna.engine.workflow_steps
Purpose: Execute metadata retrieval step with explicit directory control.
def run_integrate(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]
Module: metainformant.rna.engine.workflow_steps
def run_config(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]
Module: metainformant.rna.engine.workflow_steps
def run_select(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]
Module: metainformant.rna.engine.workflow_steps
def run_getfastq(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]
Module: metainformant.rna.engine.sra_extraction
Note: Includes robust retry logic and fallback mechanisms for failed downloads.
def run_quant(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]
Module: metainformant.rna.engine.workflow_steps
def run_merge(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]
Module: metainformant.rna.engine.workflow_steps
def run_cstmm(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]
Module: metainformant.rna.engine.workflow_steps
def run_curate(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]
Module: metainformant.rna.engine.workflow_steps
def run_csca(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]
Module: metainformant.rna.engine.workflow_steps
def run_sanity(
    params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    check: bool = False,
) -> subprocess.CompletedProcess[str]
Module: metainformant.rna.engine.workflow_steps
Functions for planning and executing complete RNA-seq workflows.
def load_workflow_config(config_file: str | Path) -> AmalgkitWorkflowConfig
Module: metainformant.rna.engine.workflow
Purpose: Load workflow configuration from YAML file.
Parameters:
config_file: Path to YAML configuration file
Returns: AmalgkitWorkflowConfig dataclass instance
See Also: Configuration Guide
def plan_workflow(config: AmalgkitWorkflowConfig) -> list[tuple[str, AmalgkitParams]]
Module: metainformant.rna.engine.workflow
Purpose: Generate ordered list of workflow steps with parameters (dry-run planning).
Parameters:
config: Workflow configuration
Returns: List of (step_name, params) tuples in execution order
See Also: Workflow Guide
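As a rough illustration of what dry-run planning produces, the sketch below builds an ordered list of (step_name, params) tuples by layering per-step overrides on top of shared defaults. The names `STEP_ORDER` and `sketch_plan` are hypothetical, and the step order simply follows the listing order of the step functions in this reference; neither is taken from the module's implementation.

```python
from typing import Any, Mapping

# Assumed order: mirrors the order in which the step functions are listed in
# this reference. The real planner may order or filter steps differently.
STEP_ORDER = ["metadata", "integrate", "config", "select", "getfastq",
              "quant", "merge", "cstmm", "curate", "csca", "sanity"]


def sketch_plan(common: Mapping[str, Any],
                per_step: Mapping[str, Mapping[str, Any]]) -> list:
    """Hypothetical planner: per-step values override the shared defaults."""
    plan = []
    for step in STEP_ORDER:
        params = dict(common)                  # start from shared defaults
        params.update(per_step.get(step, {}))  # apply overrides for this step
        plan.append((step, params))
    return plan


plan = sketch_plan({"threads": 6}, {"quant": {"threads": 12}})
for name, params in plan:
    print(name, params)
```

Nothing is executed here; the result is only a list that a caller could inspect before running anything.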
def plan_workflow_with_params(
    config: AmalgkitWorkflowConfig,
    step_params: dict[str, AmalgkitParams],
) -> list[tuple[str, AmalgkitParams]]
Module: metainformant.rna.engine.workflow
Purpose: Plan workflow with explicit per-step parameter overrides.
def execute_workflow(
    config: AmalgkitWorkflowConfig,
    *,
    check: bool = False,
    walk: bool = False,
    progress: bool = True,
    show_commands: bool = False,
) -> list[int]
Module: metainformant.rna.engine.workflow
Purpose: Execute complete workflow from configuration.
Parameters:
- config: Workflow configuration
- check: If True, raise exception on step failure
- walk: If True (and running in a TTY), pause before each stage and wait for Enter
- progress: If True, show a stage-level progress bar (uses tqdm if available)
- show_commands: If True, log the exact amalgkit ... command for each stage before running it
Returns: List of return codes (one per step)
See Also: Workflow Guide
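The relationship between check and the returned list of return codes can be sketched as follows. This is a minimal stand-in, not the module's code: `sketch_execute` is a hypothetical name, the Python interpreter replaces real amalgkit invocations, and continuing past a failed step when check is False is an assumption.

```python
import subprocess
import sys


def sketch_execute(commands: list, *, check: bool = False) -> list:
    """Hypothetical runner: execute commands in order and collect return codes."""
    return_codes = []
    for cmd in commands:
        result = subprocess.run(cmd, capture_output=True, text=True)
        return_codes.append(result.returncode)
        if check and result.returncode != 0:
            # With check=True the first failing step aborts the run.
            raise RuntimeError(f"step failed with exit code {result.returncode}: {cmd}")
    return return_codes


# Stand-in commands: the Python interpreter exiting with a chosen status.
ok = [sys.executable, "-c", "raise SystemExit(0)"]
bad = [sys.executable, "-c", "raise SystemExit(3)"]

codes = sketch_execute([ok, bad, ok])
print(codes)
```

With check=False a caller can inspect the whole list afterwards; with check=True the failure surfaces immediately as an exception.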
@dataclass
class AmalgkitWorkflowConfig:
    work_dir: Path
    threads: int = 6
    species_list: list[str] = field(default_factory=list)
    log_dir: Path | None = None
    manifest_path: Path | None = None
    per_step: dict[str, AmalgkitParams] = field(default_factory=dict)
    auto_install_amalgkit: bool = False
    genome: dict[str, Any] | None = None
    filters: dict[str, Any] = field(default_factory=dict)
Module: metainformant.rna.engine.workflow
Purpose: Configuration dataclass for workflow execution.
See Also: Configuration Guide
Functions for downloading genomes, preparing transcriptomes, and building kallisto indexes.
def prepare_genome_for_quantification(
    genome_dir: Path,
    species_name: str,
    work_dir: Path,
    *,
    accession: str | None = None,
    build_index: bool = True,
    kmer_size: int = 31,
) -> dict[str, Any]
Module: metainformant.rna.genome_prep
Purpose: Complete genome preparation pipeline (download → extract → index).
Returns: Dictionary with success, fasta_path, index_path, error keys
See Also: Genome Preparation
def prepare_transcriptome_for_kallisto(
    genome_dir: Path,
    species_name: str,
    work_dir: Path,
    *,
    accession: str | None = None,
) -> Path | None
Module: metainformant.rna.genome_prep
Purpose: Extract and prepare RNA FASTA file from genome package.
Returns: Path to prepared FASTA file or None if failed
def build_kallisto_index(
    fasta_path: Path,
    index_path: Path,
    *,
    kmer_size: int = 31,
    check_existing: bool = True,
) -> bool
Module: metainformant.rna.genome_prep
Purpose: Build kallisto index from transcriptome FASTA.
Parameters:
- fasta_path: Path to transcriptome FASTA file
- index_path: Output path for kallisto index
- kmer_size: K-mer size (31 for standard reads, 23 for short reads)
- check_existing: Skip if index already exists
Returns: True if index was built successfully or already exists
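For orientation, the sketch below shows how the documented parameters could map onto a kallisto index command line and how check_existing might short-circuit the build. The helper names are hypothetical and the function only assembles the command; it does not run kallisto and is not the module's implementation. The -i and -k flags are kallisto's own index options, and kallisto requires an odd k-mer size of at most 31.

```python
from pathlib import Path


def sketch_needs_build(index_path: Path, check_existing: bool = True) -> bool:
    """Hypothetical check: skip the build when an index file is already present."""
    return not (check_existing and index_path.exists())


def sketch_kallisto_index_cmd(fasta_path: Path, index_path: Path,
                              kmer_size: int = 31) -> list:
    """Hypothetical helper: assemble `kallisto index` tokens without running them."""
    if kmer_size % 2 == 0 or not 1 <= kmer_size <= 31:
        raise ValueError("kallisto expects an odd k-mer size no larger than 31")
    return ["kallisto", "index",
            "-i", str(index_path),
            "-k", str(kmer_size),
            str(fasta_path)]


cmd = sketch_kallisto_index_cmd(Path("rna.fna"), Path("species.idx"), kmer_size=23)
print(" ".join(cmd))
print(sketch_needs_build(Path("no_such_dir/species.idx")))
```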
def find_rna_fasta_in_genome_dir(
    genome_dir: Path,
    accession: str
) -> Path | None
Module: metainformant.rna.genome_prep
Purpose: Locate RNA FASTA file in extracted genome directory.
def download_rna_fasta_from_ftp(
    ftp_url: str,
    genome_dir: Path,
    accession: str,
    assembly_name: str | None = None,
    config: dict[str, Any] | None = None,
) -> Path | None
Module: metainformant.rna.genome_prep
Purpose: Download RNA FASTA directly from NCBI FTP.
def download_cds_fasta_from_ftp(
    ftp_url: str,
    genome_dir: Path,
    accession: str,
    assembly_name: str | None = None,
) -> Path | None
Module: metainformant.rna.genome_prep
Purpose: Download CDS FASTA directly from NCBI FTP.
def extract_transcripts_from_gff(
    gff_path: Path,
    genome_fasta: Path,
    output_fasta: Path,
) -> bool
Module: metainformant.rna.genome_prep
Purpose: Extract transcript sequences from GFF annotation using gffread.
def get_expected_index_path(work_dir: Path, species_name: str) -> Path
Module: metainformant.rna.genome_prep
Purpose: Get expected kallisto index path for a species.
def verify_genome_status(
    genome_dir: Path,
    work_dir: Path,
    species_name: str,
    accession: str | None = None,
) -> dict[str, Any]
Module: metainformant.rna.genome_prep
Purpose: Check genome download, transcriptome preparation, and index status.
Returns: Dictionary with status flags and paths
def orchestrate_genome_setup(
    config_dir: Path = Path("config/amalgkit"),
    *,
    species: str | None = None,
    skip_download: bool = False,
    skip_prepare: bool = False,
    skip_build: bool = False,
) -> dict[str, Any]
Module: metainformant.rna.genome_prep
Purpose: Run complete genome setup pipeline for all or specific species.
See Also: Genome Setup Guide
Functions for managing multi-species workflows and monitoring progress.
def discover_species_configs(
    config_dir: Path = Path("config/amalgkit")
) -> dict[str, dict[str, Any]]
Module: metainformant.rna.orchestration
Purpose: Discover all species configuration files in config directory.
Returns: Dictionary mapping species names to config dictionaries
def run_workflow_for_species(
    config_path: Path,
    steps: Sequence[str] | None = None,
    *,
    check: bool = False,
) -> dict[str, Any]
Module: metainformant.rna.orchestration
Purpose: Execute workflow steps for a single species.
Parameters:
- config_path: Path to species workflow config file
- steps: List of steps to run (default: all steps)
- check: If True, stop on first failure
Returns: Dictionary with success, completed, failed, return_codes keys
def check_workflow_status(
    config_path: Path,
    *,
    detailed: bool = False,
) -> dict[str, Any]
Module: metainformant.rna.orchestration
Purpose: Check workflow status for a species. Unified interface that delegates to monitoring functions.
Parameters:
- config_path: Path to species workflow config file
- detailed: If True, return detailed analysis via analyze_species_status(); if False, return progress summary via check_workflow_progress()
Returns: Dictionary with status information (format depends on detailed parameter)
Note: This is a convenience wrapper. For direct access:
- Use check_workflow_progress() for progress summary
- Use analyze_species_status() for detailed analysis
def cleanup_unquantified_samples(
    config_path: Path,
    *,
    log_dir: Path | None = None,
) -> tuple[int, int]
Module: metainformant.rna.orchestration
Purpose: Quantify downloaded samples and clean up their FASTQ files. Finds all samples that have FASTQ files but no quantification results, quantifies each one, and deletes its FASTQ files after successful quantification.
Parameters:
- config_path: Path to species workflow config file
- log_dir: Optional log directory
Returns: Tuple of (quantified_count, failed_count)
def monitor_workflows(
    species_configs: dict[str, Path],
    watch_interval: int = 60,
) -> None
Module: metainformant.rna.orchestration
Purpose: Real-time monitoring of multiple species workflows.
Parameters:
- species_configs: Dictionary mapping species_id -> config_path
- watch_interval: Update interval in seconds (default: 60)
Note: This function runs continuously until interrupted (Ctrl+C). It displays a real-time dashboard showing progress for all monitored species.
Core utilities for CLI interaction and parameter handling.
def check_cli_available() -> tuple[bool, str]
Module: metainformant.rna.amalgkit
Purpose: Check if amalgkit CLI is available on PATH.
Returns: Tuple of (is_available: bool, help_text_or_error: str)
def ensure_cli_available(
    *,
    auto_install: bool = False
) -> tuple[bool, str, dict | None]
Module: metainformant.rna.amalgkit
Purpose: Ensure amalgkit CLI is available, optionally attempting auto-install.
Returns: Tuple of (ok: bool, message: str, install_record: dict | None)
def build_cli_args(
    params: AmalgkitParams | None,
    *,
    for_cli: bool = False
) -> list[str]
Module: metainformant.rna.amalgkit
Purpose: Convert parameter dictionary to CLI argument list.
Parameters:
- params: Parameter mapping
- for_cli: If True, use snake_case flags (for actual CLI); if False, use kebab-case (for display)
Returns: List of CLI argument strings
def build_amalgkit_command(
    subcommand: str,
    params: AmalgkitParams | None = None
) -> list[str]
Module: metainformant.rna.amalgkit
Purpose: Build complete amalgkit command as token list.
Example: build_amalgkit_command("metadata", {"threads": 8}) → ["amalgkit", "metadata", "--threads", "8"]
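The mapping from a parameter dictionary to a token list can be sketched as below. `sketch_build_command` is a hypothetical stand-in that reproduces only the documented example; how the real function treats booleans, lists, or None values is not described here, so skipping None is an assumption.

```python
from typing import Any, Mapping, Optional


def sketch_build_command(subcommand: str,
                         params: Optional[Mapping[str, Any]] = None) -> list:
    """Hypothetical stand-in: turn {"key": value} pairs into ["--key", "value"] tokens."""
    tokens = ["amalgkit", subcommand]
    for key, value in (params or {}).items():
        if value is None:
            continue  # assumption: unset parameters are omitted from the command
        tokens.extend([f"--{key}", str(value)])
    return tokens


print(sketch_build_command("metadata", {"threads": 8}))
```

Returning a token list rather than a single string lets the command be handed to subprocess without shell quoting.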
def run_amalgkit(
    subcommand: str,
    params: AmalgkitParams | None = None,
    *,
    work_dir: str | Path | None = None,
    env: Mapping[str, str] | None = None,
    check: bool = False,
    capture_output: bool = True,
    log_dir: str | Path | None = None,
    step_name: str | None = None,
) -> subprocess.CompletedProcess[str]
Module: metainformant.rna.amalgkit
Purpose: Execute amalgkit subcommand with optional logging and directory management.
Parameters:
- subcommand: Amalgkit subcommand name (e.g., "metadata", "quant")
- params: Parameter dictionary
- work_dir: Working directory (created if missing)
- env: Additional environment variables
- check: Raise exception on non-zero exit
- capture_output: Capture stdout/stderr
- log_dir: Directory for timestamped log files
- step_name: Optional label for log filenames
Returns: subprocess.CompletedProcess[str] with execution results
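The directory and logging behaviour described above might look roughly like this. It is a sketch under stated assumptions: `sketch_run` is a hypothetical name, the log filename pattern is invented for illustration, and the Python interpreter stands in for the amalgkit executable so the example runs anywhere.

```python
import subprocess
import sys
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def sketch_run(cmd: list, *, work_dir: Optional[Path] = None,
               log_dir: Optional[Path] = None, step_name: Optional[str] = None,
               check: bool = False) -> "subprocess.CompletedProcess[str]":
    """Hypothetical wrapper: create directories, run the command, write logs."""
    if work_dir is not None:
        work_dir.mkdir(parents=True, exist_ok=True)  # created if missing
    result = subprocess.run(cmd, cwd=work_dir, capture_output=True,
                            text=True, check=check)
    if log_dir is not None:
        log_dir.mkdir(parents=True, exist_ok=True)
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        label = step_name or "step"
        # Assumed naming scheme; the real log filenames may differ.
        (log_dir / f"{stamp}.{label}.stdout.log").write_text(result.stdout)
        (log_dir / f"{stamp}.{label}.stderr.log").write_text(result.stderr)
    return result


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    result = sketch_run([sys.executable, "-c", "print('ok')"],
                        work_dir=base / "work", log_dir=base / "logs",
                        step_name="demo")
    log_names = sorted(p.name for p in (base / "logs").iterdir())

print(result.returncode, result.stdout.strip(), log_names)
```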
Functions for sample-level processing pipelines.
def quantify_sample(
    sample_id: str,
    metadata_rows: list[dict[str, Any]],
    quant_params: Mapping[str, Any],
    *,
    log_dir: Path | None = None,
    step_name: str | None = None,
) -> tuple[bool, str, Path | None]
Module: metainformant.rna.engine.workflow_steps
Purpose: Quantify a single sample using amalgkit quant.
Returns: Tuple of (success: bool, message: str, abundance_file: Path | None)
def convert_sra_to_fastq(
    sample_id: str,
    sra_file: Path,
    output_dir: Path,
    *,
    threads: int = 4,
    log_dir: Path | None = None,
) -> tuple[bool, str, list[Path]]
Module: metainformant.rna.engine.sra_extraction
Purpose: Convert a local SRA file to FASTQ format. Prefers parallel-fastq-dump (works better with local files) and falls back to fasterq-dump if needed. Automatically compresses output FASTQ files.
Parameters:
- sample_id: SRA accession ID (e.g., "SRR1234567")
- sra_file: Path to the SRA file
- output_dir: Directory where FASTQ files should be written
- threads: Number of threads for conversion (default: 4)
- log_dir: Optional directory for log files
Returns: Tuple of (success: bool, message: str, fastq_files: list[Path]). fastq_files contains paths to created FASTQ files (may be empty if failed).
Notes:
- Automatically detects and uses the real fasterq-dump binary (not wrapper scripts)
- Passes --size-check off to fasterq-dump to prevent "disk-limit exceeded" errors
- Automatically compresses output using pigz or gzip
- Checks for existing FASTQ files before conversion
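The preference order between the two converters can be sketched as a small command builder. This is illustrative only: `sketch_conversion_cmd` is a hypothetical name, the exact flag set passed by the module is not documented here (beyond --size-check off), and the `which` argument exists just so the lookup can be simulated without the SRA Toolkit installed.

```python
import shutil
from pathlib import Path
from typing import Callable, Optional


def sketch_conversion_cmd(sra_file: Path, output_dir: Path, threads: int = 4,
                          which: Callable[[str], Optional[str]] = shutil.which) -> list:
    """Hypothetical helper: prefer parallel-fastq-dump, fall back to fasterq-dump."""
    if which("parallel-fastq-dump"):
        # Assumed flag set; --split-files is forwarded to fastq-dump.
        return ["parallel-fastq-dump", "--sra-id", str(sra_file),
                "--threads", str(threads), "--outdir", str(output_dir),
                "--split-files"]
    if which("fasterq-dump"):
        # --size-check off skips the free-disk-space estimate.
        return ["fasterq-dump", str(sra_file), "--threads", str(threads),
                "--outdir", str(output_dir), "--size-check", "off"]
    raise FileNotFoundError("neither parallel-fastq-dump nor fasterq-dump is on PATH")


def only_fasterq(name: str) -> Optional[str]:
    """Simulated PATH lookup where only fasterq-dump is installed."""
    return "/usr/bin/fasterq-dump" if name == "fasterq-dump" else None


fallback_cmd = sketch_conversion_cmd(Path("SRR1234567.sra"), Path("fastq"),
                                     which=only_fasterq)
print(fallback_cmd)
```

Compression with pigz or gzip would be a separate step after the conversion finishes.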
def delete_sample_fastqs(
    sample_id: str,
    fastq_dir: Path
) -> None
Module: metainformant.rna.engine.sra_extraction
Purpose: Delete FASTQ files for a specific sample.
def run_download_quant_workflow(
    metadata_path: str | Path,
    getfastq_params: Mapping[str, Any] | None = None,
    quant_params: Mapping[str, Any] | None = None,
    *,
    work_dir: str | Path | None = None,
    log_dir: str | Path | None = None,
    num_workers: int = 1,
    max_samples: int | None = None,
    skip_completed: bool = True,
    progress_monitor: DownloadProgressMonitor | None = None,
) -> dict[str, Any]
Module: metainformant.rna.engine.pipeline
Purpose: Unified function for download-quantify-delete workflows. Supports both sequential and parallel processing modes.
Parameters:
- metadata_path: Path to metadata TSV with sample list
- getfastq_params: Parameters for amalgkit getfastq step
- quant_params: Parameters for amalgkit quant step
- work_dir: Working directory for amalgkit commands
- log_dir: Directory for step logs
- num_workers: Number of parallel download workers (default: 1 for sequential mode)
  - num_workers=1: Sequential mode (one sample at a time, maximum disk efficiency)
  - num_workers>1: Parallel mode (N downloads in parallel, sequential quantification)
- max_samples: Optional limit on number of samples to process
- skip_completed: If True, skip samples that are already quantified (sequential mode only)
- progress_monitor: Optional progress monitor for tracking downloads
Returns: Dictionary with processing statistics:
- total_samples: Total number of samples
- processed: Number of samples successfully processed
- skipped: Number of samples skipped (already done)
- failed: Number of samples that failed
- failed_runs: List of run IDs that failed
Processing Modes:
- Sequential (num_workers=1): Process one sample at a time. Maximum disk efficiency: only one sample's FASTQs exist at a time.
- Parallel (num_workers>1): Multiple download workers fetch FASTQ files in parallel, then a single quantification worker processes them sequentially. Maximizes throughput while preventing disk exhaustion.
See Also: Architecture Documentation
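The two processing modes can be illustrated with a thread pool for downloads and a single consumer for quantification. Everything here is a stand-in: `fake_download`, `fake_quantify`, and `sketch_download_quant` are hypothetical, no files are touched, and the real pipeline's scheduling may differ. Only the shape of the returned statistics follows the description above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(sample_id: str) -> str:
    """Stand-in download; IDs starting with 'BAD' simulate a failed run."""
    if sample_id.startswith("BAD"):
        raise RuntimeError(f"download failed for {sample_id}")
    return f"{sample_id}.fastq"


def fake_quantify(fastq_name: str) -> str:
    """Stand-in quantification; a real pipeline would delete the FASTQ afterwards."""
    return fastq_name.replace(".fastq", "_abundance.tsv")


def sketch_download_quant(sample_ids: list, num_workers: int = 1) -> dict:
    stats = {"total_samples": len(sample_ids), "processed": 0,
             "skipped": 0, "failed": 0, "failed_runs": []}

    def finish(sample_id, get_fastq):
        # Quantify one sample and record the outcome.
        try:
            fake_quantify(get_fastq())
            stats["processed"] += 1
        except Exception:
            stats["failed"] += 1
            stats["failed_runs"].append(sample_id)

    if num_workers <= 1:
        # Sequential mode: download and quantify one sample at a time.
        for sample_id in sample_ids:
            finish(sample_id, lambda: fake_download(sample_id))
    else:
        # Parallel mode: downloads run in worker threads, while quantification
        # stays in this single thread as each download completes.
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            futures = {pool.submit(fake_download, s): s for s in sample_ids}
            for future in as_completed(futures):
                finish(futures[future], future.result)
    return stats


samples = ["SRR001", "SRR002", "BAD001", "SRR003"]
sequential = sketch_download_quant(samples, num_workers=1)
parallel = sketch_download_quant(samples, num_workers=3)
print(sequential)
print(parallel)
```

Keeping quantification in one thread is what bounds disk usage in this sketch: downloads can run ahead, but FASTQ files are consumed one at a time.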
AmalgkitParams = Mapping[str, Any]
Module: metainformant.rna.amalgkit
Purpose: Type alias for amalgkit parameter dictionaries.
Functions for tracking workflow progress and sample status.
def count_quantified_samples(config_path: Path) -> tuple[int, int]
Module: metainformant.rna.monitoring
Purpose: Count quantified and total samples for a species.
Returns: Tuple of (quantified_count, total_count)
def get_sample_status(config_path: Path, sample_id: str) -> dict[str, Any]
Module: metainformant.rna.monitoring
Purpose: Get detailed status for a single sample.
Returns: Dictionary with status information:
- quantified: bool
- has_fastq: bool
- has_sra: bool
- is_downloading: bool
- status: str ("quantified", "downloading", "has_fastq", "has_sra", "undownloaded")
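One way the boolean flags could collapse into the single status string is a simple priority chain. The priority used below (the order in which the status values are listed) is an assumption, and `sketch_status` is a hypothetical helper rather than the module's logic.

```python
def sketch_status(quantified: bool, is_downloading: bool,
                  has_fastq: bool, has_sra: bool) -> str:
    """Hypothetical resolver: the first matching flag wins, in the listed order."""
    if quantified:
        return "quantified"
    if is_downloading:
        return "downloading"
    if has_fastq:
        return "has_fastq"
    if has_sra:
        return "has_sra"
    return "undownloaded"


# A sample with leftover FASTQ files that has already been quantified.
print(sketch_status(quantified=True, is_downloading=False, has_fastq=True, has_sra=False))
# A sample with only the raw SRA file on disk.
print(sketch_status(quantified=False, is_downloading=False, has_fastq=False, has_sra=True))
```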
def analyze_species_status(config_path: Path) -> dict[str, Any]
Module: metainformant.rna.monitoring
Purpose: Comprehensive analysis of species workflow status.
Returns: Dictionary with comprehensive status information:
- total_in_metadata: int
- quantified: int
- quantified_and_deleted: int
- quantified_not_deleted: int
- downloading: int
- failed_download: int
- undownloaded: int
- categories: dict mapping category -> list of sample_ids
def find_unquantified_samples(config_path: Path) -> list[str]
Module: metainformant.rna.monitoring
Purpose: Find all unquantified samples.
Returns: List of sample IDs that are not quantified
def check_active_downloads() -> set[str]
Module: metainformant.rna.monitoring
Purpose: Check for samples currently being downloaded.
Returns: Set of sample IDs that are actively downloading
def check_workflow_progress(config_path: Path) -> dict[str, Any]
Module: metainformant.rna.monitoring
Purpose: Get workflow progress summary for a species.
Returns: Dictionary with progress information:
- quantified: int
- total: int
- percentage: float
- remaining: int
- downloading: int (number of samples currently downloading)
- has_files: int (number of samples with downloaded files but not quantified)
Note: This is called by check_workflow_status() when detailed=False. Use check_workflow_status() for the unified interface.
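The derived fields in the progress summary follow from the two counts. The sketch below assumes percentage is expressed on a 0-100 scale and that remaining is total minus quantified; both are assumptions, as is the hypothetical name `sketch_progress`.

```python
def sketch_progress(quantified: int, total: int) -> dict:
    """Hypothetical summary: derive percentage and remaining from the counts."""
    # Guard against an empty metadata table to avoid dividing by zero.
    percentage = (quantified / total * 100.0) if total else 0.0
    return {
        "quantified": quantified,
        "total": total,
        "percentage": percentage,
        "remaining": total - quantified,
    }


summary = sketch_progress(45, 60)
print(summary)
```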
def assess_all_species_progress(
    config_dir: Path,
    *,
    repo_root: Path | None = None,
) -> dict[str, dict[str, Any]]
Module: metainformant.rna.monitoring
Purpose: Assess progress for all species in config directory.
Returns: Dictionary mapping species_id -> progress information
def initialize_progress_tracking(
    config_path: Path,
    *,
    tracker=None,
) -> dict[str, Any]
Module: metainformant.rna.monitoring
Purpose: Initialize progress tracking for a species.
Returns: Dictionary with initialization results
Functions for checking tool availability and environment validation.
def check_amalgkit() -> tuple[bool, str]
Module: metainformant.rna.environment
Purpose: Check if amalgkit is available and get version.
Returns: Tuple of (is_available: bool, message: str)
def check_sra_toolkit() -> tuple[bool, str]
Module: metainformant.rna.environment
Purpose: Check if SRA Toolkit is installed.
Returns: Tuple of (is_available: bool, message: str)
def check_kallisto() -> tuple[bool, str]
Module: metainformant.rna.environment
Purpose: Check if kallisto is installed.
Returns: Tuple of (is_available: bool, message: str)
def check_metainformant() -> tuple[bool, str]
Module: metainformant.rna.environment
Purpose: Check if metainformant package is installed.
Returns: Tuple of (is_available: bool, message: str)
def check_virtual_env() -> tuple[bool, str]
Module: metainformant.rna.environment
Purpose: Check if running inside a virtual environment.
Returns: Tuple of (is_in_venv: bool, message: str)
def check_rscript() -> tuple[bool, str]
Module: metainformant.rna.environment
Purpose: Check if Rscript is available.
Returns: Tuple of (is_available: bool, message: str)
def check_dependencies() -> dict[str, tuple[bool, str]]
Module: metainformant.rna.environment
Purpose: Check all required dependencies for RNA-seq workflows.
Returns: Dictionary mapping dependency name -> (is_available: bool, message: str)
def validate_environment() -> dict[str, Any]
Module: metainformant.rna.environment
Purpose: Comprehensive environment validation.
Returns: Dictionary with validation results:
- all_passed: bool
- dependencies: dict mapping name -> (is_available, message)
- recommendations: list of strings with recommendations
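A report of this shape can be assembled from simple PATH lookups, as the sketch below shows. `sketch_validate` is hypothetical, it checks only whether an executable is on PATH (the real checks may also query versions), and the message and recommendation wording is invented for illustration.

```python
import shutil
from typing import Callable, Optional


def sketch_validate(tools: list,
                    which: Callable[[str], Optional[str]] = shutil.which) -> dict:
    """Hypothetical validator: look each tool up on PATH and aggregate results."""
    dependencies = {}
    recommendations = []
    for tool in tools:
        path = which(tool)
        if path:
            dependencies[tool] = (True, f"found at {path}")
        else:
            dependencies[tool] = (False, "not found on PATH")
            recommendations.append(f"Install {tool} and make sure it is on PATH")
    return {
        "all_passed": all(ok for ok, _ in dependencies.values()),
        "dependencies": dependencies,
        "recommendations": recommendations,
    }


report = sketch_validate(["amalgkit", "kallisto", "Rscript"])
for name, (ok, message) in report["dependencies"].items():
    print(f"{name}: {'OK' if ok else 'MISSING'} ({message})")
```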
Functions for cleaning up partial downloads and fixing file naming issues.
def cleanup_partial_downloads(
    config_path: Path,
    *,
    dry_run: bool = False,
) -> dict[str, Any]
Module: metainformant.rna.cleanup
Purpose: Clean up partial downloads for a species.
Parameters:
- config_path: Path to species workflow config file
- dry_run: If True, only report what would be deleted
Returns: Dictionary with deleted, freed_mb, errors keys
def fix_abundance_naming(quant_dir: Path, sample_id: str) -> bool
Module: metainformant.rna.cleanup
Purpose: Create symlink from abundance.tsv to {SRR}_abundance.tsv for amalgkit merge.
Parameters:
- quant_dir: Directory containing quantification results
- sample_id: Sample ID (e.g., 'SRR1234567')
Returns: True if symlink was created or already exists
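The naming fix amounts to creating one symlink per sample directory. In the sketch below the link is named {sample_id}_abundance.tsv and points at the existing abundance.tsv; that direction, the use of a relative link, and the name `sketch_fix_naming` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def sketch_fix_naming(quant_dir: Path, sample_id: str) -> bool:
    """Hypothetical helper: expose abundance.tsv under a sample-prefixed name."""
    source = quant_dir / "abundance.tsv"
    link = quant_dir / f"{sample_id}_abundance.tsv"
    if link.exists():
        return True           # already present, nothing to do
    if not source.exists():
        return False          # no quantification output to link to
    # Relative target, so the link survives moving the whole directory.
    link.symlink_to(source.name)
    return True


with tempfile.TemporaryDirectory() as tmp:
    quant_dir = Path(tmp)
    (quant_dir / "abundance.tsv").write_text("target_id\tlength\n")
    created = sketch_fix_naming(quant_dir, "SRR1234567")
    repeated = sketch_fix_naming(quant_dir, "SRR1234567")
    is_link = (quant_dir / "SRR1234567_abundance.tsv").is_symlink()
    missing = sketch_fix_naming(quant_dir / "absent", "SRR0000001")

print(created, repeated, is_link, missing)
```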
def fix_abundance_naming_for_species(
    config_path: Path,
) -> tuple[int, int]
Module: metainformant.rna.cleanup
Purpose: Fix abundance naming for all samples in a species.
Returns: Tuple of (created_count, already_exists_count)
Functions for RNA-protein integration, translation efficiency analysis, and ribosome profiling.
def calculate_translation_efficiency(
    rna_df: pd.DataFrame,
    protein_df: pd.DataFrame,
    method: str = "ratio",
) -> pd.DataFrame
Module: metainformant.rna.analysis.protein_integration
Purpose: Calculate translation efficiency from RNA and protein levels.
Parameters:
- rna_df: RNA expression data (samples × genes)
- protein_df: Protein abundance data (samples × genes)
- method: Calculation method ("ratio" or "correlation")
Returns: DataFrame with columns gene_id, efficiency, method
def predict_protein_abundance_from_rna(
    rna_df: pd.DataFrame,
    training_rna: pd.DataFrame | None = None,
    training_protein: pd.DataFrame | None = None,
    method: str = "linear",
) -> pd.DataFrame
Module: metainformant.rna.analysis.protein_integration
Purpose: Predict protein abundance levels from RNA expression data.
Parameters:
- rna_df: RNA expression data for prediction
- training_rna: Optional RNA data from training set
- training_protein: Optional protein data from training set
- method: Prediction method ("linear" or "lognormal")
Returns: DataFrame with predicted protein abundance (same shape as rna_df)
def ribosome_profiling_integration(
    rna_df: pd.DataFrame,
    ribo_df: pd.DataFrame,
) -> pd.DataFrame
Module: metainformant.rna.analysis.protein_integration
Purpose: Integrate ribosome profiling data with RNA-seq to identify translationally regulated genes.
Parameters:
- rna_df: RNA-seq expression data (samples × genes)
- ribo_df: Ribosome profiling data (samples × genes)
Returns: DataFrame with columns:
- gene_id: Gene identifier
- rna_level: Mean RNA expression
- ribo_level: Mean ribosome occupancy
- translation_rate: Ribo/RNA ratio
- translationally_regulated: Boolean indicating significant regulation
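The first four output columns can be derived with per-gene means and a ratio. To stay dependency-free, the sketch below uses plain dictionaries (gene -> per-sample values) instead of DataFrames, and `sketch_translation_rates` is a hypothetical name. The translationally_regulated flag is left out because the significance criterion is not described here.

```python
from statistics import mean


def sketch_translation_rates(rna: dict, ribo: dict) -> list:
    """Hypothetical helper: mean RNA level, mean ribosome level, and their ratio."""
    rows = []
    for gene_id in sorted(set(rna) & set(ribo)):  # genes present in both inputs
        rna_level = mean(rna[gene_id])
        ribo_level = mean(ribo[gene_id])
        # Avoid dividing by zero for genes with no detectable RNA.
        rate = ribo_level / rna_level if rna_level > 0 else float("nan")
        rows.append({"gene_id": gene_id, "rna_level": rna_level,
                     "ribo_level": ribo_level, "translation_rate": rate})
    return rows


rna = {"geneA": [10.0, 30.0], "geneB": [5.0, 5.0], "geneC": [1.0, 1.0]}
ribo = {"geneA": [40.0, 40.0], "geneB": [1.0, 4.0]}
rows = sketch_translation_rates(rna, ribo)
for row in rows:
    print(row)
```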
Functions for validating RNA-seq pipeline outputs and sample quality.
def validate_all_samples(config: AmalgkitWorkflowConfig) -> dict[str, Any]
Module: metainformant.rna.analysis.validation
Purpose: Validate all samples in a workflow for pipeline completion.
Parameters:
config: Workflow configuration
Returns: Dictionary with validation results including per-sample status and summary statistics
Functions for discovering species with RNA-seq data and generating configurations.
def search_species_with_rnaseq(
    search_query: str,
    *,
    max_records: int = 10000,
) -> dict[str, dict[str, Any]]
Module: metainformant.rna.discovery
Purpose: Search NCBI SRA for species with RNA-seq data.
Parameters:
- search_query: NCBI Entrez search query
- max_records: Maximum number of records to retrieve
Returns: Dictionary mapping species names to metadata
Raises: ImportError if Biopython is not available
def get_genome_info(taxonomy_id: str, species_name: str) -> dict[str, Any] | None
Module: metainformant.rna.discovery
Purpose: Get genome assembly information for a species.
Parameters:
- taxonomy_id: NCBI taxonomy ID
- species_name: Scientific name
Returns: Genome information dictionary or None
def generate_config_yaml(
    species_name: str,
    species_data: dict[str, Any],
    genome_info: dict[str, Any] | None = None,
    *,
    repo_root: Path | None = None,
) -> str
Module: metainformant.rna.discovery
Purpose: Generate amalgkit YAML configuration for a species.
Parameters:
- species_name: Scientific name
- species_data: RNA-seq metadata
- genome_info: Genome assembly metadata (optional)
- repo_root: Repository root directory for paths (optional)
Returns: YAML configuration string
- Function Index - Quick reference table
- Step Documentation - Detailed step guides
- Workflow Guide - Workflow planning and execution
- Configuration Guide - Configuration management