PuckerLab/CYP_Annotator

 
 


CYP-Annotator

Overview

CYP-Annotator is a Python pipeline for the automatic functional annotation of cytochrome P450 (CYP) sequences. It processes "Subject" datasets (genomes or transcriptomes) and identifies CYP candidates by comparing them against a curated set of "Bait" (reference) sequences, which are available in the repository's Data folder (https://github.com/k-georgi/CYP_Annotator/tree/main/Data).

The workflow performs the following steps:

  1. Candidate Search: Identifies initial candidates using BLASTp (default) or HMMER (--use_hmmer y).
  2. Phylogenetic Classification: Classifies candidates into "Ingroup" (putative CYPs) and "Outgroup" based on their phylogenetic proximity to the bait and outgroup references.
  3. Ortholog Assignment: Assigns ingroup candidates to known CYP families and subfamilies by calculating patristic distances (phylogenetic tree distance) to the reference baits.
  4. Domain Analysis: Screens candidates for the presence of conserved CYP domains using an HMM profile (--hmm_domains).
  5. Motif Analysis: Checks for conserved protein motifs via static definitions (--protein_motifs) and HMM profiles (--hmm_motifs).
  6. Expression Analysis: If expression data are provided (--expression), they are used to filter out lowly expressed genes and potential pseudogenes (--min_avg_tpm).
  7. Paralog Collapsing: Optionally identifies and collapses paralog groups (--collapse y), retaining a single representative sequence.
  8. Summary Generation: Compiles all findings into a comprehensive summary table (08_summary.txt).
  9. Export: Generates tree visualizations (if ete3 is installed) and an interactive HTML overview of all results.
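
Step 3 relies on patristic distances, that is, the sum of branch lengths along the path connecting two leaves of a tree. The following is a minimal pure-Python sketch of that idea on a made-up toy tree. The tree, the sequence names, and the helper functions are hypothetical and only illustrate the concept; the pipeline lists dendropy as a dependency and may compute these distances differently.

```python
# Toy tree stored as child -> (parent, branch length).
# All node names and branch lengths are invented for illustration.
TREE = {
    "bait_CYP71A1": ("n1", 0.4),
    "candidate_1": ("n1", 0.3),
    "n1": ("root", 0.5),
    "bait_CYP51G1": ("root", 1.2),
}


def path_to_root(node, tree):
    """Map each ancestor of `node` (and node itself) to its distance from `node`."""
    distance = 0.0
    path = {node: 0.0}
    while node in tree:
        parent, length = tree[node]
        distance += length
        path[parent] = distance
        node = parent
    return path


def patristic_distance(a, b, tree):
    """Sum of branch lengths on the path between leaves `a` and `b`."""
    up_a = path_to_root(a, tree)
    up_b = path_to_root(b, tree)
    # Among all shared ancestors, the most recent one gives the shortest path.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def nearest_bait(candidate, baits, tree):
    """Return the bait with the smallest patristic distance to `candidate`."""
    return min(baits, key=lambda bait: patristic_distance(candidate, bait, tree))


if __name__ == "__main__":
    baits = ["bait_CYP71A1", "bait_CYP51G1"]
    print(nearest_bait("candidate_1", baits, TREE))
```

In this toy tree the candidate sits next to bait_CYP71A1, so that bait is returned as the closest reference.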

This tool adapts functions and logic from the MYB_annotator (doi: 10.1186/s12864-022-08452-5) and bHLH_annotator (doi: 10.1186/s12864-023-09877-2).

Setup

Installation in a conda environment

A straightforward way to install the dependencies is by creating a conda environment using the provided environment.yml file:

git clone https://github.com/k-georgi/CYP_Annotator
cd CYP_Annotator
conda env create -f environment.yml
conda activate cyp_annotator

Installation of the dependencies

The following dependencies are necessary for the execution of the pipeline:

  • Python 3:
    • dendropy: pip install dendropy
    • pandas: pip install pandas
    • matplotlib: pip install matplotlib
    • seaborn: pip install seaborn
    • ete3: pip install ete3 (Required for tree visualization)
  • BLAST+: sudo apt install ncbi-blast+
  • HMMER: conda install -c bioconda hmmer
  • MAFFT: sudo apt install mafft
  • MUSCLE: (Installation recommended)
  • FastTree: sudo apt-get install -y fasttree
  • RAxML-NG: (Precompiled binaries recommended)
  • IQ-TREE: (Precompiled binaries recommended)

The CYP_Annotator can be cloned from GitHub:

git clone https://github.com/k-georgi/CYP_Annotator
cd CYP_Annotator
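
After a manual installation it can help to confirm that the external tools are reachable before starting a run. The sketch below is not part of CYP_Annotator; it assumes the default command names listed under "Optional arguments for tool adjustments" and only checks whether each one is found on PATH. The function name find_missing is hypothetical.

```python
import shutil

# Default command names as given in the tool adjustment options of this README.
DEFAULT_COMMANDS = [
    "blastp", "makeblastdb", "hmmsearch",
    "mafft", "muscle",
    "fasttree", "raxml-ng", "iqtree",
]


def find_missing(commands):
    """Return the commands that cannot be located on PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    missing = find_missing(DEFAULT_COMMANDS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All default commands were found.")
```

Only one aligner (MAFFT or MUSCLE) and one tree builder (FastTree, RAxML-NG, or IQ-TREE) is needed, so a missing entry is a problem only if it is the tool selected for the run.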

Usage

The pipeline is executed via the command line, using exactly one of the following input methods:

# Method 1: Using a data folder
python3 CYP_Annotator.py --data <PATH_TO_DATA_FOLDER>

# Method 2: Using explicit paths
python3 CYP_Annotator.py --baits <PATH> --subject <PATH> --baits_info <PATH>

# Method 3: Using a file collection CSV
python3 CYP_Annotator.py --file_collection <PATH_TO_FILE_COLLECTION_CSV>
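
The "exactly one input method" rule can be expressed as a small validation step. This is a sketch under assumptions, not the parser used by CYP_Annotator.py: the flag names come from the commands above, while build_parser and input_method are hypothetical helpers that take the rule literally.

```python
import argparse


def build_parser():
    """Parser with only the input-related flags named in this README."""
    parser = argparse.ArgumentParser(prog="CYP_Annotator.py")
    parser.add_argument("--data")
    parser.add_argument("--baits")
    parser.add_argument("--subject")
    parser.add_argument("--baits_info")
    parser.add_argument("--file_collection")
    return parser


def input_method(args):
    """Return 'data', 'explicit' or 'file_collection'; raise if not exactly one is used."""
    explicit = [args.baits, args.subject, args.baits_info]
    used = {
        "data": args.data is not None,
        "explicit": any(path is not None for path in explicit),
        "file_collection": args.file_collection is not None,
    }
    chosen = [name for name, flag in used.items() if flag]
    if len(chosen) != 1:
        raise ValueError("use exactly one input method, got: %s" % (chosen or "none"))
    if chosen[0] == "explicit" and any(path is None for path in explicit):
        raise ValueError("--baits, --subject and --baits_info must be given together")
    return chosen[0]


if __name__ == "__main__":
    demo = build_parser().parse_args(["--data", "my_data_folder"])
    print(input_method(demo))
```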

Optional arguments

Optional arguments regarding input/output

| Command | Description | Default |
| --- | --- | --- |
| --outgroup <PATH> | Path to outgroup FASTA file (optional) | None |
| --hmm <PATH> | Path to bait HMM file | None |
| --hmm_domains <PATH> | Path to HMM domains file (optional) | None |
| --hmm_motifs <PATH> | Path to HMM motifs file (optional) | None |
| --protein_motifs <PATH> | Path to protein motifs file (optional) | None |
| --expression <PATH> | Path to expression matrix file (optional) | None |
| --metadata <PATH> | Path to expression metadata file (optional) | None |
| --output_folder <STR> | Name of the output folder (optional) | CYP_Annotator_Output |
| --processed_input_folder <STR> | Name of the processed input folder (optional) | None |
| --name <STR> | String used as prefix in filenames | "" |
| --trim_names <y/n> | Trim sequence names at first space or tab (y/n) | y |

Optional arguments for tool adjustments

| Command | Description | Default |
| --- | --- | --- |
| --mode_aln <STR> | Tool used for multiple alignments (mafft/muscle) | mafft |
| --mode_tree <STR> | Tool used for tree construction (fasttree/raxml/iqtree) | fasttree |
| --mafft <STR> | MAFFT command | mafft |
| --muscle <STR> | MUSCLE command | muscle |
| --fasttree <STR> | FastTree command | fasttree |
| --raxml <STR> | RAxML-NG command | raxml-ng |
| --iqtree <STR> | IQ-TREE command | iqtree |
| --blastp <STR> | Path to and including the blastp binary | blastp |
| --makeblastdb <STR> | Path to and including the makeblastdb binary | makeblastdb |
| --hmmsearch <STR> | Path to hmmsearch | hmmsearch |

Optional arguments for search and classification

| Command | Description | Default |
| --- | --- | --- |
| --use_hmmer <y/n> | Use HMMER for initial candidate search (y/n) | n |
| --simcutp <FLOAT> | BLASTp similarity cutoff | 40.00 |
| --poscutp <INT> | BLASTp cutoff for the number of possible hits per bait | 100 |
| --lencutp <INT> | BLASTp minimum length cutoff | 200 |
| --bitcutp <INT> | BLASTp bit score cutoff | 80 |
| --filterdomain <y/n> | Domain filter for classification (y/n) | n |
| --minscore <FLOAT> | Minimal score to be considered ingroup | 0.5 |
| --numneighbours <INT> | Number of neighbours for classification | 24 |
| --neighbourdist <FLOAT> | Neighbour distance | 5.0 |
| --minneighbours <INT> | Minimal number of neighbours | 0 |
Optional arguments for orthologs and paralogs

| Command | Description | Default |
| --- | --- | --- |
| --static_pd <y/n> | Ortholog assignment with static thresholds (y/n) | n |
| --threshold_factor <FLOAT> | Factor for adding deviation/IQR to mean/median in dynamic threshold calculation | 0.5 |
| --subfamily_threshold <FLOAT> | Threshold for patristic distance considering orthologs | 1.1 |
| --family_threshold <FLOAT> | Threshold for patristic distance considering further neighbours | 2.7 |
| --individual_tree <y/n> | Create individual tree with specific bait sequences (y/n) | n |
| --bait_column <STR> | Optional: column name for bait filtering | Evidence |
| --bait_keyword <STR> | Keyword for bait filtering | Literature |
| --ortholog_prefix <STR> | Prefix for ortholog filtering | All |
| --individual_ortholog_prefix <STR> | Prefix for individual ortholog filtering | None |
| --collapse <y/n> | Reduce in-paralogs to one representative | y |
| --paralogdist <FLOAT> | Distance of paralogs in masking step | 10.0 |
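
The threshold options can be read as a two-level decision on the patristic distance between a candidate and its closest bait. The sketch below shows one plausible interpretation and is not taken from the pipeline: it assumes a candidate is assigned at subfamily level up to --subfamily_threshold, at family level up to --family_threshold, and that the dynamic variant uses mean plus factor times standard deviation of some set of reference distances. Which distances the pipeline actually uses, and when it prefers median and IQR, is not stated here. Both function names are hypothetical.

```python
from statistics import mean, stdev


def dynamic_threshold(reference_distances, factor=0.5):
    """Mean plus `factor` times the sample standard deviation (assumed formula)."""
    return mean(reference_distances) + factor * stdev(reference_distances)


def assign_level(distance, subfamily_threshold=1.1, family_threshold=2.7):
    """Classify a candidate by its patristic distance to the nearest bait."""
    if distance <= subfamily_threshold:
        return "subfamily"
    if distance <= family_threshold:
        return "family"
    return "unassigned"


if __name__ == "__main__":
    # Static thresholds with the README defaults.
    for d in (0.7, 2.0, 3.5):
        print(d, assign_level(d))
    # Dynamic threshold from invented reference distances.
    print(dynamic_threshold([1.0, 2.0, 3.0]))
```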

Optional arguments for functional analysis (Domains, Motifs, Expression)

| Command | Description | Default |
| --- | --- | --- |
| --domain_Score <FLOAT> | c-Evalue for hmm domain integration | 100 |
| --motif_cEvalue <FLOAT> | c-Evalue for hmm motif integration | 0.01 |
| --min_avg_tpm <FLOAT> | Average TPM for genes to be considered expressed | 1.0 |
| --min_single_tpm <FLOAT> | Single TPM for genes to be considered expressed | 5.0 |
| --min_paralog_tpm <FLOAT> | Minimum TPM for paralog conservation | 1.0 |
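
As an illustration of the expression cutoffs, the sketch below marks a gene as expressed if its average TPM across samples reaches --min_avg_tpm or if any single sample reaches --min_single_tpm. Combining the two criteria with "or" is an assumption made for this example; the pipeline may combine them differently. The function name is_expressed is hypothetical.

```python
def is_expressed(tpm_values, min_avg_tpm=1.0, min_single_tpm=5.0):
    """Return True if the gene passes the average or the single-sample TPM cutoff."""
    if not tpm_values:
        return False  # no expression data for this gene
    average = sum(tpm_values) / len(tpm_values)
    return average >= min_avg_tpm or max(tpm_values) >= min_single_tpm


if __name__ == "__main__":
    # Invented TPM values per gene across samples.
    genes = {
        "gene_low": [0.0, 0.1, 0.2],
        "gene_steady": [2.0, 2.0, 2.0],
        "gene_one_sample": [6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    }
    for name, values in genes.items():
        print(name, is_expressed(values))
```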

Optional arguments regarding performance

| Command | Description | Default |
| --- | --- | --- |
| --cpu_max <INT> | Max CPUs | 4 |
| --parallel <y/n> | Run classification in parallel mode (y/n) | y |
| --num_process_candidates <INT> | Max number of candidates per ingroup/outgroup classification | 200 |

Adjustment of input data files

The pipeline relies on several key input files. You can provide them using one of the three methods described in the Usage section. All input files except the subject file, which must be provided by the user, can be found in the repository's Data folder (https://github.com/k-georgi/CYP_Annotator/tree/main/Data).

| File | Argument | Description | Format |
| --- | --- | --- | --- |
| Baits | --baits | FASTA file containing the reference (bait) sequences, including ingroup (CYPs) and outgroup sequences. | .fasta / .fa |
| Baits Info | --baits_info | Crucial CSV file. Must contain a header. The first column ID must match the FASTA headers in the baits file. Other columns, like Family, Subfamily, and Evidence, are required for annotation and filtering. | .csv |
| Subject | --subject | FASTA file(s) or folder(s) containing the sequences (genome, transcriptome) to be annotated. Can be CDS or PEP. | .fasta / .fa |
| Data Folder | --data | A single folder containing baits.fasta, baits_info.csv, and subject.fasta (or a subjects/ subdirectory). Optional files like hmm_domains.hmm can also be placed here. | Folder |
| File Collection | --file_collection | A CSV file specifying paths to all other inputs. Overrides all other path arguments if used. | .csv |
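
Because the first column of the baits info file must match the FASTA headers of the baits file, a quick consistency check before a run can save time. The sketch below is a standalone helper, not part of CYP_Annotator. It assumes headers are compared after trimming at the first space or tab (as with --trim_names y) and that Family, Subfamily, and Evidence are the required columns. The function names are hypothetical.

```python
import csv


def fasta_ids(lines, trim=True):
    """Collect sequence IDs from FASTA header lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            # Optionally cut the header at the first whitespace character.
            ids.add(header.split()[0] if trim and header else header)
    return ids


def check_baits_info(fasta_lines, csv_lines, required=("Family", "Subfamily", "Evidence")):
    """Compare bait FASTA IDs with the first column of the baits info CSV."""
    reader = csv.DictReader(csv_lines)
    fieldnames = reader.fieldnames or []
    id_column = fieldnames[0] if fieldnames else None
    info_ids = {row[id_column] for row in reader} if id_column else set()
    seq_ids = fasta_ids(fasta_lines)
    return {
        "missing_columns": [col for col in required if col not in fieldnames],
        "ids_without_info": sorted(seq_ids - info_ids),
        "ids_without_sequence": sorted(info_ids - seq_ids),
    }


if __name__ == "__main__":
    # Invented example content; in practice pass open file handles instead.
    fasta = [">CYP71A1 example description\n", "MAAA\n", ">CYP51G1\n", "MBBB\n"]
    info = ["ID,Family,Subfamily,Evidence\n", "CYP71A1,CYP71,CYP71A,Literature\n"]
    print(check_baits_info(fasta, info))
```

Outgroup sequences stored in the same baits file may legitimately lack annotation, so entries reported under ids_without_info are a prompt to look closer rather than a definite error.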

Requirements

Python3, dendropy, pandas, matplotlib, seaborn, ete3, BLAST+, HMMER, MAFFT or MUSCLE, FastTree or RAxML-NG or IQ-TREE

Reference

Georgi K. 'Development of a bioinformatics tool for automatic, functional annotation of plant cytochromes P450.'

Functions used in this script are in part taken from:
