CYP-Annotator is a Python pipeline designed for the automatic, functional annotation of Cytochrome P450 (CYP) sequences. It processes "Subject" datasets (genomes or transcriptomes) to identify CYP candidates by comparing them against a curated set of "Bait" (reference) sequences (https://github.com/k-georgi/CYP_Annotator/tree/main/Data).
The workflow performs the following steps:
- Candidate Search: Identifies initial candidates using BLASTp (default) or HMMER (`--use_hmmer y`).
- Phylogenetic Classification: Classifies candidates into "Ingroup" (putative CYPs) and "Outgroup" based on their phylogenetic proximity to the bait and outgroup references.
- Ortholog Assignment: Assigns ingroup candidates to known CYP families and subfamilies by calculating patristic distances (phylogenetic tree distance) to the reference baits.
- Domain Analysis: Screens candidates for the presence of conserved CYP domains using an HMM profile (`--hmm_domains`).
- Motif Analysis: Checks for conserved protein motifs via static definitions (`--protein_motifs`) and HMM profiles (`--hmm_motifs`).
- Expression Analysis: If expression data is provided (`--expression`), it is used to filter out low-expressed genes or potential pseudogenes (`--min_avg_tpm`).
- Paralog Collapsing: Optionally identifies and collapses paralog groups (`--collapse y`), retaining a single representative sequence.
- Summary Generation: Compiles all findings into a comprehensive summary table (`08_summary.txt`).
- Export: Generates tree visualizations (if `ete3` is installed) and an interactive HTML overview of all results.
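The Ortholog Assignment step relies on patristic distance, i.e. the sum of branch lengths along the path connecting two leaves of a tree. The following is a minimal, dependency-free sketch of that idea only. The toy tree, the sequence names, and the helpers `path_to_root` and `patristic_distance` are hypothetical and not taken from CYP_Annotator, which may compute these distances differently (for example via dendropy, which is listed as a dependency).

```python
# Toy rooted tree: child -> (parent, branch length to parent).
# All names and branch lengths are made up for illustration.
TREE = {
    "bait_CYP71A1": ("n1", 0.30),
    "candidate_1": ("n1", 0.25),
    "n1": ("root", 0.40),
    "bait_CYP51G1": ("root", 1.10),
}


def path_to_root(tree, leaf):
    """Map the leaf and each of its ancestors to the cumulative distance from the leaf."""
    dist = {leaf: 0.0}
    node, total = leaf, 0.0
    while node in tree:
        parent, length = tree[node]
        total += length
        dist[parent] = total
        node = parent
    return dist


def patristic_distance(tree, a, b):
    """Sum of branch lengths on the path between leaves a and b."""
    up_a = path_to_root(tree, a)
    up_b = path_to_root(tree, b)
    # Among shared ancestors, the most recent one gives the smallest sum
    # (branch lengths are assumed to be non-negative).
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)
```

In this toy tree, `candidate_1` is much closer to `bait_CYP71A1` than to `bait_CYP51G1`, which is the kind of signal the assignment step works with.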
This tool adapts functions and logic from the MYB_annotator (doi: 10.1186/s12864-022-08452-5) and bHLH_annotator (doi: 10.1186/s12864-023-09877-2).
### Installation in a conda environment
A straightforward way to install the dependencies is by creating a conda environment using the provided environment.yml file:

    git clone https://github.com/k-georgi/CYP_Annotator
    cd CYP_Annotator
    conda env create -f environment.yml
    conda activate cyp_annotator

The following dependencies are necessary for the execution of the pipeline:

- Python 3:
  - dendropy: `pip install dendropy`
  - pandas: `pip install pandas`
  - matplotlib: `pip install matplotlib`
  - seaborn: `pip install seaborn`
  - ete3: `pip install ete3` (Required for tree visualization)
- BLAST+: `sudo apt install ncbi-blast+`
- HMMER: `conda install -c bioconda hmmer`
- MAFFT: `sudo apt install mafft`
- MUSCLE: (Installation recommended)
- FastTree: `sudo apt-get install -y fasttree`
- RAxML-NG: (Precompiled binaries recommended)
- IQ-TREE: (Precompiled binaries recommended)

The CYP_Annotator can be cloned from GitHub:

    git clone https://github.com/k-georgi/CYP_Annotator
    cd CYP_Annotator

The pipeline is executed via the command line, using exactly one of the following input methods:

    # Method 1: Using a data folder
    python3 CYP_Annotator.py --data <PATH_TO_DATA_FOLDER>
    # Method 2: Using explicit paths
    python3 CYP_Annotator.py --baits <PATH> --subject <PATH> --baits_info <PATH>
    # Method 3: Using a file collection CSV
    python3 CYP_Annotator.py --file_collection <PATH_TO_FILE_COLLECTION_CSV>

| Command | Description | Default |
|---|---|---|
| `--outgroup <PATH>` | Path to outgroup FASTA file (optional) | None |
| `--hmm <PATH>` | Path to bait HMM file | None |
| `--hmm_domains <PATH>` | Path to hmm domains file (optional) | None |
| `--hmm_motifs <PATH>` | Path to hmm motifs file (optional) | None |
| `--protein_motifs <PATH>` | Path to protein motifs file (optional) | None |
| `--expression <PATH>` | Path to expression matrix file (optional) | None |
| `--metadata <PATH>` | Path to expression metadata file (optional) | None |
| `--output_folder <STR>` | Name of the output folder (optional) | CYP_Annotator_Output |
| `--processed_input_folder <STR>` | Name of the processed input folder (optional) | None |
| `--name <STR>` | String used as prefix in filenames | "" |
| `--trim_names <y/n>` | Trim sequence names at first space or tab (y/n) | y |
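
When the pipeline is launched from another script, the "exactly one input method" rule can be enforced before anything runs. The sketch below assembles an argument list from the flags documented above. The helper `build_command` and its parameter names are hypothetical and not part of CYP_Annotator; it only builds the list and does not start the pipeline.

```python
def build_command(data=None, explicit=None, file_collection=None, extra=None):
    """Assemble an argument list for CYP_Annotator.py.

    data:            path to a data folder (Method 1)
    explicit:        dict with the keys "baits", "subject", "baits_info" (Method 2)
    file_collection: path to a file collection CSV (Method 3)
    extra:           optional dict of further flags, e.g. {"--output_folder": "run1"}
    """
    chosen = [m for m in (data, explicit, file_collection) if m is not None]
    if len(chosen) != 1:
        raise ValueError("choose exactly one input method")

    cmd = ["python3", "CYP_Annotator.py"]
    if data is not None:
        cmd += ["--data", data]
    elif explicit is not None:
        for key in ("baits", "subject", "baits_info"):
            cmd += [f"--{key}", explicit[key]]
    else:
        cmd += ["--file_collection", file_collection]

    # Append any optional flags as "flag value" pairs.
    for flag, value in (extra or {}).items():
        cmd += [flag, str(value)]
    return cmd
```

The returned list can then be handed to `subprocess.run(cmd, check=True)` from the standard library.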

| Command | Description | Default |
|---|---|---|
| `--mode_aln <STR>` | Tool used for multiple alignments (mafft/muscle) | mafft |
| `--mode_tree <STR>` | Tool used for tree construction (fasttree/raxml/iqtree) | fasttree |
| `--mafft <STR>` | MAFFT command | mafft |
| `--muscle <STR>` | MUSCLE command | muscle |
| `--fasttree <STR>` | FastTree command | fasttree |
| `--raxml <STR>` | RAxML-NG command | raxml-ng |
| `--iqtree <STR>` | IQ-TREE command | iqtree |
| `--blastp <STR>` | Path to the blastp binary, including the binary name | blastp |
| `--makeblastdb <STR>` | Path to the makeblastdb binary, including the binary name | makeblastdb |
| `--hmmsearch <STR>` | Path to hmmsearch | hmmsearch |

| Command | Description | Default |
|---|---|---|
| `--use_hmmer <y/n>` | Use HMMER for initial candidate search (y/n) | n |
| `--simcutp <FLOAT>` | BLASTp similarity cutoff | 40.00 |
| `--poscutp <INT>` | BLASTp cutoff for the number of possible hits per bait | 100 |
| `--lencutp <INT>` | BLASTp minimum length cutoff | 200 |
| `--bitcutp <INT>` | BLASTp bit score cutoff | 80 |
| `--filterdomain <y/n>` | Domain filter for classification (y/n) | n |
| `--minscore <FLOAT>` | Minimal score to be considered ingroup | 0.5 |
| `--numneighbours <INT>` | Number of neighbours for classification | 24 |
| `--neighbourdist <FLOAT>` | Neighbour distance | 5.0 |
| `--minneighbours <INT>` | Minimal number of neighbours | 0 |
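
To make the BLASTp cutoffs above concrete, here is a minimal sketch of how such filters could be applied to BLAST tabular output (`-outfmt 6`, where columns 3, 4 and 12 hold percent identity, alignment length and bit score). It is not the pipeline's own code. The function `filter_blast_hits` is hypothetical, and it assumes that baits are the queries, that the cutoffs are inclusive, and that the per-bait limit keeps the highest-scoring hits.

```python
def filter_blast_hits(lines, simcut=40.0, lencut=200, bitcut=80.0, poscut=100):
    """Keep hits passing the similarity, length and bit score cutoffs.

    lines: iterable of BLAST tabular (outfmt 6) rows.
    Returns {bait_id: [subject_id, ...]} with at most `poscut` hits per bait.
    """
    hits = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip empty lines and comment lines
        cols = line.rstrip("\n").split("\t")
        bait, subject = cols[0], cols[1]
        identity = float(cols[2])   # percent identity
        aln_len = int(cols[3])      # alignment length
        bitscore = float(cols[11])  # bit score
        if identity >= simcut and aln_len >= lencut and bitscore >= bitcut:
            hits.setdefault(bait, []).append((bitscore, subject))

    # Keep the best-scoring hits per bait (assumed meaning of --poscutp).
    return {
        bait: [s for _, s in sorted(found, key=lambda h: -h[0])[:poscut]]
        for bait, found in hits.items()
    }
```

Defaults mirror the values in the table; how CYP_Annotator actually combines these cutoffs should be checked against the script itself.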

| Command | Description | Default |
|---|---|---|
| `--static_pd <y/n>` | Ortholog assignment with static thresholds (y/n) | n |
| `--threshold_factor <FLOAT>` | Factor for adding deviation/IQR to mean/median in dynamic threshold calculation | 0.5 |
| `--subfamily_threshold <FLOAT>` | Threshold for patristic distance considering orthologs | 1.1 |
| `--family_threshold <FLOAT>` | Threshold for patristic distance considering further neighbours | 2.7 |
| `--individual_tree <y/n>` | Create individual tree with specific bait sequences (y/n) | n |
| `--bait_column <STR>` | Optional: column name for bait filtering | Evidence |
| `--bait_keyword <STR>` | Keyword for bait filtering | Literature |
| `--ortholog_prefix <STR>` | Prefix for ortholog filtering | All |
| `--individual_ortholog_prefix <STR>` | Prefix for individual ortholog filtering | None |
| `--collapse <y/n>` | Reduce in-paralogs to one representative | y |
| `--paralogdist <FLOAT>` | Distance of paralogs in masking step | 10.0 |
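
As an illustration of static thresholds (`--static_pd y`), the sketch below assigns a candidate based on its patristic distance to the nearest bait. This is one plausible reading of `--subfamily_threshold` and `--family_threshold`, not the pipeline's confirmed logic. The function `assign_ortholog`, its return format, and the input dictionaries are hypothetical, and the dynamic threshold calculation is not covered.

```python
def assign_ortholog(distances, baits_info,
                    subfamily_threshold=1.1, family_threshold=2.7):
    """Assign a candidate using its nearest bait.

    distances:  {bait_id: patristic distance from the candidate}
    baits_info: {bait_id: (family, subfamily)}
    Returns (level, label) where level is "subfamily", "family" or "unassigned".
    """
    if not distances:
        return ("unassigned", None)

    nearest = min(distances, key=distances.get)
    dist = distances[nearest]
    family, subfamily = baits_info[nearest]

    # Assumption: close hits justify a subfamily call, more distant
    # hits only a family call, anything beyond that stays unassigned.
    if dist <= subfamily_threshold:
        return ("subfamily", subfamily)
    if dist <= family_threshold:
        return ("family", family)
    return ("unassigned", None)
```

With the default values, a candidate at distance 0.8 from its nearest bait would receive that bait's subfamily, one at 2.0 only its family, and one at 3.5 no assignment.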

| Command | Description | Default |
|---|---|---|
| `--domain_Score <FLOAT>` | c-Evalue for hmm domain integration | 100 |
| `--motif_cEvalue <FLOAT>` | c-Evalue for hmm motif integration | 0.01 |
| `--min_avg_tpm <FLOAT>` | Average tpm for genes to be considered expressed | 1.0 |
| `--min_single_tpm <FLOAT>` | Single tpm for genes to be considered expressed | 5.0 |
| `--min_paralog_tpm <FLOAT>` | Min tpm for paralog conservation | 1.0 |
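
The two expression thresholds can be pictured with a small sketch. How CYP_Annotator combines `--min_avg_tpm` and `--min_single_tpm` is not stated here, so the example assumes that a gene counts as expressed if either condition holds. The function `is_expressed` is hypothetical.

```python
def is_expressed(tpm_values, min_avg_tpm=1.0, min_single_tpm=5.0):
    """Decide whether a gene counts as expressed across a set of samples.

    tpm_values: list of TPM values for one gene, one value per sample.
    Assumption: passing EITHER the average OR the single-sample
    threshold is sufficient.
    """
    if not tpm_values:
        return False
    average = sum(tpm_values) / len(tpm_values)
    return average >= min_avg_tpm or max(tpm_values) >= min_single_tpm
```

Under this assumption, a gene that is silent in most samples but reaches 6 TPM in one of them would still be kept, whereas a gene hovering around 0.5 TPM everywhere would be filtered out.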

| Command | Description | Default |
|---|---|---|
| `--cpu_max <INT>` | Max CPUs | 4 |
| `--parallel <y/n>` | Run classification in parallel mode (y/n) | y |
| `--num_process_candidates <INT>` | Max number of candidates per ingroup/outgroup classification | 200 |


The pipeline relies on several key input files. You can provide them using one of the three methods described in the Usage section. All input files except the subject file, which must be provided by the user, can be found in the repository's Data folder (https://github.com/k-georgi/CYP_Annotator/tree/main/Data).

| File | Argument | Description | Format |
|---|---|---|---|
| Baits | `--baits` | FASTA file containing the reference (bait) sequences, including ingroup (CYPs) and outgroup sequences. | .fasta / .fa |
| Baits Info | `--baits_info` | Crucial CSV file. Must contain a header. The first column `ID` must match the FASTA headers in the baits file. Other columns, like `Family`, `Subfamily`, and `Evidence`, are required for annotation and filtering. | .csv |
| Subject | `--subject` | FASTA file(s) or folder(s) containing the sequences (genome, transcriptome) to be annotated. Can be CDS or PEP. | .fasta / .fa |
| Data Folder | `--data` | A single folder containing `baits.fasta`, `baits_info.csv`, and `subject.fasta` (or a `subjects/` subdirectory). Optional files like `hmm_domains.hmm` can also be placed here. | Folder |
| File Collection | `--file_collection` | A CSV file specifying paths to all other inputs. Overrides all other path arguments if used. | .csv |

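
Because the first column of the baits info file must match the FASTA headers of the baits file, a quick consistency check before a long run can save time. The sketch below is not part of CYP_Annotator. The helpers `fasta_ids` and `check_baits` are hypothetical, the records are made-up examples, and it assumes a comma-separated file and headers trimmed at the first space or tab (as with `--trim_names y`).

```python
import csv
import io


def fasta_ids(handle, trim_names=True):
    """Collect sequence IDs from FASTA header lines."""
    ids = []
    for line in handle:
        if line.startswith(">"):
            header = line[1:].strip()
            if trim_names and header:
                header = header.split()[0]  # cut at first space or tab
            ids.append(header)
    return ids


def check_baits(fasta_handle, csv_handle):
    """Return (IDs missing from the CSV, IDs missing from the FASTA)."""
    in_fasta = set(fasta_ids(fasta_handle))
    reader = csv.reader(csv_handle)
    next(reader, None)  # skip the mandatory header row
    in_csv = {row[0].strip() for row in reader if row}
    return sorted(in_fasta - in_csv), sorted(in_csv - in_fasta)


# Made-up example: the second bait has no entry in the info file.
baits_fasta = io.StringIO(">CYP71A1 example\nMAAA\n>CYP51G1\nMBBB\n")
baits_info = io.StringIO(
    "ID,Family,Subfamily,Evidence\nCYP71A1,CYP71,CYP71A,Literature\n"
)
missing_in_csv, missing_in_fasta = check_baits(baits_fasta, baits_info)
```

For real files, the two `io.StringIO` objects would be replaced by handles from `open(...)`; both returned lists should be empty before the pipeline is started.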
Python3, dendropy, pandas, matplotlib, seaborn, ete3, BLAST+, HMMER, MAFFT or MUSCLE, FastTree or RAxML-NG or IQ-TREE
Georgi K. 'Development of a bioinformatics tool for automatic, functional annotation of plant cytochromes P450.'
Functions used in this script are in part taken from:
- MYB_annotator (doi: 10.1186/s12864-022-08452-5)
- bHLH_annotator (doi: 10.1186/s12864-023-09877-2)