CYP-Annotator

Overview

CYP-Annotator is a Python pipeline designed for the automatic, functional annotation of Cytochrome P450 (CYP) sequences. It processes "Subject" datasets (genomes or transcriptomes) to identify CYP candidates by comparing them against a curated set of "Bait" (reference) sequences (https://github.com/k-georgi/CYP_Annotator/tree/main/Data).

The workflow performs the following steps:

1. Candidate Search: Identifies initial candidates using BLASTp (default) or HMMER (--use_hmmer y).
2. Phylogenetic Classification: Classifies candidates into "Ingroup" (putative CYPs) and "Outgroup" based on their phylogenetic proximity to the bait and outgroup references.
3. Ortholog Assignment: Assigns ingroup candidates to known CYP families and subfamilies by calculating patristic distances (phylogenetic tree distance) to the reference baits.
4. Domain Analysis: Screens candidates for the presence of conserved CYP domains using an HMM profile (--hmm_domains).
5. Motif Analysis: Checks for conserved protein motifs via static definitions (--protein_motifs) and HMM profiles (--hmm_motifs).
6. Expression Analysis: If expression data is provided (--expression), it is used to filter out low-expressed genes or potential pseudogenes (--min_avg_tpm).
7. Paralog Collapsing: Optionally identifies and collapses paralog groups (--collapse y), retaining a single representative sequence.
8. Summary Generation: Compiles all findings into a comprehensive summary table (08_summary.txt).
9. Export: Generates tree visualizations (if ete3 is installed) and an interactive HTML overview of all results.
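
The ortholog assignment step relies on patristic distance, which is the sum of branch lengths on the path between two leaves of a tree. The sketch below illustrates that idea on a toy tree stored as a plain dictionary. It is not the pipeline's code: the tree, the sequence names, and the helper functions `path_to_root`, `patristic_distance`, and `assign` are hypothetical, and the rule "closest bait, graded by two thresholds" is an assumption. Only the two default thresholds (1.1 and 2.7) are taken from the `--subfamily_threshold` and `--family_threshold` options.

```python
# Hypothetical toy tree: node -> (parent, branch length to parent).
# The root has parent None. All names and lengths are invented.
TREE = {
    "root": (None, 0.0),
    "n1": ("root", 0.4),
    "n2": ("root", 1.5),
    "CYP71A1_bait": ("n1", 0.3),
    "candidate_1": ("n1", 0.2),
    "CYP85A1_bait": ("n2", 0.5),
}


def path_to_root(tree, node):
    """Map every ancestor of `node` (and node itself) to its distance from node."""
    distances = {}
    total = 0.0
    while node is not None:
        distances[node] = total
        parent, length = tree[node]
        total += length
        node = parent
    return distances


def patristic_distance(tree, a, b):
    """Sum of branch lengths on the path between nodes a and b."""
    dist_a = path_to_root(tree, a)
    dist_b = path_to_root(tree, b)
    # Among shared ancestors, the lowest common one gives the smallest sum.
    return min(dist_a[n] + dist_b[n] for n in dist_a if n in dist_b)


def assign(tree, candidate, baits, subfamily_threshold=1.1, family_threshold=2.7):
    """Assumed rule: take the closest bait and grade it by the two thresholds."""
    closest = min(baits, key=lambda bait: patristic_distance(tree, candidate, bait))
    distance = patristic_distance(tree, candidate, closest)
    if distance <= subfamily_threshold:
        level = "subfamily"
    elif distance <= family_threshold:
        level = "family"
    else:
        level = "unassigned"
    return closest, round(distance, 3), level


result = assign(TREE, "candidate_1", ["CYP71A1_bait", "CYP85A1_bait"])
print(result)
```

In the pipeline itself the tree comes from FastTree, RAxML-NG, or IQ-TREE and is presumably handled through dendropy; the dictionary here only keeps the example free of external dependencies.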

This tool adapts functions and logic from the MYB_annotator (doi: 10.1186/s12864-022-08452-5) and bHLH_annotator (doi: 10.1186/s12864-023-09877-2).

Setup

Installation in a conda environment

A straightforward way to install the dependencies is by creating a conda environment using the provided environment.yml file:

git clone https://github.com/k-georgi/CYP_Annotator
cd CYP_Annotator
conda env create -f environment.yml
conda activate cyp_annotator

Installation of the dependencies

The following dependencies are necessary for the execution of the pipeline:

Python 3:
- dendropy: pip install dendropy
- pandas: pip install pandas
- matplotlib: pip install matplotlib
- seaborn: pip install seaborn
- ete3: pip install ete3 (Required for tree visualization)
BLAST+: sudo apt install ncbi-blast+
HMMER: conda install -c bioconda hmmer
MAFFT: sudo apt install mafft
MUSCLE: (Installation recommended)
FastTree: sudo apt-get install -y fasttree
RAxML-NG: (Precompiled binaries recommended)
IQ-TREE: (Precompiled binaries recommended)

The CYP_Annotator can be cloned from GitHub:

git clone https://github.com/k-georgi/CYP_Annotator
cd CYP_Annotator

Usage

The pipeline is executed via the command line, using exactly one of the following input methods:

# Method 1: Using a data folder
python3 CYP_Annotator.py --data <PATH_TO_DATA_FOLDER>

# Method 2: Using explicit paths
python3 CYP_Annotator.py --baits <PATH> --subject <PATH> --baits_info <PATH>

# Method 3: Using a file collection CSV
python3 CYP_Annotator.py --file_collection <PATH_TO_FILE_COLLECTION_CSV>
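
When the pipeline is launched from another Python script, a small wrapper can enforce the "exactly one input method" rule before any process is started. The following is a minimal sketch; `build_command` is a hypothetical helper that is not part of CYP_Annotator, and it only uses the flags shown above. It is deliberately stricter than the pipeline may be, since it rejects any mix of methods.

```python
def build_command(data=None, baits=None, subject=None, baits_info=None,
                  file_collection=None, script="CYP_Annotator.py"):
    """Return the argument list for one of the three documented input methods."""
    explicit = (baits, subject, baits_info)
    methods = {
        "data": data is not None,
        "explicit": any(path is not None for path in explicit),
        "file_collection": file_collection is not None,
    }
    chosen = [name for name, used in methods.items() if used]
    if len(chosen) != 1:
        raise ValueError(f"choose exactly one input method, got: {chosen}")

    command = ["python3", script]
    if data is not None:
        command += ["--data", data]
    elif file_collection is not None:
        command += ["--file_collection", file_collection]
    else:
        # Method 2 needs all three explicit paths.
        if any(path is None for path in explicit):
            raise ValueError("--baits, --subject and --baits_info are all required")
        command += ["--baits", baits, "--subject", subject,
                    "--baits_info", baits_info]
    return command


print(build_command(data="Data"))
```

The returned list can be passed to `subprocess.run(command, check=True)`; optional arguments from the tables below would simply be appended to it.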

Optional arguments

Optional arguments regarding input/output

Command	Description	Default
`--outgroup <PATH>`	Path to outgroup FASTA file (optional)	`None`
`--hmm <PATH>`	Path to bait HMM file	`None`
`--hmm_domains <PATH>`	Path to hmm domains file (optional)	`None`
`--hmm_motifs <PATH>`	Path to hmm motifs file (optional)	`None`
`--protein_motifs <PATH>`	Path to protein motifs file (optional)	`None`
`--expression <PATH>`	Path to expression matrix file (optional)	`None`
`--metadata <PATH>`	Path to expression metadata file (optional)	`None`
`--output_folder <STR>`	Name of the output folder (optional)	`CYP_Annotator_Output`
`--processed_input_folder <STR>`	Name of the processed input folder (optional)	`None`
`--name <STR>`	String used as prefix in filenames	`""`
`--trim_names <y/n>`	Trim sequence names at first space or tab (y/n)	`y`

Optional arguments for tool adjustments

Command	Description	Default
`--mode_aln <STR>`	Tool used for multiple alignments (`mafft/muscle`)	`mafft`
`--mode_tree <STR>`	Tool used for tree construction (fasttree/raxml/iqtree)	`fasttree`
`--mafft <STR>`	MAFFT command	`mafft`
`--muscle <STR>`	MUSCLE command	`muscle`
`--fasttree <STR>`	Fasttree command	`fasttree`
`--raxml <STR>`	RAXML command	`raxml-ng`
`--iqtree <STR>`	IQ-TREE command	`iqtree`
`--blastp <STR>`	Path to the blastp binary (including the binary name)	`blastp`
`--makeblastdb <STR>`	Path to the makeblastdb binary (including the binary name)	`makeblastdb`
`--hmmsearch <STR>`	Path to hmmsearch	`hmmsearch`

Optional arguments for search and classification

Command	Description	Default
`--use_hmmer <y/n>`	Use HMMER for initial candidate search (y/n)	`n`
`--simcutp <FLOAT>`	BLASTp similarity cutoff	`40.00`
`--poscutp <INT>`	BLASTp cutoff for the number of possible hits per bait	`100`
`--lencutp <INT>`	BLASTp minimum length cutoff	`200`
`--bitcutp <INT>`	BLASTp bit score cutoff	`80`
`--filterdomain <y/n>`	Domain filter for classification (y/n)	`n`
`--minscore <FLOAT>`	Minimal score to be considered ingroup	`0.5`
`--numneighbours <INT>`	Number of neighbours for classification	`24`
`--neighbourdist <FLOAT>`	Neighbour distance	`5.0`
`--minneighbours <INT>`	Minimal number of neighbours	`0`
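
To make the BLASTp cutoffs concrete, the sketch below filters hits in the standard 12-column tabular BLAST format (`-outfmt 6`). How the pipeline applies the cutoffs internally is not documented here, so several points are assumptions: baits are treated as queries, `pident` stands in for similarity, `length` is the alignment length, the comparisons are inclusive, and the per-bait limit keeps the hits with the highest bit scores. `filter_hits` and the example rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(blast_text, simcutp=40.0, lencutp=200, bitcutp=80, poscutp=100):
    """Return subject IDs that pass all cutoffs, at most `poscutp` per bait."""
    kept = defaultdict(list)
    for row in csv.reader(io.StringIO(blast_text), delimiter="\t"):
        if not row:
            continue
        hit = dict(zip(BLAST_COLUMNS, row))
        if (float(hit["pident"]) >= simcutp
                and int(hit["length"]) >= lencutp
                and float(hit["bitscore"]) >= bitcutp):
            kept[hit["qseqid"]].append(hit)

    candidates = set()
    for hits in kept.values():
        # Assumed tie-breaking: best bit scores first.
        hits.sort(key=lambda h: float(h["bitscore"]), reverse=True)
        candidates.update(h["sseqid"] for h in hits[:poscutp])
    return candidates


# Invented example rows: only seqA and seqE pass every cutoff.
ROWS = [
    ["bait1", "seqA", "62.5", "480", "170", "4", "1", "480", "10", "489", "1e-150", "510"],
    ["bait1", "seqB", "35.0", "450", "280", "6", "1", "450", "5", "454", "1e-40", "200"],
    ["bait1", "seqC", "55.0", "120", "50", "2", "1", "120", "30", "149", "1e-20", "95"],
    ["bait1", "seqD", "45.0", "300", "160", "5", "1", "300", "8", "307", "1e-10", "60"],
    ["bait1", "seqE", "48.0", "410", "200", "7", "1", "410", "3", "412", "1e-90", "300"],
]
EXAMPLE = "\n".join("\t".join(row) for row in ROWS)

print(sorted(filter_hits(EXAMPLE)))
```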

Optional arguments for orthologs and paralogs

Command	Description	Default
`--static_pd <y/n>`	Ortholog assignment with static thresholds (y/n)	`n`
`--threshold_factor <FLOAT>`	Factor for adding deviation/IQR to mean/median in dynamic threshold calculation	`0.5`
`--subfamily_threshold <FLOAT>`	Threshold for patristic distance when considering orthologs	`1.1`
`--family_threshold <FLOAT>`	Threshold for patristic distance when considering further neighbours	`2.7`
`--individual_tree <y/n>`	Create individual tree with specific bait sequences (y/n)	`n`
`--bait_column <STR>`	Optional: column name for bait filtering	`Evidence`
`--bait_keyword <STR>`	Keyword for bait filtering	`Literature`
`--ortholog_prefix <STR>`	Prefix for ortholog filtering	`All`
`--individual_ortholog_prefix <STR>`	Prefix for individual ortholog filtering	`None`
`--collapse <y/n>`	Reduce in-paralogs to one representative	`y`
`--paralogdist <FLOAT>`	Distance of paralogs in masking step	`10.0`
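
The description of `--threshold_factor` suggests that a dynamic threshold is formed by adding a scaled spread to a central value: either mean plus standard deviation, or median plus interquartile range (IQR). The sketch below shows one way such a threshold could be computed. Which distances feed the calculation and when each variant is used are not stated above, so the function `dynamic_threshold`, its `robust` switch, and the example distances are assumptions for illustration only.

```python
import statistics


def dynamic_threshold(distances, threshold_factor=0.5, robust=False):
    """Central value plus `threshold_factor` times the spread of `distances`.

    robust=False: mean + factor * sample standard deviation
    robust=True:  median + factor * interquartile range
    """
    if len(distances) < 2:
        raise ValueError("need at least two distances to estimate a spread")
    if robust:
        q1, _, q3 = statistics.quantiles(distances, n=4)
        return statistics.median(distances) + threshold_factor * (q3 - q1)
    return statistics.mean(distances) + threshold_factor * statistics.stdev(distances)


# Invented patristic distances between reference baits of one subfamily.
reference_distances = [0.2, 0.4, 0.6, 0.8, 1.0]
print(dynamic_threshold(reference_distances))
print(dynamic_threshold(reference_distances, robust=True))
```

With `--static_pd y`, the fixed values of `--subfamily_threshold` and `--family_threshold` would be used instead of a value derived from the data.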

Optional arguments for functional analysis (Domains, Motifs, Expression)

Command	Description	Default
`--domain_Score <FLOAT>`	c-Evalue for hmm domain integration	`100`
`--motif_cEvalue <FLOAT>`	c-Evalue for hmm motif integration	`0.01`
`--min_avg_tpm <FLOAT>`	Average tpm for genes to be considered expressed	`1.0`
`--min_single_tpm <FLOAT>`	Single tpm for genes to be considered expressed	`5.0`
`--min_paralog_tpm <FLOAT>`	Min tpm for paralog conservation	`1.0`
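
The two expression cutoffs can be read as follows: a gene counts as expressed if its average TPM across samples reaches `--min_avg_tpm`, or if at least one sample reaches `--min_single_tpm`. Combining the two with "or" is an assumption, as the table does not say how they interact. `is_expressed`, `filter_expressed`, and the expression values below are hypothetical.

```python
def is_expressed(tpm_values, min_avg_tpm=1.0, min_single_tpm=5.0):
    """Assumed rule: average TPM or any single-sample TPM reaches its cutoff."""
    if not tpm_values:
        return False
    average = sum(tpm_values) / len(tpm_values)
    return average >= min_avg_tpm or max(tpm_values) >= min_single_tpm


def filter_expressed(matrix, **cutoffs):
    """Return the gene IDs from {gene: [TPM per sample]} that count as expressed."""
    return {gene for gene, values in matrix.items()
            if is_expressed(values, **cutoffs)}


# Invented expression matrix: gene ID -> TPM per sample.
EXPRESSION = {
    "geneA": [0.0, 0.2, 0.1],                       # low everywhere
    "geneB": [2.0, 3.0, 1.0],                       # passes the average cutoff
    "geneC": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.5],   # passes in a single sample
}

print(sorted(filter_expressed(EXPRESSION)))
```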

Optional arguments regarding performance

Command	Description	Default
`--cpu_max <INT>`	Max CPUs	`4`
`--parallel <y/n>`	Run classification in parallel mode (y/n)	`y`
`--num_process_candidates <INT>`	Max number of candidates per ingroup/outgroup classification	`200`

Adjustment of input data files

The pipeline relies on several key input files. You can provide them using one of the three methods described in the Usage section. All input files except the subject file, which the user must provide, can be found in the repository's Data folder (https://github.com/k-georgi/CYP_Annotator/tree/main/Data).

File	Argument	Description	Format
Baits	`--baits`	FASTA file containing the reference (bait) sequences, including ingroup (CYPs) and outgroup sequences.	`.fasta` / `.fa`
Baits Info	`--baits_info`	Crucial CSV file. Must contain a header. The first column `ID` must match the FASTA headers in the baits file. Other columns, like `Family`, `Subfamily`, and `Evidence`, are required for annotation and filtering.	`.csv`
Subject	`--subject`	FASTA file(s) or folder(s) containing the sequences (genome, transcriptome) to be annotated. Can be CDS or PEP.	`.fasta` / `.fa`
Data Folder	`--data`	A single folder containing `baits.fasta`, `baits_info.csv`, and `subject.fasta` (or a `subjects/` subdirectory). Optional files like `hmm_domains.hmm` can also be placed here.	Folder
File Collection	`--file_collection`	A CSV file specifying paths to all other inputs. Overrides all other path arguments if used.	`.csv`
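
Because the `ID` column of the baits info file must match the FASTA headers of the baits file, a quick consistency check before a run can help. The sketch below compares the two and reports missing columns and unmatched IDs. It assumes a comma-separated file, trims FASTA names at the first space or tab (the behaviour described for `--trim_names y`), and uses invented sequence names; `fasta_ids` and `check_baits_info` are hypothetical helpers, not part of the pipeline.

```python
import csv
import io
import re


def fasta_ids(fasta_text, trim_names=True):
    """Collect sequence names from FASTA header lines."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            if trim_names:
                # Cut the name at the first space or tab.
                name = re.split(r"[ \t]", name, maxsplit=1)[0]
            ids.append(name)
    return ids


def check_baits_info(fasta_text, csv_text,
                     required=("ID", "Family", "Subfamily", "Evidence")):
    """Report missing columns and IDs present in only one of the two files."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    info_ids = {row["ID"] for row in reader} if "ID" in columns else set()
    bait_ids = set(fasta_ids(fasta_text))
    return {
        "missing_columns": [c for c in required if c not in columns],
        "baits_without_info": sorted(bait_ids - info_ids),
        "info_without_bait": sorted(info_ids - bait_ids),
    }


# Invented example inputs.
FASTA = ">CYP71A1 some description\nMAAA\n>CYP85A1\tother\nMBBB\n>OUT1\nMCCC\n"
INFO = ("ID,Family,Subfamily,Evidence\n"
        "CYP71A1,CYP71,CYP71A,Literature\n"
        "CYP85A1,CYP85,CYP85A,Literature\n"
        "CYP999,CYP9,CYP9A,Predicted\n")

print(check_baits_info(FASTA, INFO))
```

Whether outgroup sequences in the baits file also need an entry in the info file is not stated here, so an ID listed under `baits_without_info` is a prompt to check rather than a definite error.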

Requirements

Python3, dendropy, pandas, matplotlib, seaborn, ete3, BLAST+, HMMER, MAFFT or MUSCLE, FastTree or RAxML-NG or IQ-TREE

Reference

Georgi K. 'Development of a bioinformatics tool for automatic, functional annotation of plant cytochromes P450.'

Functions used in this script are in part taken from:

MYB_annotator (doi: 10.1186/s12864-022-08452-5)
bHLH_annotator (doi: 10.1186/s12864-023-09877-2)
