This repository contains the code used to build and polish the bait sequence collection for the two_odd_annotator. However, this pipeline may also be serving as a blueprint in similar projects, where bait sequence collection for e.g., another gene superfamilies is done.
The two_odd_annotator is a tool for automated functional annotation for plant genes - in particularly, the 2-oxoglutarate Fe(II)-dependent dioxygenase (2ODD) superfamily.
It is based on the collection of well-characterized 2ODD genes. For one, those "bait" sequences help to pre-filter a set of input gene sequences for 2ODDs based on sequence similarity.
But more importantly, they build functional clusters in a phylogenetic tree, which then serves as the base for annotating candidate genes (input sequences).
This phylogenetic-evidence-based annotation has been adapted from previous work from the PuckerLab. Functional annotators for the flavonoid biosynthesis can be found on GitHub and are freely available on their server.
Building a characterized bait sequence collection is a manual task per se, where extensive literature research is required. Automating this is nearly impossible. However, it is possible to automatically retrieve the amino acid sequence if an accession number is provided. For this, the UniProt API can be used. Further constraint checks can be implemented to ensure the quality of the collected bait sequences.
As the set of characterized sequences is often limited to well-studied functions with a bias towards model organisms, it is recommended to extend the bait sequence collection. This is done by incorporating the datasets of various plant species, covering a wide taxonomic range. The datasets used here were taken from PuckerLab.
Thus, the main goal is to build a comprehensive bait sequence collection, which is as complete as possible, covering a wide taxonomic range and all known (and maybe novel) functions of the 2ODD superfamily.
You can always argue: the more sequences, the better the annotation, right? Well, obviously, this comes with the cost of increased runtime. So, it is important to find a good balance between the number of bait sequences and the quality of the annotation pipeline.
For this, the bait sequence collection can be optimized by reducing the number of sequences while maintaining the taxonomic and functional diversity.
-
At the beginning of the bait_sequence_collection pipeline is the collection of characterized bait sequences. It is used for documenting all necessary information during literature research for functionally characterized gene sequences (e.g., of the 2ODD gene superfamily).
- Each row represents one curated gene sequence.
- Minimum columns to fill out:
- accession
$\rightarrow$ for fetching the amino acid sequence via UniProt API. If no accession is available, the amino acid sequence itself can be provided in the 'aa_sequence' column. Then the accession must get assigned a UNIQUE placeholder (e.g., AraTha1) - 'gene_family'
$\rightarrow$ The target gene family, here: "2ODD" - 'function'
$\rightarrow$ The actual function the bait gene is performing within the plant, e.g. FLS (short for Flavonol synthase). Be consistent: not fls, or AtFLS or FLS1, only FLS for all FLS gene sequences! - 'metabolic_function'
$\rightarrow$ The metabolic pathway the gene is involved in, e.g. "flavonoid biosynthesis". - 'organism'
$\rightarrow$ The scientific name of the species from which the gene sequence is from - 'doi'
$\rightarrow$ The publication that functionally characterized the gene.
- accession
- It is also helpful for downstream analysis to assign functions (e.g., FLS, F3H, FNSI, LDOX) to one metabolic category (e.g., flavonoid biosynthesis) only.
The characterized bait sequences can be reduced down to non-redundant paralogous clades in the phylogenetic tree.
The pre-filering steps of the two_odd_annotator uses the characterized baits as reference sequences. There are three different methods, each for which needs a reference build:
- Hidden Markov Model (HMM) profile for HMMER
- DIAMOND database for DIAMOND blastp
- BLAST+ database for blastp
-
Expand the bait sequence collection by incorporating 2ODD-like sequences from various plant species, covering a wide taxonomic range.
- From more than 300 available plant datasets, 200 species were selected in four blocks of 50 species each, using taxonomic‑rank–aware sampling at the order level (sampling without replacement per order; when an order was depleted, species were drawn from the remaining orders), yielding four taxonomically diverse blocks.
- For all selected species, 2ODD‑like sequences were retrieved using three complementary pre‑filtering methods (HMMER, DIAMOND blastp, BLAST+ blastp). Sequences shorter or longer than 100 amino acids from the median are removed by default.
- The bait sequence collection was then extended iteratively: for each block, the pre‑filtered 2ODD‑like sequences were merged with the current bait FASTA (initially the characterized 2ODDs), a multiple sequence alignment and phylogenetic tree were computed, and redundant paralogous clades within species were collapsed to single representative sequences. The resulting post‑reduction FASTA served as input for the next block, until all 200 species had been incorporated.
- The merged collection was further refined by (i) manually removing noisy clades, (ii) keeping only sequences with at least one detectable 2ODD HMMER domain, and (iii) removing clusters found by TreeCluster that were smaller than 60 sequences long, resulting in a cleaned tree of roughly 14,000 sequences.
- This refined tree was clustered again with TreeCluster, not for filtering but for defining 2ODD functional clades. Clades with ≥60 sequences or containing at least one characterized bait were considered “major” clusters, as they typically formed well‑supported functional clades spanning a broad taxonomic range (often from gymnosperms to dicots). Smaller clades (<60 sequences) without strong support or with a narrower taxonomic range were classified as “minor” clusters.
- Each retained cluster was assigned a unique 2ODD ID, which forms the basis for annotating candidate genes in the two_odd_annotator.
- 2ODD IDs that contain characterized bait sequences can be associated with known functions and metabolic pathways, enabling functional interpretation of newly annotated candidates.
-
Optimize the bait sequence collection by reducing the number of sequences while maintaining the taxonomic and functional diversity.
- The 14K tree was reduced by keeping only one representative sequence per cluster of one taxonomic family (or below, e.g. genus). I.e., if a cluster contains 10 sequences from the same family, only one of those sequences is kept. This resulted in a reduced bait sequence collection of 5.5k sequences.
- The reduced bait sequence however will not result in the same tree cluster formation as important information has been lost. Some 2ODD IDs do not form the same monophyletic clades as before. This is an issue for the annotation part of the two_odd_annotator. That means, for the annotation part, it is best, when the bait sequences of one 2ODD ID cluster form a monophyletic clade. When a candidate gene falls into this particular clade, it can be assumed that the candidate gene also shows 2ODD01 functionality. However, if, let's say, the 2ODD02 clusters did not form a monophyletic clade, it must be assumed that something went wrong during the tree construction (due to chance and the lack of sufficient sequences).
- Therefore, the 5.5k family-level-reduced tree was further improved towards a more stable cluster formation. This is done by constructing a phylogenetic tree and identifying those clusters that did not resolve in clusters that contain > 80% of the used 2ODD ID sequences. For those clusters, random sequences from the original 14K tree were added until the cluster formation was improved. This process was repeated until all clusters showed an overall stable cluster formation.