Easy Annotation of Alternative Splicing Events Coupled with Nonsense-Mediated mRNA Decay (EANMD) is written in Perl to predict alternative splicing's potential to trigger NMD based on 50-nt rule: premature stop-codon before last exon-exon junctions (DJ) more than 50 nt. XGBoost is used to predict the NMD efficiency score based on transcripts factors.
- Support long-read and short-read RNA-seq sourced AS events.
- Support wide range species.
- Support predicting NMD types for original and new isoforms, as well as NMD_in and NMD_ex clustering.
- Support predicting NMD types for novel Skipped Exon (SE), Intron Retention (IR), Alternative 5' Splicing Site (A5SS), and Alternative 3' Splicing Site (A3SS) events.
- Support customizing the 50-nt rule.
- Support multi-threading.
- Support filtering out non-ATG-started transcripts, MXE events
- Support sorting and scoring AS events using our trained Xgboost model.
You can download the EANMD from here.
Simply unzip the downloaded zip file:
unzip EANMD-main.zipNavigate to the extracted folder and run EANMD:
cd EANMD-main
perl EANMD_GENCODE -h
If the screen displays usage and version information. It works.
If it displays you need Array::Utils module, please install it by:
cpan Array::Utils
If you need Perl Parallel::ForkManager module. You could install it by command:
cpan Parallel::ForkManagerTo run the test, run the following command in the terminal:
0. Download reference genome and GTF annotation, if you don't have them before.
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/GRCm38.p6.genome.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.primary_assembly.annotation.gtf.gzUnzip them:
gunzip GRCm38.p6.genome.fa.gz
gunzip gencode.vM25.primary_assembly.annotation.gtf.gz1. Run EANMD, test the 28 mouse AS events.
perl EANMD_GENCODE -g gencode.vM25.primary_assembly.annotation.gtf -in TestDataset/TestMouseMM10_SE28.input.txt GRCm38.p6.genome.faIf it runs, the test passes.
perl EANMD_GENCODE [options] -g <string|GTF annotation file> -in <string|input AS-event list> <input genome FASTA file(s)>
Options:
[-o string|Output prefix. Default: getseqsOut]
[-p int |Threads number. Default: 1]
[-d int |Distance to the last exon-exon junction. Default: 50]
[-s string|Specify attribute type in GTF annotation for sorting. Default: gene_id]
[-f string|Specify feature type in GTF annotation. Default: '']
-
Run EANMD, input GTF, AS events and Genome, output an outCombined.txt file.
perl EANMD_GENCODE -g gencode.vM25.primary_assembly.annotation.gtf -p 4 -o TestMouseMM10_SE28.EANMD -in TestMouseMM10_SE28.input.txt GRCm38.p6.genome.faMain output file:
TestMouseMM10_SE28.EANMD.outCombined.txtNote: The default version of EANMD only used the US End and DS Start boundaries to fuzzy search the reference transcripts. If you want to search the exact coordinates of US and DS boundaries (contain both Start and End), you could use the
EANMD_GENCODE_strictversion. -
Filter non-ATG-started transcripts and MXE results.
perl EANMDFilterOutSE.pl -o TestMouseMM10_SE28.EANMD.f TestMouseMM10_SE28.EANMD.outCombined.txtMain output file:
TestMouseMM10_SE28.EANMD.f.filter.ATG.outCombined.txtNote: SE events use EANMDFilterOutSE.pl; A5SS events use EANMDFilterOutA5SS.pl; A3SS and IR events use EANMDFilterOutA3SS.pl
-
Summarize the unique NMD Flag for each AS events. (Requires R environment)
Rscript EANMDflagcount.R -i TestMouseMM10_SE28.EANMD.f.filter.ATG.outCombined.txt -o TestMouseMM10_SE28.EANMD.fMain output file:
TestMouseMM10_SE28.EANMD.f.FinalUniqNMDflag.txtNote: Could use
EANMDflagcount_reverse.Rto get reverse ordered flags. The 'no_info', 'no_NMD', 'No_stop_codon' > 'NMD_in' or 'NMD_ex', the final flag will be the most frequent flag. -
Optional: If you want to get the NMD efficiency score, you could use the
EANMDflagcount_withUTRMFE.Rscript. This script will calculate the NMD Score based on the 3'-UTR length, the maximum distance of stop codon to the last exon-exon junction (Max DJ), 3'-UTR MFE/nt, Minimal stop codon position fraction (Min stop Pos F) and other factors (See Paper Methods). RNAfold, xgboost R package are required for this step. Higher the NMD Score, higher the NMD efficiency. We could filter the NMD events by the NMD Score, for example, NMD Score > 0.131.Rscript EANMDflagcount_withUTRMFE.R -i TestMouseMM10_SE28.EANMD.f.filter.ATG.outCombined.txt -o TestMouseMM10_SE28.EANMD.f
-
Get an AS-events list from filtering rMATS individual-counts results.
perl GetSEinput.pl [options] <input rmats individual-counts result>Options
[-o output prefix. default: rMATS_filtered.out] [-d int|min depth of average read counts. default: 20] [-m int|min count of inclusion events' UP|Downstream junction. default: 2] [-i float|delta PSI cutoff [0-1.0]. default: 0.15] [-f float|FDR cutoff [0-1.0]. default: 0.05] [-c1 int|The first sample numbers. default: 2] [-c2 int|The second sample numbers. default: 2] [-mf float|The US and DS fold change cutoff. default: 0.05]Example:
perl GetSEinput.pl -o Test.SE -d 2 -m 2 -i 0.1 -f 0.05 -c1 2 -c2 2 -mf 0.05 SE.MATS.JCEC.txt-d 2: minimum depth of average read counts per sample is 2. -m 2: minimum junction count for SE inclusion is 2. -i 0.1: delta PSI cutoff is 0.1. -f 0.05: max FDR is 0.05. -c1 2 and -c2 2: Group1 has 2 samples, Group2 has 2 samples. -mf 0.05 The US and DS fold change cutoff is 0.05, meaning the fold of upstream exon or downstream exon to skipped exon junctions reads must be greater than 1:20.
Output file:
Test.SE.EANMD.input.txt. Can be used inEANMDmain process.Note: SE events use GetSEinput.pl; A5SS events use GetA5SSinput.pl; A3SS events use GetA3SSinput.pl
-
Convert unique flag results to GTF
perl TransFlag2GTF.pl [options] <input AS flag file>
| Column | Description | Example |
|---|---|---|
| Gene_name | Gene Symbol | Flna |
| AS | Alternative splicing event SEUSDS Coordinates: SE@US@DS SE|US|DS = chr:start0:end:strand |
chrX:74240815:74240872:-@chrX:74240412:74240550:-@chrX:74241102:74241303:- |
| Optional | Custom column, will remain in output, suggest use gene_id | ENSMUSG00000031328 |
Descriptions of the 38 columns
| Column | Description |
|---|---|
| QueryCol1 | Input Column 1 |
| SEUSDSCoordinates | Input Column 2 |
| QueryCol3 | Input Column 3 |
| Transcript_id | Reference transcript id |
| Strand | Transcript strand |
| Exons | Total exon numbers of the reference transcript |
| Start_exon | Reference start codon exon number |
| Stop_exon | Reference stop codon exon number |
| SE_exon_Number | Skipped exon number for reference transcript |
| SE(US)_Pos | Skipped exon position for reference transcript |
| SE_length | Skipped exon length |
| Ori_CDS_length | Original CDS length |
| Ori_Start_codon_to_exon_end_seq_len | Length of start-codon to exon end |
| rm/add_SE_start_to_end_seq_len | Length of start-codon to exon end after remove or add the SE |
| SEseq | Skipped exon sequence |
| Ori_CDSexons_seq | Original Start codon to exons end sequence |
| rm/add_SE_CDSexons_seq | Sequence of start codon to exons end after remove or add the SE |
| Ori_last_junction_pos | Original last exon-exon junction position |
| Ori_last_dj | Distance of original stop codon to the last exon-exon junction (EJ) |
| Ori_NMD | Original or reference transcript NMD type |
| Start_codon | Start_codon sequence |
| Ori_AA | Original amino acid sequence |
| rm/add_SE_AA | Amino acid sequence after remove or add the SE |
| AA_len+1 | Original amino acid sequence length + 1 (stop codon) |
| Ori_AA_1st_stop_pos | Original amino acid 1st stop codon position |
| Ori_AA_stop_pos | Original amino acid stop codon positions |
| SEed_AA_1st_stop_pos | Amino acid 1st stop codon position after remove or add the SE |
| SEed_AA_stop_pos | Amino acid stop codon positions after remove or add the SE |
| Frame_shift_flag | Frame shift flag |
| New_1st_stop_pos_dj | Distance of the new 1st stop codon to the last EJ |
| NMD_flag | NMD flag after remove or add the SE |
| NMD_in/ex_flag | This AS event NMD type |
| source | This record from which intermediate file |
| SEupstreamCDS | How many nt upstream of the SE |
| SEupstreamAApos | How many AA upstream of the SE |
| UStransexonnumber | US exon number in reference transcript |
| DStransexonnumber | DS exon number in reference transcript |
| innerExonsofUSandDS | inner exon(s) between US and DS |
Visualization of EANMD results is available at: EANMD net: https://zlab1.shinyapps.io/EANMDnet
- @Kaining Hu - Initial work -
This project is licensed under the MIT License - see the LICENSE file for details





