Merging long and short read BRAKER3 output in BRAKER3.sif #55

Dear BRAKER team,

thank you very much for this amazing pipeline! Because quite a lot of my reads mapped outside annotated regions, I decided to re-annotate a genome that already has an annotation. I refer to that earlier annotation as the "original annotation" below and compare my results to it. Katharina Hoff suggested in her talk comparing the output to close relatives for quality assurance. How do I know which values to expect if the close relatives do not have good annotations either?
I am aware that BRAKER4 is now out and has a dual workflow, so I will definitely give that a try as well!

Using Singularity, I ran your **BRAKER3 container** on both **long** and **short reads** as well as **protein sequences (Arthropoda ODB12)**. The genome was softmasked by the original authors, so the **--softmasking flag** was used when running BRAKER3. All RNA-seq data were mapped to the genome with **HISAT2 (short reads)** and **minimap2 (long reads)**, respectively. The individual output files were **merged** into All_short.bam and All_long.bam, then **sorted** and **indexed using samtools**. The busco_lineage was set up beforehand because I am working on an HPC system and ran into firewall restrictions.
Just to save you some confusion: since I had to restart BRAKER3 for the long reads, I reused the previously created hintsfile.gff instead of regenerating it, which is why GeneMark-ET appears in the long-read paths below.
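For reference, the BAM preparation described above can be written down as a small command builder. This is only a sketch: the input names `rep1.bam` and `rep2.bam` and the `.sorted.bam` suffix are hypothetical, and the commands are printed rather than executed. Note that `samtools merge` is documented as merging sorted files, so sorting the individual files first may be the safer order.

```python
from typing import List


def bam_prep_commands(inputs: List[str], merged: str) -> List[List[str]]:
    """Build samtools commands that merge, coordinate-sort and index BAM files.

    The output naming scheme (<name>.sorted.bam) is an assumption for illustration.
    """
    sorted_bam = merged.replace(".bam", ".sorted.bam")
    return [
        ["samtools", "merge", merged, *inputs],          # combine per-sample BAMs
        ["samtools", "sort", "-o", sorted_bam, merged],  # coordinate-sort the result
        ["samtools", "index", sorted_bam],               # writes the .bai index
    ]


# Hypothetical per-sample inputs; only the merged name comes from the text above.
for cmd in bam_prep_commands(["rep1.bam", "rep2.bam"], "All_short.bam"):
    print(" ".join(cmd))
```

Each inner list can be passed to `subprocess.run(cmd, check=True)` once the file names match the real data.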

**I am now wondering about the proper way to merge them.** I have found a few issues on this topic (e.g. #50), but have not found a solution yet. There are also a few more points that surprised me, and I hope somebody can clear up my confusion.

In the old long_read_protocol.md (https://github.com/Gaius-Augustus/BRAKER/blob/master/docs/long_reads/long_read_protocol.md), the augustus.hints.gtf of both runs and the gmst.global.gtf are merged. In BRAKER3.sif, however, braker.gtf is already a merged form of the Augustus and GeneMark output. I have now tried three TSEBRA merging approaches, each with the braker3.cfg file as is:
1. Merging augustus.hints.gtf of both runs, using hintsfile.gff of both runs
2. Merging augustus.hints.gtf and genemark.gtf of both runs, using hintsfile.gff of both runs
3. Merging braker.gtf of both runs, using hintsfile.gff of both runs
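To make the three approaches easier to compare, here is a sketch that rebuilds the TSEBRA command line for each of them from the paths used in the table of this issue. It only uses the `-g`, `-e` and `-c` options that appear there; the helper names `tsebra_command`, `RUNS`, `GENEMARK_DIR` and `APPROACHES` are made up for illustration, and nothing is executed.

```python
from typing import Dict, List

RUNS = ("long", "short")
# GeneMark subdirectory of each run, as it appears in the paths of this issue
GENEMARK_DIR = {"long": "GeneMark-ET", "short": "GeneMark-ETP"}


def tsebra_command(gene_sets: List[str], config: str = "braker3.cfg") -> List[str]:
    """Assemble a tsebra call that merges gene_sets using the hints of both runs."""
    hints = [f"{run}/hintsfile.gff" for run in RUNS]
    return ["tsebra", "-g", ",".join(gene_sets), "-e", ",".join(hints), "-c", config]


AUGUSTUS = [f"{run}/Augustus/augustus.hints.gtf" for run in RUNS]
GENEMARK = [f"{run}/{GENEMARK_DIR[run]}/genemark.gtf" for run in RUNS]

# Gene set inputs for approaches 1 to 3 from the list above
APPROACHES: Dict[int, List[str]] = {
    1: AUGUSTUS,
    2: AUGUSTUS + GENEMARK,
    3: [f"{run}/braker.gtf" for run in RUNS],
}

for number, gene_sets in APPROACHES.items():
    print(number, " ".join(tsebra_command(gene_sets)))
```

Keeping the hints and config fixed in one place makes it obvious that the three runs differ only in the gene sets passed to `-g`.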

I would assume that the third option (using braker.gtf instead of intermediate files) is the correct one. While I am not sure, this is how I read Lars Gabriel's response in issue #18.

Things that made me wonder:
1. I am surprised to see that my **BUSCO score is lower** when merging both braker.gtf files than when only using both braker.aa output files. BUSCO v.5.4.3 was used in protein mode (protein sequences were generated with gffread where BRAKER did not produce them) with arthropoda_odb10, as this is the highest ODB version supported by this BUSCO version. I saw that #6 had similar issues. I translated the genes using gffread because gtf2aa was returning an error, but I will try to solve my issue with gtf2aa while waiting for a response. If an alternative translation script solves the problem, I will edit my issue accordingly.
2. Furthermore, the **mono:multi ratio** (output of analyze_exons.py) is >1, which I found surprising. In the original annotation it was 0.4.
3. The **braker.gtf** of my **short-read run** contains **very few genes** (<9,000, while the original annotation contains >19,000; the Augustus and GeneMark-ETP GTFs both hold >30,000 genes for this run). This was a bit surprising, as the short-read data I am using should be the same as the data used by the authors of the original annotation. They used BRAKER2 back then, with Arthropoda ODB10 instead of ODB12.
4. The duplicate rate in my BUSCO results is quite high. I read in the FAQs that this is expected because alternative splicing isoforms are interpreted as duplicates. However, in #29 it was suggested that: "One explanation for your results could be that your coverage is not very high." How do I know whether I have to adjust the intron_support value or whether this is simply because of alternative splicing?
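To put the duplicate rates of the different runs on a common footing, the BUSCO one-line summaries can be parsed and the duplicated percentage expressed relative to the complete percentage. This is a minimal sketch; `parse_busco` and the derived key `dup_share` are names invented here, and the number alone cannot tell isoforms apart from genuinely duplicated genes.

```python
import re
from typing import Dict

# Matches the BUSCO short summary, e.g. C:90.9%[S:80.1%,D:10.8%],F:2.9%,M:6.2%,n:1013
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],\s*"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary: str) -> Dict[str, float]:
    """Return the BUSCO percentages plus D/C, the duplicated share of complete BUSCOs."""
    match = BUSCO_RE.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary: {summary!r}")
    values = {key: float(val) for key, val in match.groupdict().items()}
    values["dup_share"] = values["D"] / values["C"] if values["C"] else 0.0
    return values


short_braker = parse_busco("C:92.8%[S:16.5%,D:76.3%],F:2.8%,M:4.4%,n:1013")
original = parse_busco("C:90.9%[S:80.1%,D:10.8%],F:2.9%,M:6.2%,n:1013")
print(round(short_braker["dup_share"], 2), round(original["dup_share"], 2))
```

For the two summaries shown, roughly 0.82 of the complete BUSCOs are duplicated in short/braker versus roughly 0.12 in the original annotation.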

Any suggestions are highly appreciated!

Please find below the quality assessment for each merging approach and for the individual output files:

Genome annotation | path | BUSCO | GTF file size | #genes (`awk '{print $3}' <GTF> \| grep "gene" \| wc -l`) | #proteins (`grep "^>" GFFREAD_OUTPUT.fa \| wc -l`) | #transcripts (analyze_exons.py) | Max #exons | Monoexonic transcripts | Multiexonic transcripts | Mono:Multi ratio | Min | 25% | 50% | 75% | Max
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
Original annotation | NA | C:90.9%[S:80.1%,D:10.8%],F:2.9%,M:6.2%,n:1013 | 35M | 19,421 | 19,402 | 18,892 | 73 | 5,394 | 13,498 | 0.4 | 1 | 1 | 3 | 7 | 73
BRAKER with Illumina | short/braker | C:92.8%[S:16.5%,D:76.3%],F:2.8%,M:4.4%,n:1013 | 21M | 8,947 | 11,834 | 10,771 | 52 | 2,249 | 8,522 | 0.26 | 1 | 2 | 4 | 8 | 53
BRAKER with Illumina | short/Augustus | C:85.4%[S:73.0%,D:12.4%],F:8.4%,M:6.2%,n:1013 | 39M | 37,226 | 38,010 | 37,226 | 72 | 18,109 | 19,117 | 0.95 | 1 | 1 | 2 | 4 | 72
BRAKER with Illumina | short/GeneMark-ETP | C:90.4%[S:77.4%,D:13.0%],F:3.3%,M:6.3%,n:1013 | 48M | 31,202 | 39,051 | 39,051 | 78 | 9,746 | 29,305 | 0.33 | 1 | 2 | 3 | 5 | 78
BRAKER with PacBio | long/braker | C:91.2%[S:64.7%,D:26.5%],F:4.0%,M:4.8%,n:1013 | 50M | 34,614 | 39,934 | 37,490 | 73 | 14,136 | 23,354 | 0.61 | 1 | 1 | 2 | 5 | 73
BRAKER with PacBio | long/Augustus | C:91.2%[S:73.0%,D:18.2%],F:4.0%,M:4.8%,n:1013 | 46M | 34,549 | 37,034 | 53,854 | 73 | 14,127 | 20,423 | 0.69 | 1 | 1 | 2 | 4 | 73
BRAKER with PacBio | long/GeneMark-ET | C:65.1%[S:56.2%,D:8.9%],F:19.2%,M:15.7%,n:1013 | 46M | 53,854 | 53,854 | 34,550 | 61 | 23,723 | 30,131 | 0.79 | 1 | 1 | 2 | 3 | 61
TSEBRA merge | `tsebra -g long/Augustus/augustus.hints.gtf,short/Augustus/augustus.hints.gtf -e long/hintsfile.gff,short/hintsfile.gff -c braker3.cfg` | C:72.4%[S:57.0%,D:15.4%],F:4.5%,M:23.1%,n:1013 | 33M | 29,905 | 33,533 | 21,288 | 64 | 21,288 | 8,617 | 2.47 | 1 | 1 | 1 | 2 | 64
TSEBRA merge | `tsebra -g long/Augustus/augustus.hints.gtf,short/Augustus/augustus.hints.gtf,long/GeneMark-ET/genemark.gtf,short/GeneMark-ETP/genemark.gtf -e long/hintsfile.gff,short/hintsfile.gff -c braker3.cfg` | C:87.3%[S:21.3%,D:66.0%],F:3.2%,M:9.5%,n:1013 | 65M | 42,185 | 67,402 | 59,748 | 64 | 42,968 | 16,780 | 2.56 | 1 | 1 | 1 | 2 | 64
TSEBRA merge | `tsebra -g long/braker.gtf,short/braker.gtf -e long/hintsfile.gff,short/hintsfile.gff -c braker3.cfg` | C:87.2%[S:21.1%,D:66.1%],F:3.3%,M:9.5%,n:1013 | 44M | 24,883 | 32,228 | 30,945 | 64 | 15,651 | 15,294 | 1.02 | 1 | 1 | 1 | 5 | 64
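The gene count and the mono:multi ratio in the table can be cross-checked with a few lines of Python. This sketch is not analyze_exons.py: it assumes tab-separated GTF with `transcript_id "..."` attributes, counts CDS segments per transcript (the real script may count exon features instead), and matches the feature column exactly rather than by substring as the `grep "gene"` in the table header does. The function name `gtf_stats` is made up for illustration.

```python
import re
from collections import Counter
from typing import Dict, Iterable

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_stats(lines: Iterable[str]) -> Dict[str, float]:
    """Count gene lines and classify transcripts by their number of CDS segments."""
    genes = 0
    cds_per_tx: Counter = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        feature, attributes = cols[2], cols[8]
        if feature == "gene":
            genes += 1
        elif feature == "CDS":
            match = TRANSCRIPT_ID.search(attributes)
            if match:
                cds_per_tx[match.group(1)] += 1
    mono = sum(1 for count in cds_per_tx.values() if count == 1)
    multi = len(cds_per_tx) - mono
    return {
        "genes": genes,
        "transcripts": len(cds_per_tx),
        "mono": mono,
        "multi": multi,
        "mono_multi_ratio": mono / multi if multi else float("inf"),
    }
```

Usage would look like `with open("braker.gtf") as handle: print(gtf_stats(handle))`. A file without `gene` feature lines reports 0 genes here, which is one way the #genes and #transcripts columns can drift apart between tools.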
