Dear BRAKER team,

thank you very much for this amazing pipeline! Since quite a lot of my reads mapped outside annotated regions, I decided to re-annotate the genome rather than rely on the former, original annotation. I refer to this original annotation below and compare my results to it. Katharina Hoff suggested in her talk to compare the output to close relatives for quality assurance. How do I know which values to expect if the close relatives do not have a good annotation either?
I am aware that BRAKER4 is now out and has a dual workflow, so I will definitely give that a try as well!
Using Singularity, I ran your BRAKER3 container on both long and short reads as well as protein sequences (Arthropoda ODB12). The genome was softmasked by the original authors, so I used the --softmasking flag when running BRAKER3. All RNA-seq data was mapped to the genome with HISAT2 (short reads) and minimap2 (long reads). The individual output files were merged into All_short.bam and All_long.bam, then sorted and indexed using samtools. The busco_lineage was created beforehand, as I am working on an HPC system and ran into firewall restrictions.
Just to save you some confusion: since I had to restart BRAKER3 for the long reads, I used the previously created hintsfile.gff instead of regenerating it, which is why you will find GeneMark-ET as part of the path below.
I am now wondering about the proper way to merge the short-read and long-read runs. I have found a few issues on this topic (e.g. #50), but have not found a solution yet. A few more points also surprised me, and I hope that somebody can clear up my confusion.
In the old long_read_protocol.md (https://github.com/Gaius-Augustus/BRAKER/blob/master/docs/long_reads/long_read_protocol.md), the augustus.hints.gtf of both runs and the gmst.global.gtf are merged. In BRAKER3.sif, however, braker.gtf is already a merged form of the Augustus and GeneMark output. I have now tried three approaches, each in combination with the unmodified braker3.cfg file:
1. Merging augustus.hints.gtf of both runs, using hintsfile.gff of both runs
2. Merging augustus.hints.gtf and genemark.gtf of both runs, using hintsfile.gff of both runs
3. Merging braker.gtf of both runs, using hintsfile.gff of both runs
I would assume that the third option (using braker.gtf instead of intermediate files) is the correct one. While I am not sure, this is how I read Lars Gabriel's response on issue #18.
Things that made me wonder:
I am surprised to see that my BUSCO score is lower when merging both braker.gtf files than when only using both braker.aa output files. BUSCO v5.4.3 was used in protein mode (protein sequences were generated using gffread if BRAKER did not produce them) with arthropoda_odb10, as this is the highest ODB version supported by this BUSCO version. I saw that the issue "benchmarking TSEBRA with BUSCO: my results are not good" (#6) described similar problems. I translated the genes using gffread because gtf2aa was returning an error, but I will try to solve my problem with gtf2aa while waiting for a response. If an alternative translation script solves the problem, I will edit my issue accordingly.
Furthermore, the mono:multi ratio (output of analyze_exons.py) is >1, which I found surprising. In the original annotation it was 0.4.
The braker.gtf of my short-read run contains very few genes (<9,000 while the original annotation contains >19,000; augustus and braker GTF both hold >30,000 genes for this run). This was a bit surprising as the short-read data I am using should be the same as the authors of the original annotation used. They had used BRAKER2 back then and used Arthropoda ODB10 instead of ODB12.
The duplicate rate in my BUSCO results is quite high. I read in the FAQs that this is expected, as alternative splicing isoforms are interpreted as duplicates. However, in the issue "New default.cfg increased duplicated BUSCO" (#29) it was suggested that: "One explanation for your results could be that your coverage is not very high." How can I tell whether I have to adjust the intron_support value or whether the duplicates are simply due to alternative splicing?
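For comparing the runs on the points above (gene counts, mono:multi ratio, number of isoforms per gene), a small GTF summary can help. The following is only a sketch under stated assumptions: it counts CDS rows per transcript_id, treats a transcript with exactly one CDS row as mono-exonic, and reports the ratio at transcript level. analyze_exons.py may count differently, so the numbers are not guaranteed to match its output. The function name `summarize_gtf` is made up for this example.

```python
import re
from collections import defaultdict

# Matches the two GTF attributes needed here, e.g. gene_id "g1"; transcript_id "g1.t1";
ATTR = re.compile(r'(gene_id|transcript_id) "([^"]+)"')


def summarize_gtf(lines):
    """Summarize genes, transcripts and mono-/multi-exonic transcripts.

    Only CDS rows are used; rows without both gene_id and transcript_id
    are skipped.
    """
    cds_per_tx = defaultdict(int)  # transcript_id -> number of CDS rows
    gene_of_tx = {}                # transcript_id -> gene_id
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        attrs = dict(ATTR.findall(cols[8]))
        if "gene_id" not in attrs or "transcript_id" not in attrs:
            continue
        tx = attrs["transcript_id"]
        cds_per_tx[tx] += 1
        gene_of_tx[tx] = attrs["gene_id"]

    mono = sum(1 for n in cds_per_tx.values() if n == 1)
    multi = len(cds_per_tx) - mono
    genes = set(gene_of_tx.values())
    return {
        "genes": len(genes),
        "transcripts": len(cds_per_tx),
        "mono": mono,
        "multi": multi,
        "mono_multi_ratio": mono / multi if multi else float("inf"),
        "transcripts_per_gene": len(cds_per_tx) / len(genes) if genes else 0.0,
    }


# Tiny made-up example: gene g1 has two isoforms, gene g2 has one.
EXAMPLE = [
    'chr1\tAUGUSTUS\tCDS\t100\t200\t.\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";',
    'chr1\tAUGUSTUS\tCDS\t300\t400\t.\t+\t1\ttranscript_id "g1.t1"; gene_id "g1";',
    'chr1\tAUGUSTUS\tCDS\t100\t250\t.\t+\t0\ttranscript_id "g1.t2"; gene_id "g1";',
    'chr1\tAUGUSTUS\tCDS\t900\t990\t.\t-\t0\ttranscript_id "g2.t1"; gene_id "g2";',
]
SUMMARY = summarize_gtf(EXAMPLE)
```

To apply it to a real file, pass an open file handle, for example `summarize_gtf(open("braker.gtf"))`. A transcripts_per_gene value well above 1 would suggest that a good share of the duplicated BUSCOs may come from isoforms rather than from redundant gene models, although this alone does not settle the intron_support question.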
Please find below all data generated as described above. Any suggestions are highly appreciated!
Please find below some quality assessment for each approach and the individual output files: