Skip to content

Commit 98f3c2a

Browse files
committed
Version 0.3
1 parent f7a584b commit 98f3c2a

File tree

2 files changed

+98
-46
lines changed

2 files changed

+98
-46
lines changed

README.md

Lines changed: 26 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1,17 +1,18 @@
1-
# backmap.pl v0.2
1+
# backmap.pl v0.3
22

33
## Description
4-
__Automatic read mapping and genome size estimate from coverage.__
4+
__Automatic read mapping and genome size estimation from coverage.__
55

6-
Automatic mapping of paired, unpaired, PacBio and Nanopore reads to an assembly with `bwa mem` or `minimap2`, execution of `qualimap bamqc` and estimation of genome size from mapped nucleotides divided by mode (>0).
7-
The tools `samtools`, `bwa` and/or `minimap2` need to be in your `$PATH`. The tools `qualimap`, `multiqc`, `bedtools` and `Rscript` are optional but needed to create the mapping quality report, coverage histogram and plot of the coverage distribution respectively.
6+
Automatic mapping of paired, unpaired, PacBio and Nanopore reads to an assembly with `bwa mem` or `minimap2`, execution of `qualimap bamqc` and estimation of genome size from mapped nucleotides divided by mode of the coverage distribution (>0). This method was first pulished in Schell et al. (2017) and is slightly refined in this script.
7+
The tools `samtools`, `bwa` and/or `minimap2` need to be in your `$PATH`. The tools `qualimap`, `multiqc`, `bedtools` and `Rscript` are optional but needed to create the mapping quality report, coverage histogram as well as genome size estimation and to plot of the coverage distribution respectively.
88

99
## Dependencies
1010

11-
`backmap.pl` will search for the following executables in your `$PATH`:
11+
`backmap.pl` needs the following perl modules and will search for executables in your `$PATH`:
1212

1313
Mandatory:
1414
- [Number::FormatEng](https://metacpan.org/pod/Number::FormatEng)
15+
- [Parallel::Loops](https://metacpan.org/pod/Parallel::Loops)
1516
- [samtools](https://github.com/samtools/samtools): `samtools`
1617

1718
Short read mapping:
@@ -23,8 +24,8 @@ Long read mapping:
2324
Optional:
2425
- [Qaulimap](http://qualimap.bioinfo.cipf.es/): `qualimap`
2526
- [MultiQC](https://multiqc.info/): `multiqc`
26-
- [bedtools](https://bedtools.readthedocs.io/en/latest/) `bedtools`
27-
- [Rscript](https://www.r-project.org/) `Rscript`
27+
- [bedtools](https://bedtools.readthedocs.io/en/latest/): `bedtools`
28+
- [Rscript](https://www.r-project.org/): `Rscript`
2829

2930
## Usage
3031

@@ -73,4 +74,21 @@ Options: [default]
7374
```
7475

7576
## Citation
76-
If you use this tool please cite the dependencies as well.
77+
Schell T, Feldmeyer B, Schmidt H, Greshake B, Tills O et al. (2017). An Annotated Draft Genome for _Radix auricularia_ (Gastropoda, Mollusca). _Genome Biology and Evolution_, 9(3):585–592, <https://doi.org/10.1093/gbe/evx032>
78+
79+
__If you use this tool please cite the dependencies as well:__
80+
81+
- samtools:
82+
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J et al. (2009). The Sequence Alignment/Map format and SAMtools. _Bioinformatics_, 25(16):2078–2079, <https://doi.org/10.1093/bioinformatics/btp352>
83+
- bwa mem:
84+
Li H (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. _arXiv preprint arXiv:1303.3997_.
85+
- minimap2:
86+
Li H (2018). Minimap2: pairwise alignment for nucleotide sequences. _Bioinformatics_, 34:3094–3100, <https://doi.org/10.1093/bioinformatics/bty191>
87+
- Qualimap:
88+
Okonechnikov K, Conesa A, García-Alcalde F (2016). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. _Bioinformatics_, 32(2):292–294, <https://doi.org/10.1093/bioinformatics/btv566>
89+
- MultiQC:
90+
Ewels P, Magnusson M, Lundin S, Käller M (2016). MultiQC: summarize analysis results for multiple tools and samples in a single report. _Bioinformatics_, 32(19):3047–3048, <https://doi.org/10.1093/bioinformatics/btw354>
91+
- bedtools:
92+
Quinlan AR, Hall IM (2010). BEDTools: a flexible suite of utilities for comparing genomic features. _Bioinformatics_, 26(6):841–842, <https://doi.org/10.1093/bioinformatics/btq033>
93+
- Rscript:
94+
R Core Team (2019). R: A Language and Environment for Statistical Computing. <http://www.R-project.org/>

backmap.pl

Lines changed: 72 additions & 38 deletions
Original file line numberDiff line numberDiff line change
@@ -5,8 +5,9 @@
55
use Cwd 'abs_path';
66
use IPC::Cmd qw[can_run run];
77
use Number::FormatEng qw(:all);
8+
use Parallel::Loops;
89

9-
my $version = "0.2";
10+
my $version = "0.3";
1011

1112
sub print_help{
1213
print STDOUT "\n";
@@ -270,7 +271,6 @@ sub round_format_pref{
270271
if(not -d "$out_dir"){
271272
print STDERR "INFO\tCreating output directory $out_dir\n";
272273
$mkdir_cmd="mkdir -p $out_dir";
273-
# exe_cmd($cmd,$verbose,$dry);
274274
}
275275

276276
if($prefix eq ""){
@@ -284,10 +284,6 @@ sub round_format_pref{
284284
}
285285

286286
if(scalar(@bam) > 0){
287-
if($prefix ne ""){
288-
print STDERR "INFO\tIgnoring -pre $prefix\n";
289-
$prefix = "";
290-
}
291287
if(defined(can_run("bedtools"))){
292288
if($create_histo_switch == 0 or $estimate_genome_size_switch == 0){
293289
$create_histo_switch = 1;
@@ -478,6 +474,15 @@ sub round_format_pref{
478474
chomp $rscript_version;
479475
}
480476

477+
my $multiqc_version;
478+
if(not defined(can_run("multiqc"))){
479+
$multiqc_version = "not detected";
480+
}
481+
else{
482+
$multiqc_version = `multiqc --version 2> /dev/null | awk '{print \$NF}'`;
483+
chomp $multiqc_version;
484+
}
485+
481486
my $verbose_word = "No";
482487
if($verbose == 1){
483488
$verbose_word = "Yes";
@@ -514,6 +519,7 @@ sub round_format_pref{
514519
print "qualimap: " . $qualimap_version . "\n";
515520
print "bedtools: " . $bedtools_version . "\n";
516521
print "Rscript: " . $rscript_version . "\n";
522+
print "multiqc: " . $multiqc_version . "\n";
517523
print "\n";
518524
print "User defined input\n";
519525
print "==================\n";
@@ -695,10 +701,22 @@ sub round_format_pref{
695701
my %cov_files;
696702
my %peak_cov;
697703
my %n0_all;
698-
my $global_ymax = 0;
704+
my @global_ymax = ();
705+
706+
my $maxProcs = scalar(@sorted_bams);
707+
if($threads < $maxProcs){
708+
$maxProcs = $threads;
709+
}
710+
699711
if($create_histo_switch == 1){
700712

701-
foreach(@sorted_bams){
713+
my $multiple_histos = Parallel::Loops->new($maxProcs);
714+
$multiple_histos->share(\%cov_files);
715+
$multiple_histos->share(\%peak_cov);
716+
$multiple_histos->share(\%n0_all);
717+
$multiple_histos->share(\@global_ymax);
718+
719+
$multiple_histos->foreach( \@sorted_bams, sub {
702720

703721
my $tech = "Illumina";
704722
if($_ =~ m/\.pb\.sort\.bam$/){
@@ -724,9 +742,7 @@ sub round_format_pref{
724742
$n0_all{$tech} = $n0;
725743
my $ymax = `sort -rgk2 $cov_hist_file | awk \'\$1!=0{print \$2}\' | head -1`;
726744
chomp $ymax;
727-
if($ymax > $global_ymax){
728-
$global_ymax = $ymax;
729-
}
745+
push(@global_ymax,$ymax);
730746

731747
open(R,'>',$cov_hist_file . ".plot.r") or die "ERROR\tCould not open file " . $cov_hist_file . "plot.r\n";
732748

@@ -753,20 +769,28 @@ sub round_format_pref{
753769
$cmd = "Rscript $cov_hist_file.plot.r > /dev/null 2> /dev/null";
754770
exe_cmd($cmd,$verbose,$dry);
755771
}
756-
}
772+
});
773+
774+
my @global_ymax = sort( {$b <=> $a} @global_ymax);
757775

758776
if(scalar(@sorted_bams) > 1){
777+
my $rscript = "$out_dir/plot.all.r";
778+
if($prefix ne ""){
779+
$rscript = "$out_dir/$prefix.plot.all.r";
780+
}
759781
if($dry == 0){
760782
my @techs = ("Illumina","PacBio","Nanopore");
761783

762-
open(RALL,'>',"$out_dir/plot.all.r") or die "ERROR\tCould not open file $out_dir/plot.all.r\n";
784+
open(RALL,'>',"$rscript") or die "ERROR\tCould not open file $rscript\n";
763785

764786
for(my $i = 0; $i < scalar(@techs); $i++){
765787
if(exists($cov_files{$techs[$i]})){
766788
print RALL "$techs[$i]=read.table(\"$cov_files{$techs[$i]}\")\n";
767789
}
768790
}
769-
print RALL "pdf(\"$out_dir/plot.all.pdf\")\n";
791+
my $pdf = $rscript;
792+
$pdf =~ s/r$/pdf/;
793+
print RALL "pdf(\"$pdf\")\n";
770794
my @legend = ();
771795
my @lty = ();
772796
my @col = ();
@@ -775,7 +799,7 @@ sub round_format_pref{
775799
push(@legend,"\"$techs[$i] N(0)=$n0_all{$techs[$i]}\"");
776800
push(@lty,"1");
777801
if($i == 0 and exists($cov_files{$techs[$i]})){
778-
print RALL "plot($techs[$i]\[,1],$techs[$i]\[,2],log=\"x\",type=\"l\",xlab=\"Coverage\",ylab=\"Count\",main=\"$assembly\",ylim=c(0,$global_ymax))\n";
802+
print RALL "plot($techs[$i]\[,1],$techs[$i]\[,2],log=\"x\",type=\"l\",xlab=\"Coverage\",ylab=\"Count\",main=\"$assembly\",ylim=c(0,$global_ymax[0]))\n";
779803
push(@col,"\"black\"");
780804
}
781805
if($i == 1 and exists($cov_files{$techs[$i]})){
@@ -794,11 +818,11 @@ sub round_format_pref{
794818
close RALL;
795819
}
796820
else{
797-
print STDERR "I would create $out_dir/plot.all.r\n";
821+
print STDERR "I would create $rscript\n";
798822
}
799823

800824
if(defined(can_run("Rscript"))){
801-
$cmd = "Rscript $out_dir/plot.all.r > /dev/null 2> /dev/null";
825+
$cmd = "Rscript $rscript > /dev/null 2> /dev/null";
802826
exe_cmd($cmd,$verbose,$dry);
803827
}
804828
}
@@ -807,13 +831,12 @@ sub round_format_pref{
807831

808832
if($estimate_genome_size_switch == 1){
809833

810-
if($dry == 0){
811-
print "\n";
812-
print "Output\n";
813-
print "======\n";
814-
}
834+
my %results;
835+
836+
my $multiple_genome_size = Parallel::Loops->new($maxProcs);
837+
$multiple_genome_size->share(\%results);
815838

816-
foreach(@sorted_bams){
839+
$multiple_genome_size->foreach( \@sorted_bams, sub {
817840
if($dry == 1){
818841
print STDERR "CMD\tsort -rgk2 $_.cov-hist | awk \'\$1!=0{print \$1}\' | head -1\n";
819842
}
@@ -822,6 +845,15 @@ sub round_format_pref{
822845
exe_cmd($cmd,$verbose,$dry);
823846

824847
if($dry == 0){
848+
849+
my $tech = "Illumina";
850+
if($_ =~ m/\.pb\.sort\.bam$/){
851+
$tech = "PacBio";
852+
}
853+
if($_ =~ m/\.ont\.sort\.bam$/){
854+
$tech = "Nanopore";
855+
}
856+
825857
my $assembly_length = 0;
826858
my $assemly_perc = 0;
827859
if($assembly ne ""){
@@ -836,30 +868,32 @@ sub round_format_pref{
836868
my $total_nucs = `grep "bases mapped (cigar):" $_.stats | awk -F'\\t' '{print \$3}'`;
837869
chomp $total_nucs;
838870

839-
my $tech = "Illumina";
840-
if($_ =~ m/\.pb\.sort\.bam$/){
841-
$tech = "PacBio";
842-
}
843-
if($_ =~ m/\.ont\.sort\.bam$/){
844-
$tech = "Nanopore";
845-
}
846-
847871
my $genome_size_estimate = $total_nucs / $peak_cov{$tech};
848-
849-
print $tech . " (" . $_ . ")\n";
850-
print "Mapped nucleotides: " . round_format_pref($total_nucs) . "b\n";
851-
print "Peak coverage: " . $peak_cov{$tech} . "\n";
852-
print "Genome size estimate: " . round_format_pref($genome_size_estimate) . "b\n";
872+
873+
$results{$tech} = $tech . " (" . $_ . ")\n" . "Mapped nucleotides: " . round_format_pref($total_nucs) . "b\n" . "Peak coverage: " . $peak_cov{$tech} . "\n" . "Genome size estimate: " . round_format_pref($genome_size_estimate) . "b\n";
853874
if($assembly_length > 0){
854875
$assemly_perc = sprintf("%.2f", ($assembly_length / $genome_size_estimate) * 100);
855-
print "Assembly length: " . round_format_pref($assembly_length) . "b ($assemly_perc% of estimate)\n";
876+
$results{$tech} = $results{$tech} . "Assembly length: " . round_format_pref($assembly_length) . "b ($assemly_perc% of estimate)\n";
856877
}
878+
857879
}
858880
else{
859881
print STDERR "CMD\tgrep \"bases mapped (cigar):\" $_.stats | awk -F\'\\t\' \'{print \$3}\'\n";
860882
}
861-
}
883+
});
862884

885+
if($dry == 0){
886+
print "\n";
887+
print "Output\n";
888+
print "======\n";
889+
890+
my @techs = ("Illumina","PacBio","Nanopore");
891+
for (my $i = 0; $i < scalar(@techs); $i++){
892+
if(exists($results{$techs[$i]})){
893+
print $results{$techs[$i]};
894+
}
895+
}
896+
}
863897
}
864898

865899
exit;

0 commit comments

Comments
 (0)