One feature that sets DADA2 apart from other algorithms is that the recommended workflow is to denoise forward and reverse reads separately, and then merge them. This has always seemed a bit odd to me, because the full-length sequence is never considered as a whole. Imagine two species with similar barcodes that differ by a single basepair close to the 3´ end, outside the overlapping region. Only the reverse reads will capture the difference between these two species; the forward reads will be identical. Assuming DADA2 works as expected, it will thus denoise the reverse reads into two separate ASVs, but the corresponding forward reads are indistinguishable and will be inferred as a single ASV. How mergePairs deals with this situation, I do not understand at all.
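To make the scenario concrete, here is a toy sketch in R (hypothetical sequences, error-free 2×150 bp reads from a 250 bp amplicon):

```r
set.seed(1)
template_a <- paste(sample(c("A", "C", "G", "T"), 250, replace = TRUE), collapse = "")
template_b <- template_a
# The two species differ only at position 240, near the 3´ end of the amplicon
substr(template_b, 240, 240) <- if (substr(template_a, 240, 240) == "A") "C" else "A"

# Error-free 2x150 reads: the overlap covers positions 101-150 only
fwd_a <- substr(template_a, 1, 150)
fwd_b <- substr(template_b, 1, 150)
rev_a <- substr(template_a, 101, 250)  # reverse reads shown on the forward strand
rev_b <- substr(template_b, 101, 250)

identical(fwd_a, fwd_b)  # TRUE  -> forward reads cannot separate the species
identical(rev_a, rev_b)  # FALSE -> only the reverse reads carry the difference
```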
To recap the reason for this situation: DADA2 is the only available algorithm that incorporates the quality scores into the denoising, which as I understand it is one of its strengths. The flip side, though, is that if forward and reverse reads are merged prior to denoising, all existing merging programs will recalibrate the quality scores in the overlapping region to artificially high values, creating different conditions in the overlapping region than in the flanking 5´ and 3´ regions. To be honest, I've never really understood why this is a problem beyond that it "messes with the DADA2 algorithm", but anyway, here we go.
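If I read the merging schemes correctly, the "artificially high values" arise like this: for two agreeing base calls with error probabilities $p_1$ and $p_2$, a typical posterior scheme (this particular formula is from Edgar & Flyvbjerg, 2015; PEAR's is similar in spirit) assigns

$$P(\text{error}) = \frac{p_1 p_2 / 3}{1 - p_1 - p_2 + \tfrac{4}{3} p_1 p_2}$$

so two agreeing Q30 bases ($p_1 = p_2 = 10^{-3}$) combine to roughly $3.3 \times 10^{-7}$, i.e. about Q65, which is also why mergers cap the output (at Q40 here).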
By this reasoning, the only thing hindering DADA2 from working on merged read pairs is the Q score recalibration. If there were a merging scheme that simply aligned the read pair and, at every position in the overlapping region, picked the base with the higher Q score and copied that Q score unmodified, the problem would be solved.
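In code, the rule I have in mind would be something like this (a minimal sketch, assuming the reads are already oriented on the same strand and the true overlap length is known; a real merger would of course have to find the overlap itself):

```r
merge_pick_max_q <- function(fwd_seq, fwd_q, rev_seq, rev_q, overlap) {
  # fwd_q / rev_q: integer Phred scores, one per base
  fw <- strsplit(fwd_seq, "")[[1]]
  rv <- strsplit(rev_seq, "")[[1]]
  n_f <- length(fw); n_r <- length(rv)
  stopifnot(overlap < n_f, overlap < n_r)

  # Indices of the overlapping region in each read
  f_idx <- (n_f - overlap + 1):n_f
  r_idx <- seq_len(overlap)

  # At every overlap position, keep the base with the higher Q score
  # and copy its Q score unmodified -- no posterior recalibration
  take_fwd <- fwd_q[f_idx] >= rev_q[r_idx]
  mid_seq  <- ifelse(take_fwd, fw[f_idx], rv[r_idx])
  mid_q    <- ifelse(take_fwd, fwd_q[f_idx], rev_q[r_idx])

  list(
    seq  = paste(c(fw[1:(n_f - overlap)], mid_seq, rv[(overlap + 1):n_r]),
                 collapse = ""),
    qual = c(fwd_q[1:(n_f - overlap)], mid_q, rev_q[(overlap + 1):n_r])
  )
}
```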
Well, as it turns out, such a program already exists, and it's called CASPER. Here is a plot showing the same sample first as raw forward and reverse reads (after primer removal with cutadapt), then merged with PEAR (which recalibrates Q scores, capped at Q40 in this case) and with CASPER (which doesn't recalibrate Q scores). They could probably be refined further with some length filtering, but anyway, my question is: would you rather denoise samples merged without Q score recalibration than denoise forward and reverse reads separately, as is the current practice? I mean conceptually, ignoring the potential quirks of CASPER specifically.
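Concretely, the merge-first workflow I'm imagining would just treat the merged reads as single-end input to dada2 (file names here are placeholders, and this of course sidesteps whether an error model learned on such reads is meaningful):

```r
library(dada2)

# "merged.fastq.gz" holds the CASPER-merged, primer-trimmed reads
filt <- file.path("filtered", "merged.fastq.gz")
filterAndTrim("merged.fastq.gz", filt, maxEE = 2, truncQ = 2, multithread = TRUE)

err    <- learnErrors(filt, multithread = TRUE)  # error model on merged reads
derep  <- derepFastq(filt)
dd     <- dada(derep, err = err, multithread = TRUE)
seqtab <- makeSequenceTable(dd)                  # no mergePairs step needed
```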
Also, could you possibly explain a little further how the recalibrated high Q scores actually constitute a problem for DADA2? If the reads are merged, the overlapping region has been subjected to mismatch correction, so some level of "denoising" has already taken place there. The conditions in the overlapping and flanking regions really are different, and that is exactly what the recalibrated quality scores are trying to reflect.
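(For reference on where quality enters the model: as I read Callahan et al. (2016), the probability that true sequence $i$ generates read $j$ is, roughly, a product of per-position transition probabilities conditioned on the quality score,

$$\lambda_{ji} = \prod_{l=0}^{L} p\big(j(l) \mid i(l), q_j(l)\big)$$

so a single learned error-rate-vs-quality function would have to describe both the raw flanking positions and the recalibrated overlap positions at once. Is that the crux of it?)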
I could also add that, in my small test, the merge-first approach created on average more ASVs than the denoise-first approach, although most of them were low-abundance. Interestingly, though, each method inferred some ASVs that the other missed, and not all of these were low-abundance.
There is, by the way, a paper comparing different denoising/clustering algorithms on a highly diverse dataset, in which DADA2 stands out (Nilsen et al., 2024). The authors also place their suspicions on the unique denoise-first approach:
> The elevated variation observed for DADA2 across the replicates was surprising, as DADA2 represents the most computer intensive approach and was expected to perform better than the other methods in removing technical variation [3]. It is beyond the scope of this work to dig into the details of the algorithms behind these methods, but what clearly separates the DADA2 from the other three is how paired-end reads are merged. Both Deblur, UNOISE, and Swarm rely on merging the pairs early and then denoise. In DADA2, the forward and reverse reads are denoised separately and then finally merged without taking into account the pairwise information inherent in paired-end reads. In high diversity samples, this strategy may cause problems if the actual true sequences are very similar.