Description
I am validating the accuracy of the DADA2 pipeline using simulated microbial community data. With simulated data, the "ground truth" identity of every read is encoded directly in its original FASTQ header (the sequence ID). My primary requirement is a simple, clean mapping that links each original input read to the final Amplicon Sequence Variant (ASV) that DADA2 assigns it to. Specifically, I need a reliable way to get this relationship for every read that remains in the pipeline after trimming and filtering. This mapping is essential for calculating performance metrics such as the Adjusted Rand Index (ARI), where the output clustering (the ASV) must be compared against the known true identity (the header) on a read-by-read basis. This validation focuses solely on clustering accuracy (sequence grouping) and does not rely on downstream taxonomic assignment.
Is there an existing, documented method or an internal DADA2 utility function that can provide this direct mapping table after the denoising, merging, and chimera-removal steps are complete?
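To make the request concrete, here is a minimal sketch in Python of how such a mapping table would be consumed once it exists. It is not DADA2 code. The header layout (`trueLabel|readNumber`), the ASV names, and the helper names `true_label_from_header` and `adjusted_rand_index` are all hypothetical and chosen only for illustration. The ARI is computed from the contingency table of true labels against ASV labels using the standard pair-counting formula.

```python
from collections import Counter
from math import comb


def true_label_from_header(read_id):
    # Assumption: the simulator writes headers as "<trueLabel>|<readNumber>".
    return read_id.split("|")[0]


def adjusted_rand_index(true_labels, asv_labels):
    """Pair-counting Adjusted Rand Index between two labelings of the same reads."""
    if len(true_labels) != len(asv_labels):
        raise ValueError("both labelings must cover the same reads")
    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two reads: the labelings agree trivially

    # Contingency counts: reads per (true label, ASV) cell and per margin.
    cell_counts = Counter(zip(true_labels, asv_labels))
    true_counts = Counter(true_labels)
    asv_counts = Counter(asv_labels)

    # Number of read pairs that fall together in each cell or margin.
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_asv = sum(comb(c, 2) for c in asv_counts.values())

    expected = sum_true * sum_asv / total_pairs
    maximum = (sum_true + sum_asv) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every read in one group in both labelings
    return (sum_cells - expected) / (maximum - expected)


# Hypothetical read-to-ASV mapping: one (read ID, ASV) row per retained read.
read_to_asv = [
    ("strainA|1", "ASV1"),
    ("strainA|2", "ASV1"),
    ("strainB|1", "ASV2"),
    ("strainB|2", "ASV3"),  # strainB is split across two ASVs
]

truth = [true_label_from_header(read_id) for read_id, _ in read_to_asv]
asvs = [asv for _, asv in read_to_asv]
print(adjusted_rand_index(truth, asvs))
```

The only input this needs is one row per retained read holding the original read ID and its final ASV, which is exactly the table I am asking about.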
Thanks.