My colleagues and I have a question about DADA2 that we have not been able to find an answer to. If I understand DADA2 correctly, it builds its error-correction model on one assumption: the only sequence that is "real" (not an artifact of sequencing, PCR, etc.) is the ONE most abundant sequence in the dataset. It then uses that sequence to determine the probability of error for every other sequence in the dataset. Is this correct?
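To make sure I'm asking about the right quantity, here is a toy version of what I understand the error model to compute: the probability that reads of a candidate "real" sequence get misread as some other sequence, as a product of per-base error rates. The flat rates below are made up for illustration; I understand the real model conditions on quality scores and estimates the rates from the data.

```python
def lam(i, j, match_p=0.97, mismatch_p=0.01):
    """Toy probability that true sequence i is read as sequence j,
    assuming independent per-base errors with flat rates (made-up
    numbers; real DADA2 uses quality-dependent rates it learns)."""
    p = 1.0
    for a, b in zip(i, j):
        p *= match_p if a == b else mismatch_p
    return p

# Expected count of an error sequence = lam(...) * abundance of the
# candidate; if the observed count is far higher, the sequence should
# (I think) be judged real rather than an error.
print(lam("ACGT", "ACGT"))  # 0.97**4
print(lam("ACGT", "ACGA"))  # one mismatch: 0.97**3 * 0.01
```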
For datasets with highly divergent sequences (e.g. many different phyla present), is DADA2 then more likely to classify sequences that are highly divergent from the most abundant sequence as "errors" and toss them?
I do see in the paper that reads are partitioned (the more divergent ones?), but it is not clear to me when that happens, or whether more divergent reads are separated out so that error models are determined from multiple starting points, rather than from the one most abundant sequence in the dataset.
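Here is a toy sketch of how I currently picture that partitioning step, so you can correct my mental model if it's wrong. The abundance test below is a crude stand-in for the paper's abundance p-value, and all the numbers and names are my own invention, not the real algorithm:

```python
from collections import Counter

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def toy_partition(reads, per_base_err=0.01, alpha=1e-3):
    """My guess at divisive partitioning: each partition is seeded by its
    most abundant sequence; a sequence starts a NEW partition when its
    abundance is too high to be explained as errors from the nearest
    existing center (crude threshold standing in for an abundance
    p-value). Returns the list of partition centers."""
    counts = Counter(reads)
    # consider unique sequences from most to least abundant
    uniques = sorted(counts, key=counts.get, reverse=True)
    centers = []
    for seq in uniques:
        if not centers:
            centers.append(seq)
            continue
        nearest = min(centers, key=lambda c: hamming(seq, c))
        d = hamming(seq, nearest)
        # expected count of seq if it arose purely from errors off nearest
        expected = counts[nearest] * per_base_err ** d
        if counts[seq] > expected / alpha:
            centers.append(seq)  # too abundant to be an error: new center
    return centers

# A divergent-but-abundant sequence (TTTT) becomes its own center
# instead of being discarded as an error off the dominant ACGT.
reads = ["ACGT"] * 100 + ["ACGA"] * 2 + ["TTTT"] * 50
print(toy_partition(reads))
```

If something like this is what the paper means, then a divergent sequence would actually be *more* likely to seed its own partition, not less — which is the crux of what I'm trying to confirm.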
Most of the evaluation of DADA2 seems focused on its ability to discriminate at the sub-OTU level, down to a single-nucleotide difference. I see less information on its ability to denoise artifacts from reads while retaining sequences that are divergent from the abundant organisms.
I have read the papers on DADA2, including independent evaluations and comparisons of DADA2, Deblur, OTU methods, etc. I have also listened to Ben Callahan discuss DADA2 in an interview, and I emailed him directly but did not get a reply (I am sure he is very busy), so I thought I would ask the QIIME2 forum. Thanks for any help or information.
Hope the question makes sense!