How does DADA2 handle sequences with different truncation lengths?

Claire010 · October 7, 2020, 1:44am

Just for discussion: what if one read is truncated at the 5'-end by one nucleotide because of the Q-score = 2 criterion, but the rest of the read is exactly the same as another, full-length read? DADA2 will still recognize them as different ASVs, right? If so, wouldn't that be a bit weird?

ChrisKeefe · October 7, 2020, 3:21pm

Good question, @Claire010!
DADA2 attempts to correct sequencing errors by building an error model from your data and then correcting nucleotides that are likely wrong. Because of this, your truncated read is very likely to be corrected so that it is grouped with a more prominent sequence (like the slightly longer sequence you described).

For more details, check out the DADA2 preprint, or click through to Nature Methods for the final publication if you have access.

Chris

Claire010 · October 7, 2020, 3:38pm

Hi Chris @ChrisKeefe,

Thanks a lot for your prompt reply. DADA2 does indeed build a model to correct low-quality nucleotides. However, this error correction happens after truncation. Do you mean DADA2 can save back the truncated nucleotide? If so, isn't it nonsense to truncate the read at the position with Q-score = 2 in DADA2?
Anyway, I will go through the paper again to see whether it covers this point.

ChrisKeefe · October 7, 2020, 9:25pm

@Claire010, I think you're right to go to the paper and/or the DADA2 docs for more concrete answers, as I haven't spent much time studying exactly how DADA2 performs error correction. The Sequence Comparison, Abundance P-value, and Divisive Partitioning Algorithm sections are likely the most useful to you for this.

When we truncate the read at the position where Q=2, we do so because we don't trust the read from that point on. DADA2 doesn't "save back" the truncated nucleotide, and it isn't deciding that we do trust the read. DADA2 is happy to let the untrustworthy data go. It uses its error model to choose likely "true sequences", and replaces likely-erroneous sequences with the likely "true sequence" they most closely resemble.
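
To make the truncation step concrete, here is a toy Python sketch. It is not DADA2's actual code, and the function names `truncate_at_quality` and `enforce_length` are made up for illustration. It assumes the filter cuts a read at the first base with a quality score at or below the threshold, and that a fixed truncation length, when one is set, discards any read that ends up shorter than that length.

```python
def truncate_at_quality(seq, quals, trunc_q=2):
    """Cut the read at the first base whose quality score is <= trunc_q."""
    for i, q in enumerate(quals):
        if q <= trunc_q:
            # Keep only the bases before the first low-quality position.
            return seq[:i], quals[:i]
    return seq, quals


def enforce_length(seq, quals, trunc_len):
    """Discard reads shorter than trunc_len; cut longer reads to trunc_len."""
    if len(seq) < trunc_len:
        return None  # assumed behaviour: too-short reads are dropped
    return seq[:trunc_len], quals[:trunc_len]


# Two reads with the same bases; the second has Q=2 at its last position.
full = ("ACGTACGT", [30] * 8)
short = truncate_at_quality("ACGTACGT", [30] * 7 + [2])

print(short)                                # the read loses its last base
print(enforce_length(*short, trunc_len=8))  # dropped: shorter than 8
print(enforce_length(*short, trunc_len=7) == enforce_length(*full, trunc_len=7))
```

Under these assumptions, whether the shortened read survives at all depends on the fixed truncation length: with a length of 8 it is discarded, and with a length of 7 both reads come out identical.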

Incidentally, I think DADA2 author @benjjneb generally recommends using the "base-wise" trim and trunc parameters, rather than trimming by q-score, as this allows DADA2 to rely on max-ee instead of raw q-scores.
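
For anyone unfamiliar with max-ee, here is a rough Python sketch of the idea, not DADA2's own implementation. The helper names `expected_errors` and `passes_max_ee` are hypothetical. It uses the standard Phred relationship, where a base with quality Q has error probability 10^(-Q/10), and sums those probabilities over the read to get its expected number of errors.

```python
def expected_errors(quals):
    """Sum of per-base error probabilities implied by Phred quality scores."""
    return sum(10 ** (-q / 10) for q in quals)


def passes_max_ee(quals, max_ee=2.0):
    """Keep a read only if its expected errors do not exceed max_ee."""
    return expected_errors(quals) <= max_ee


# 100 bases at Q20 (1% error each) give about 1 expected error in total.
print(expected_errors([20] * 100))

# A single Q2 base on its own contributes roughly 0.63 expected errors.
print(expected_errors([2]))

# 30 bases at Q10 (10% error each) give about 3 expected errors,
# so this read fails a max_ee threshold of 2.
print(passes_max_ee([10] * 30, max_ee=2.0))
```

The point of the design is that the whole read is judged on its accumulated error probability, rather than being cut at the first individual low score.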
