DADA2, truncation lengths and features number

benjjneb · November 21, 2017, 10:15pm

It is important to understand that DADA2 (actually, all currently known ASV methods I believe) is using repeated observations of the true biological sequence to distinguish real sequences from errors. Thus, for sequence variants to be detected, there must be at least 2 error-free reads from that sequence in the dataset, and the sensitivity of these methods to rare variants is constrained by the fraction of error-free reads in the data.

I haven't seen your data, but I am going to assume that, like almost all Illumina 2x300 sequencing data, there is a steep drop-off in the quality of the reverse reads over the last hundred-or-so bases. As a result, the probability that the entire 300nt reverse read is error-free is very low, perhaps 1%. In comparison, when you truncate at 240nts (eg) that probability might be much higher, perhaps 30%. So when you did ASV inference on the entire 300nt reverse reads, you failed to detect any sequences that were at an abundance of less than 200! As you need 200 * 1% = 2 error-free reads to have any chance of detecting that variant. But when you truncated to 240, now you could detect variants that were present in as few as 7-13 reads, and therefore had 2/3/4 error-free reads.
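To make the arithmetic above concrete, here is a minimal Python sketch. The function names and the example quality profile are made up for illustration and are not part of DADA2; the 1% and 30% figures are the rough guesses from this post, not measurements. The only outside fact used is the standard Phred definition, where a quality score Q corresponds to an error probability of 10^(-Q/10).

```python
import math


def prob_error_free(phred_scores):
    """Probability that a read with these per-base Phred scores has no errors.

    Assumes errors at each position are independent (a simplification).
    """
    p = 1.0
    for q in phred_scores:
        p *= 1.0 - 10 ** (-q / 10.0)
    return p


def min_abundance(p_error_free, required_error_free=2):
    """Smallest read count whose *expected* number of error-free reads
    reaches `required_error_free` (2, per the reasoning in this post)."""
    return math.ceil(required_error_free / p_error_free)


# Rough figures from the post: ~1% error-free at 300nt, ~30% at 240nt.
print(min_abundance(0.01))  # 200
print(min_abundance(0.30))  # 7

# Hypothetical reverse-read profile: good quality, then a crash near the end.
profile = [35] * 200 + [15] * 100
print(prob_error_free(profile))        # full 300nt read
print(prob_error_free(profile[:240]))  # truncated at 240nt
```

These are expected values only, so a variant right at the threshold will sometimes be detected and sometimes not.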

This is what you saw in your results: The 300nt results had significantly fewer variants because you lost all the low frequency variation.

What you should do is look at the quality profile of your data, and, while ensuring that you have enough sequence that your forward/reverse reads still overlap enough to merge, truncate off as many of the nucleotides that come after the quality crash as you can (gradual decline of quality is not a big deal, quality crashes are usually pretty obvious).

That does require doing some simple balancing yourself between those priorities (must maintain overlap, but get rid of post-crash bases) but usually it's pretty easy to pick something reasonably good. And if not, we are happy to help (but please post the quality profile of your reads and your amplicon setup in that case).
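The balancing described above can be sketched in a few lines of Python. This is only an illustration: the function names, the 450nt amplicon length, and the 20nt minimum overlap are assumed example values, not DADA2 defaults or anything from your data, so substitute the numbers for your own amplicon and merging settings.

```python
def overlap_after_truncation(amplicon_len, trunc_f, trunc_r):
    """Bases shared by the forward and reverse reads after truncation.

    `amplicon_len` is the length of the sequenced amplicon as covered by the
    reads. A negative result means the reads no longer meet at all.
    """
    return trunc_f + trunc_r - amplicon_len


def shortest_reverse_truncation(amplicon_len, trunc_f, min_overlap):
    """Shortest reverse truncation length that still leaves `min_overlap`."""
    return amplicon_len + min_overlap - trunc_f


# Hypothetical setup: 450nt amplicon, forward reads truncated at 280nt,
# and an assumed requirement of at least 20nt of overlap for merging.
AMPLICON_LEN = 450
MIN_OVERLAP = 20

for trunc_r in (300, 240, 200, 180):
    ov = overlap_after_truncation(AMPLICON_LEN, 280, trunc_r)
    status = "ok" if ov >= MIN_OVERLAP else "too short to merge"
    print(f"trunc_r={trunc_r}: overlap={ov} ({status})")

print(shortest_reverse_truncation(AMPLICON_LEN, 280, MIN_OVERLAP))  # 190
```

If the quality crash starts before the shortest allowed truncation point, you have to keep some post-crash bases or shift more of the length onto the forward read.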