DADA2: truncation lengths and number of features

mchialva · November 21, 2017, 2:50pm

Hi everybody,
I'm processing MiSeq libraries (2x300bp) of the 16S V3-V4 region with DADA2. I tried different truncation lengths (--p-trunc-len-f and --p-trunc-len-r) and was surprised by the results. I first tried truncating at 300, but the algorithm returned an error. Then I tried truncating at 0 (no truncation), 285, and 260, and obtained an increasing number of features (132, 172 and 462 respectively).

My questions are:

1. Why does no sequence truncation (truncation=0) lead to fewer features than truncation at 285 bp? I would expect the number of features to be inversely proportional to the truncation length.
2. Which analysis can I trust? I tried taxonomic classification of the reads truncated at 0 and 260 using the naive Bayes classifier and the SILVA database, and the number of assigned features (excluding Unassigned) was very different (102 vs 320 features, with most of the 102 features also detected in the second analysis).

The other DADA2 parameters were set as follows:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs qiime2_import.qza \
  --o-table table_dada2.qza \
  --o-representative-sequences rep_seqs_dada2.qza \
  --p-trim-left-f 13 \
  --p-trim-left-r 13 \
  --p-n-threads 25

Thanks for the support!

colinbrislawn · November 21, 2017, 3:27pm

I would like to hear the answer to this too, particularly why shorter reads produce more features with DADA2. I wonder if the greater trimming reduces the errors, leading to more resolved amplicons...

benjjneb · November 21, 2017, 10:15pm

It is important to understand that DADA2 (and, I believe, all currently known ASV methods) uses repeated observations of the true biological sequence to distinguish real sequences from errors. Thus, for a sequence variant to be detected, there must be at least 2 error-free reads of that sequence in the dataset, and the sensitivity of these methods to rare variants is constrained by the fraction of error-free reads in the data.

I haven't seen your data, but I am going to assume that, like almost all Illumina 2x300 sequencing data, there is a steep drop-off in the quality of the reverse reads over the last hundred-or-so bases. As a result, the probability that the entire 300nt reverse read is error-free is very low, perhaps 1%. In comparison, when you truncate at 240nts (e.g.), that probability might be much higher, perhaps 30%. So when you did ASV inference on the entire 300nt reverse reads, you failed to detect any sequences that were at an abundance of less than 100, because you need 100 * 1% = 1 error-free read to have any chance of detecting a variant (and roughly 200 reads to expect the 2 error-free reads mentioned above). But when you truncated to 240, you could now detect variants that were present in roughly 7-13 reads, which at 30% would have about 2/3/4 error-free reads.

This is what you saw in your results: the untruncated (300nt) results had significantly fewer variants because you lost all the low-frequency variation.
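The arithmetic above can be sketched in a few lines of Python. This is only a rough illustration, not DADA2's actual error model: it assumes errors are independent across positions and that the per-base error probability follows the Phred definition, p = 10^(-Q/10). The quality profile and the function names (`prob_error_free`, `min_abundance`) are hypothetical, chosen only to mimic a reverse read whose quality crashes near the end.

```python
import math

def prob_error_free(quals):
    """Probability that a read contains no errors, given per-base Phred scores.

    Assumes independent errors with p_error = 10 ** (-Q / 10).
    """
    return math.prod(1.0 - 10.0 ** (-q / 10.0) for q in quals)

def min_abundance(p_error_free, needed=2):
    """Smallest read count whose expected number of error-free reads is >= needed."""
    return math.ceil(needed / p_error_free)

# Hypothetical reverse-read profile: Q35 for 220 nt, then a crash to Q12.
profile = [35] * 220 + [12] * 80

for trunc in (300, 240, 220):
    p = prob_error_free(profile[:trunc])
    print(f"trunc={trunc}: P(error-free)={p:.4f}, "
          f"reads needed to expect 2 error-free copies: {min_abundance(p)}")
```

With a profile like this, the untruncated reads need far more copies of a variant before two error-free reads are expected than the truncated ones do, which is the effect described above.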

What you should do is look at the quality profile of your data and, while ensuring that you keep enough sequence for your forward/reverse reads to still overlap enough to merge, truncate off as many of the nucleotides that come after the quality crash as you can (a gradual decline in quality is not a big deal; quality crashes are usually pretty obvious).

That does require doing some simple balancing yourself between those priorities (must maintain overlap, but get rid of post-crash bases) but usually it's pretty easy to pick something reasonably good. And if not, we are happy to help (but please post the quality profile of your reads and your amplicon setup in that case).
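Quality crashes are usually picked out by eye from the quality plot, but the idea can be sketched in code. The following is a hypothetical heuristic, not something DADA2 or QIIME 2 does for you: it takes a list of per-position mean quality scores (assumed to have been extracted from the quality profile already) and reports the first position from which the quality stays below a threshold for a whole window. The function name, threshold and window size are arbitrary assumptions.

```python
def first_crash(mean_quals, threshold=20, window=10):
    """Return the 0-based position where a sustained quality crash begins.

    A "crash" is defined here (as an assumption) as the first run of
    `window` consecutive positions that are all below `threshold`.
    Returns None if no such run exists.
    """
    for start in range(len(mean_quals) - window + 1):
        chunk = mean_quals[start:start + window]
        if all(q < threshold for q in chunk):
            return start
    return None

# Hypothetical mean-quality profile: stable Q35, then a crash to Q12 at position 220.
profile = [35] * 220 + [12] * 80
print(first_crash(profile))
```

Requiring a full window below the threshold keeps a short dip from being mistaken for a crash. The position returned is only a starting point for choosing a truncation length, which still has to respect the overlap needed for merging.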

mchialva · November 22, 2017, 12:34pm

Thanks a lot for your really clear explanation. As you suggested, I looked at the quality profiles and observed a consistent drop-off in quality in the reverse reads. My amplicons are 341f-805r, so they should be about 460 bp.

Looking at the profiles, I would truncate the reverse reads at 220-240, but I'm not sure what the minimum acceptable overlap between reads is. Would 50-60 bp make sense? Would it be acceptable to use different truncation values for the forward and reverse reads?
Thanks again!

benjjneb · November 22, 2017, 2:32pm

Minimum overlap is 20 nts for merging, but you need to factor in some biological length variation in the amplicon size as well, so I'd say at least 30nts of overlap for V3/V4.

Different truncation values for forward and reverse reads are not just acceptable, they are usually the right choice. Here the reverse reads are clearly of worse quality, so they should be truncated earlier than the forward reads. Perhaps trunc-len-f 280 and trunc-len-r 220? That gives 500 total nts after truncation, so the reads still safely overlap, and it gets rid of the worst-quality region.
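The overlap check behind that suggestion is simple arithmetic and can be sketched as follows. This assumes the truncation lengths count from the start of each read as sequenced and that the amplicon length (about 460 nt here) spans the full region covered by the two reads. Trimming bases from the 5' end (trim-left) is assumed not to change the overlap. The function names and the default of 30 nt are illustrative only.

```python
def expected_overlap(trunc_f, trunc_r, amplicon_len):
    """Expected overlap (nt) between truncated forward and reverse reads.

    Assumes both reads start at opposite ends of an amplicon of
    length `amplicon_len` and extend inward.
    """
    return trunc_f + trunc_r - amplicon_len

def check_truncation(trunc_f, trunc_r, amplicon_len, min_overlap=30):
    """Return (overlap, ok), where ok is True if overlap >= min_overlap."""
    overlap = expected_overlap(trunc_f, trunc_r, amplicon_len)
    return overlap, overlap >= min_overlap

# Values discussed in this thread: 280 + 220 against a ~460 nt amplicon.
print(check_truncation(280, 220, 460))
# A more aggressive forward truncation that would fall short of 30 nt.
print(check_truncation(260, 220, 460))
```

Because amplicon length varies biologically, it is safer to run the check against the longest amplicon you expect rather than the average.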

mchialva · November 22, 2017, 2:45pm

Thanks a lot, now everything makes so much more sense!
