DADA2: Decreasing feature number as more sequences are maintained

Hello everyone,

I am working through the QIIME/microbiome sequence process for the first time and I've gotten hung up on a few questions on my DADA2 results.

I am working with 2x251 paired-end sequences of the V4 region, with 515F and 806R primers. The sequencing followed the EMP protocol, so the primers were removed before I received the multiplexed data. After demultiplexing my data, I tested a number of trimming and truncation parameters after reading many of the great discussions on this forum. To my novice eye, my demux graphs are generally pretty good, with the reverse reads being better quality than the forward reads, though there is some variability in quality in the middle of the reads. demux.qzv (325.4 KB)

I saw in other discussions that I should focus on "crashes" in quality rather than smaller dips. With this in mind I tried a DADA2 run with these parameters in qiime2-2021.4...

--p-trunc-len-f 220
--p-trunc-len-r 207
dada2-table-8.qzv (3.2 MB) denoising-stats-8.qzv (1.2 MB)

I also tried a run where I trimmed from the 5' end and truncated more from the 3' end.

--p-trim-left-f 14
--p-trunc-len-f 153
--p-trunc-len-r 222
dada2-table-10.qzv (3.1 MB) denoising-stats-10.qzv (1.2 MB)

In the second run I am retaining more sequences after filtering, merging, and chimera removal, but I have a lower number of features. I'm not sure why I get this result, or which of the DADA2 runs is closer to what is actually in the samples. Does anyone have a sense of which run I should move forward with, and why? Also, can you trim too much from your reads and so influence the identification/designation of features?

Thanks!!

Hi Mike,

Thanks for reaching out, happy to provide some insight on these dada2 results!

I first wanted to discuss the difference between p-trunc-len and p-trim (as this may help clarify your differing results):

p-trunc-len: Position at which read sequences (forward or reverse) should be truncated due to decrease in quality. This truncates the 3' end of the input sequences, which will be the bases that were sequenced in the last cycles. Reads that are shorter than this value will be discarded. After this parameter is applied there must still be at least a 12-nucleotide overlap between the forward and reverse reads. If 0 is provided, no truncation or length filtering will be performed.
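As a rough check on the overlap requirement, the arithmetic for the two runs in this thread can be sketched in Python. The amplicon length of 253 nt is an assumption (a value often seen for 515F-806R V4 amplicons once primers are removed), not something measured from this dataset, and `expected_overlap` is a hypothetical helper, not part of QIIME 2 or DADA2. The sketch also ignores 5' trimming, on the assumption that those bases lie outside the overlap region.

```python
# ASSUMPTION: ~253 nt V4 amplicon (515F-806R, primers already removed).
# Substitute the amplicon length that applies to your own data.
ASSUMED_AMPLICON_LEN = 253
MIN_OVERLAP = 12  # minimum overlap quoted in the parameter description


def expected_overlap(trunc_len_f, trunc_len_r, amplicon_len=ASSUMED_AMPLICON_LEN):
    """Approximate number of bases shared by the truncated forward and reverse reads."""
    return trunc_len_f + trunc_len_r - amplicon_len


# Truncation lengths from the two runs described in the question.
runs = {"first run": (220, 207), "second run": (153, 222)}
for name, (trunc_f, trunc_r) in runs.items():
    overlap = expected_overlap(trunc_f, trunc_r)
    print(f"{name}: overlap ~{overlap} nt, enough to merge: {overlap >= MIN_OVERLAP}")
```

Under that assumed amplicon length, both runs leave far more than 12 nt of overlap, so merging should not be the limiting factor in either one.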

p-trim (--p-trim-left-f / --p-trim-left-r on the command line): Position at which read sequences (forward or reverse) should be trimmed due to low quality. This trims the 5' end of the input sequences, which will be the bases that were sequenced in the first cycles.
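To make the two parameters concrete, here is a minimal Python sketch of what they do to a single read. It is an illustration only, not DADA2's actual code: `trim_and_truncate` is a hypothetical helper, and it assumes the truncation position is counted from the start of the original read, so the retained length is trunc-len minus trim-left.

```python
def trim_and_truncate(read, trim_left=0, trunc_len=0):
    """Return the retained part of a read, or None if the read is discarded.

    Illustrative only; mirrors the parameter descriptions above.
    """
    if trunc_len:
        if len(read) < trunc_len:
            return None  # reads shorter than trunc_len are discarded
        read = read[:trunc_len]  # cut bases off the 3' end
    return read[trim_left:]  # drop the first trim_left bases from the 5' end


# Hypothetical 251-nt forward read, using the second run's forward settings.
read = "ACGT" * 62 + "ACG"
kept = trim_and_truncate(read, trim_left=14, trunc_len=153)
print(len(kept))

# A hypothetical 120-nt read is shorter than trunc_len, so it is dropped.
print(trim_and_truncate("ACGT" * 30, trim_left=14, trunc_len=153))
```

With the second run's forward settings, each retained forward read would be 139 nt long under this assumption, covering positions 15 to 153 of the original read.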

The primary difference between these two parameters is that p-trunc-len will trim your reads (forward and/or reverse) on the 3' end (or the 'right' side), while p-trim will trim your reads (forward and/or reverse) on the 5' end (or the 'left' side). I've included a couple of graphics below that display where you've trimmed your sequences for your first and second runs.

In your second run, the portion of your forward reads that you kept and denoised has (on average) a higher quality score, while the forward and reverse reads kept in your first run have a lower average quality score.

I suspect that the lower number of features in your second run is actually a good sign - you input higher quality data overall, resulting in sequences with less variation (so most likely these are actual features vs. maybe some erroneous 'features' that show up in lower quality data).
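One way to see why a low-quality tail matters is to convert Phred scores into an expected number of errors per read, using the standard relation P(error) = 10^(-Q/10). The sketch below is only an illustration: the quality profiles are made up, not taken from the demux.qzv in this thread, and `expected_errors` is a hypothetical helper rather than a QIIME 2 or DADA2 function.

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities, with P(error) = 10 ** (-Q / 10)."""
    return sum(10 ** (-q / 10) for q in phred_scores)


# HYPOTHETICAL quality profiles, for illustration only.
high_quality_region = [38] * 150  # 150 bases at Q38
low_quality_tail = [20] * 50      # 50 extra bases at Q20

print(round(expected_errors(high_quality_region), 3))
print(round(expected_errors(high_quality_region + low_quality_tail), 3))
```

In this made-up case the 50 lower-quality bases contribute far more expected errors than the 150 high-quality bases combined, which is consistent with the idea that truncating before a quality drop leaves fewer erroneous sequences for the denoiser to resolve.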

To address this question, I'd refer back to the interactive quality plot from view.qiime2.org - I'd use that as your guide (for all of your denoising runs now and moving forward), and try trimming your sequences wherever you're seeing a general drop in your average quality score.

Hopefully this helps inform your p-trim and p-trunc-len choices for this dataset and moving forward - let me know if you need further clarification on anything!

Cheers,
Liz

Hi Liz,

Thank you for the reply! And also clarifying the DADA2 process. This is very helpful with determining the areas to trim/truncate and why I have differing feature numbers between runs.

I really appreciate the assistance.

Best,
Mike
