DADA2: Decreasing feature number as more sequences are maintained

lizgehret · June 9, 2021, 6:53pm

Hi Mike,

Thanks for reaching out, happy to provide some insight on these dada2 results!

I first wanted to discuss the difference between p-trunc-len and p-trim (as this may help clarify your differing results):

p-trunc-len: Position at which read sequences (forward or reverse) should be truncated due to a decrease in quality. This truncates the 3' end of the input sequences, which will be the bases that were sequenced in the last cycles. Reads that are shorter than this value will be discarded. After this parameter is applied, there must still be at least a 12-nucleotide overlap between the forward and reverse reads. If 0 is provided, no truncation or length filtering will be performed.

p-trim: Position at which read sequences (forward or reverse) should be trimmed due to low quality. This trims the 5' end of the input sequences, which will be the bases that were sequenced in the first cycles.

The primary difference between these two parameters is that p-trunc-len truncates your reads (forward and/or reverse) at the 3' end (the 'right' side), while p-trim trims your reads (forward and/or reverse) at the 5' end (the 'left' side). I've included a couple of graphics below that display where you've trimmed your sequences for your first and second runs.
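To make the left/right distinction concrete, here is a minimal Python sketch of what the two parameters do to a single read. It is only an illustration, not the DADA2 implementation: the function name `trim_and_truncate` and the 12-base read are made up for this example. In `qiime dada2 denoise-paired`, the corresponding options are `--p-trim-left-f` / `--p-trim-left-r` and `--p-trunc-len-f` / `--p-trunc-len-r`.

```python
from typing import Optional


def trim_and_truncate(read: str, trim_left: int = 0, trunc_len: int = 0) -> Optional[str]:
    """Illustrate 5' trimming and 3' truncation on one read (hypothetical helper).

    Returns None when the read is shorter than trunc_len, i.e. it would be discarded.
    A trunc_len of 0 means no truncation and no length filtering.
    """
    if trunc_len > 0:
        if len(read) < trunc_len:
            return None  # too short: discarded rather than kept
        read = read[:trunc_len]  # cut the 3' ('right') end
    return read[trim_left:]  # drop the first trim_left bases from the 5' ('left') end


read = "AACCGGTTAACC"  # made-up 12-base read
print(trim_and_truncate(read, trim_left=0, trunc_len=8))   # AACCGGTT
print(trim_and_truncate(read, trim_left=3, trunc_len=0))   # CGGTTAACC
print(trim_and_truncate(read, trim_left=3, trunc_len=8))   # CGGTT
print(trim_and_truncate(read, trim_left=0, trunc_len=20))  # None
```

Note that when both values are set, the retained read has length trunc_len minus trim_left, so the two settings together decide which region of the quality plot actually goes into denoising.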

In your second run, the region of your forward reads that you kept for denoising has (on average) a higher quality score, while in your first run both the forward and reverse reads have a lower average quality score.

I suspect that the lower number of features in your second run is actually a good sign: you input higher-quality data overall, resulting in sequences with less variation (so these are most likely real features rather than the erroneous 'features' that can show up in lower-quality data).

To address this question, I'd refer back to the interactive quality plot from view.qiime2.org - I'd use that as your guide (for all of your denoising runs now and moving forward), and try trimming your sequences wherever you're seeing a general drop in your average quality score.
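If it helps, the "look for the drop" step can be roughed out in Python once you have the per-position average scores from the quality plot. This is only a heuristic sketch under assumptions of my own: the function names, the cutoff of 25, the example scores, and the amplicon length are all hypothetical, and the overlap arithmetic assumes the forward and reverse reads start at opposite ends of the amplicon. The plot itself should stay your primary guide.

```python
from typing import Sequence


def suggest_trunc_len(mean_quality: Sequence[float], min_quality: float = 25.0) -> int:
    """Return how many leading positions to keep, dropping the low-quality 3' tail.

    Walks back from the last cycle while the average score is below min_quality.
    A result of 0 means every position was below the cutoff.
    """
    keep = len(mean_quality)
    while keep > 0 and mean_quality[keep - 1] < min_quality:
        keep -= 1
    return keep


def overlap_after_truncation(trunc_len_f: int, trunc_len_r: int, amplicon_len: int) -> int:
    """Bases shared by the truncated forward and reverse reads (negative means a gap)."""
    return trunc_len_f + trunc_len_r - amplicon_len


# Made-up average scores for a 10-cycle read; quality falls off at the 3' end.
scores = [36, 36, 35, 34, 33, 30, 27, 24, 21, 18]
print(suggest_trunc_len(scores))  # 7

# Hypothetical truncation lengths and amplicon length: is the overlap still >= 12?
print(overlap_after_truncation(240, 200, 420) >= 12)  # True
```

The second check matters because truncating more aggressively for quality shortens the overlap, and the reads can only be merged if at least 12 nucleotides still overlap.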

Hopefully this helps inform your p-trim and p-trunc-len choices for this dataset and moving forward - let me know if you need further clarification on anything!

Cheers,
Liz