dada2 Parameters: Understanding and Repercussions

(Todd Testerman) #1

Hello,

I was hoping someone would be able to check my understanding of some of the dada2 parameters commonly used. Our lab is attempting to standardize how we are treating our data and I think ensuring we know the effects of our filtering parameters is important before we can do that. I’m going to lay out a few scenarios and what I interpret to be the repercussions for setting the specific parameters. I included default parameters as well to discuss how they are affecting the process. We will assume all data is paired end, 250x250, and barcodes have been removed already. Amplicon is 16S V4 region (515F-806R), an approximately 253 bp amplicon without barcodes.

High Quality R1 and R2 Data (Median Q-Score does not drop below 30 until base 251)

Command Used: qiime dada2 denoise-paired --i-demultiplexed-seqs demux-paired-end.qza --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 250 --p-trunc-len-r 250 --p-trunc-q 2 --p-max-ee 2 --o-table table.qza --o-representative-sequences rep-seqs.qza --o-denoising-stats denoising-stats.qza

Outcome: All sequences will have a minimum length of 250 bp. If the quality of a base drops to Q2 or below before base 250, the entire sequence will be discarded. If more than 2 expected errors are detected across the entire sequence length (as determined by average quality score across all read bases), the read will be thrown out. Overall: a very restrictive length requirement, leaving high-quality sequences with a generous amount of overlap. Because the data set is high quality, few reads are likely to be thrown out even with this length requirement.
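To make the max-ee filter concrete, here is a minimal Python sketch, assuming the definition I find in the dada2 documentation: EE = sum(10^(-Q/10)) over the bases of the read. Note that this sums per-base error probabilities, which is not quite the same as averaging quality scores. The function name and the example reads are made up for illustration.

```python
def expected_errors(quals):
    """Sum of per-base error probabilities for a list of Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


# Hypothetical 250-base reads (not real data).
all_q30 = [30] * 250                 # Q30 at every position
tail_q10 = [30] * 220 + [10] * 30    # quality collapses to Q10 after base 220

print(round(expected_errors(all_q30), 2))   # 0.25
print(round(expected_errors(tail_q10), 2))  # 3.22
```

Under this definition the first read passes --p-max-ee 2 easily, while the second fails because of its last 30 bases alone.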

Command Used: qiime dada2 denoise-paired --i-demultiplexed-seqs demux-paired-end.qza --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f **250** --p-trunc-len-r **200** --p-trunc-q 2 --p-max-ee 2 --o-table table.qza --o-representative-sequences rep-seqs.qza --o-denoising-stats denoising-stats.qza

Outcome: All forward sequences will have a length of 250 and all reverse sequences a length of 200. If the quality of a base drops to Q2 or below before the specified length is reached, the read is thrown out. If more than 2 expected errors are detected across the entire sequence length (as determined by average quality score across all read bases), the read will be thrown out. Overall: similar to the previous scenario, though perhaps slightly more reverse reads would be kept because only the highest-quality portion of the read is retained (making it less likely that the default threshold of 2 expected errors is hit).

Medium Quality R1 and R2 Data (Median Q-Score begins to drop off around position 220)

Command Used: qiime dada2 denoise-paired --i-demultiplexed-seqs demux-paired-end.qza --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 250 --p-trunc-len-r 250 --p-trunc-q 2 --p-max-ee 2 --o-table table.qza --o-representative-sequences rep-seqs.qza --o-denoising-stats denoising-stats.qza

Outcome: Same command as in the first scenario. However, this time more data might be tossed due to deteriorating quality scores. For example, if a read's quality drops to Q2 at base 245, the read would be truncated there, fall below the set length threshold, and be thrown away. Additionally, including all 250 bp in the read will drive up the number of expected errors, possibly causing the read to be thrown out for having more than 2 expected errors. Overall: may lose a large portion of the data, depending on how poor the quality gets within the final 50 bp region.
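Here is how I picture the per-read filtering, written as a minimal Python sketch. The order of the steps (cut at the first base with Q <= trunc-q, then apply the length requirement, then apply max-ee to what is left) is my assumption from reading the dada2 filtering documentation, and the function is hypothetical rather than anything QIIME 2 exposes. It handles a single read; for paired data I assume both reads must pass.

```python
def filter_read(quals, trunc_len, trunc_q=2, max_ee=2.0):
    """Return the truncated quality list, or None if the read is discarded."""
    # 1. Cut the read at the first base with quality <= trunc_q.
    for i, q in enumerate(quals):
        if q <= trunc_q:
            quals = quals[:i]
            break
    # 2. Discard reads shorter than trunc_len; otherwise cut to trunc_len.
    if len(quals) < trunc_len:
        return None
    quals = quals[:trunc_len]
    # 3. Discard reads whose expected errors exceed max_ee.
    ee = sum(10 ** (-q / 10) for q in quals)
    if ee > max_ee:
        return None
    return quals


# Hypothetical read: Q35 throughout, except Q2 at base 245 (1-based).
read = [35] * 244 + [2] + [20] * 5

print(filter_read(read, trunc_len=250))       # None (only 244 bases survive)
print(len(filter_read(read, trunc_len=240)))  # 240
```

If this is right, the same read that is lost at a truncation length of 250 is kept at 240, since the Q2 base falls beyond the truncation point.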

Command Used: qiime dada2 denoise-paired --i-demultiplexed-seqs demux-paired-end.qza --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f **240** --p-trunc-len-r **200** --p-trunc-q 2 --p-max-ee 2 --o-table table.qza --o-representative-sequences rep-seqs.qza --o-denoising-stats denoising-stats.qza

Outcome: Truncation parameters shorten forward reads to 240 bp and reverse reads to 200 bp. These new parameters will lessen the impact of late, poor-quality bases on the error filter (max-ee 2), leading to fewer reads lost. Additionally, bases with a Phred score of 2 or less in the lower-quality portions of the reads will be less likely to cause read removal for failing to reach the read length requirement. Overall: an improved set of parameters for this data set. Fewer reads will be tossed due to the classic tailing off of quality. Reads will still be able to be paired, and low-quality regions will be further polished by the merging process.
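As a sanity check that the reads can still be paired, the expected overlap for each scenario can be worked out from the truncation lengths and the amplicon length. This is a rough Python sketch using the approximate 253 bp amplicon from above; the minimum overlap of 12 is an assumption on my part (I believe it is the default in dada2's mergePairs), and real amplicon lengths vary.

```python
def expected_overlap(trunc_f, trunc_r, amplicon_len):
    """Bases shared by the forward and reverse reads after truncation."""
    return trunc_f + trunc_r - amplicon_len


AMPLICON_LEN = 253  # approximate V4 amplicon length, from the post
MIN_OVERLAP = 12    # assumed minimum overlap needed for merging

for trunc_f, trunc_r in [(250, 250), (250, 200), (240, 200)]:
    overlap = expected_overlap(trunc_f, trunc_r, AMPLICON_LEN)
    print(trunc_f, trunc_r, overlap, overlap >= MIN_OVERLAP)
# 250 250 247 True
# 250 200 197 True
# 240 200 187 True
```

All three parameter sets leave far more overlap than the assumed minimum, so there appears to be room to truncate further if quality requires it.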

I realize this may have come across as a somewhat repetitive exercise, but I do think it’s important to understand these steps. Most of the time I see suggestions like “set truncation parameters when you see quality trail off in quality plots” and, while that definitely makes sense, I am hoping to drill down to more concrete recommendations. Something like “When median Q-score drops below 30 for two consecutive bases, set truncation parameters there” or something similar. I would love to hear anyone’s thoughts on this, and whether my interpretations of the above scenarios need correcting!
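To show the kind of rule I have in mind, here is a small Python sketch of the "two consecutive bases below Q30" idea. It is only my own hypothetical heuristic, not an established recommendation, and it assumes the per-position median Q-scores have already been pulled out of the quality plot data into a list.

```python
def suggest_trunc_len(median_q, threshold=30, run=2):
    """Number of bases to keep before the first `run` consecutive
    positions whose median quality is below `threshold`."""
    below = 0
    for i, q in enumerate(median_q):
        below = below + 1 if q < threshold else 0
        if below == run:
            return i - run + 1  # keep everything before the low-quality run
    return len(median_q)        # quality never dropped: keep the full read


# Hypothetical median profiles (not real data).
drops_at_220 = [36] * 220 + [28] * 30
single_dip = [36] * 100 + [28] + [36] * 149

print(suggest_trunc_len(drops_at_220))  # 220
print(suggest_trunc_len(single_dip))    # 250
```

Requiring two consecutive positions means a single isolated dip does not trigger truncation, which is the behaviour I would want from a rule like this.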

Thanks,

Todd