DADA2: how to determine trunc-len

YuZhang · June 2, 2021, 3:46pm

Dear all,

I am analyzing bacterial 16S data.
My primers are:
338F_806R
ACTCCTACGGGAGGCAGCAG
GGACTACHVGGGTWTCTAAT

The primers have already been removed from my data. The trimmed data is shown at the bottom.

My reverse reads are of poorer quality than my forward reads.
Now I have some questions about setting the DADA2 trunc-len parameters.
Should I use Q20 to determine the parameter settings? And should I read it at the bottom of the box or the middle of the box in the plot?

I want to use a command like this:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-trimmed.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 262 \
  --p-trunc-len-r 188 \
  --p-n-threads 40 \
  --o-representative-sequences rep-seq-dada2.qza \

Are these parameters appropriate?

Ahmed-Sidahmed · June 2, 2021, 6:11pm

Hi YuZhang,

Reverse reads are normally worse in quality than forward reads, at least with Illumina chemistry. It looks like you are amplifying the V3-V4 region, which is ~460 bp. You may need to tweak your truncation parameters in order to meet DADA2's requirement of having at least 20 bp of overlap.

YuZhang · June 3, 2021, 12:56am

Why 460? I think the insert DNA should be 806 - 338 - 20*2 = 428 bp. From the forum, it should retain wo bp overlap at least. Thus, the forward trunc len + reverse trunc len should be 458. Is that true?
But I don't know which value in the quality score box plot determines the trunc len. Is it where the middle of the box reaches quality score 20?

In other words, how do I determine the parameters from the table?

YuZhang · June 3, 2021, 12:37pm

I tried many times with different parameters, but they all lost too many sequences, with only 50% non-chimeric reads left.
I tried the following parameters:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-trimmed.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 269 \
  --p-trunc-len-r 189 \
  --p-n-threads 40 \
  --o-representative-sequences rep-seq-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza
qiime metadata tabulate \
  --m-input-file stats-dada2.qza \
  --o-visualization stats-dada2.qzv

stats-dada2.qzv (1.2 MB)

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-trimmed.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 264 \
  --p-trunc-len-r 200 \
  --p-n-threads 40 \
  --o-representative-sequences rep-seq-dada2-1.qza \
  --o-table table-dada2-1.qza \
  --o-denoising-stats stats-dada2-1.qza
qiime metadata tabulate \
  --m-input-file stats-dada2-1.qza \
  --o-visualization stats-dada2-1.qzv
stats-dada2-1.qzv (1.2 MB)

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-trimmed.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 269 \
  --p-trunc-len-r 176 \
  --p-n-threads 40 \
  --o-representative-sequences rep-seq-dada2-2.qza \
  --o-table table-dada2-2.qza \
  --o-denoising-stats stats-dada2-2.qza
qiime metadata tabulate \
  --m-input-file stats-dada2-2.qza \
  --o-visualization stats-dada2-2.qzv

stats-dada2-2.qzv (1.2 MB)

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-trimmed.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 264 \
  --p-trunc-len-r 176 \
  --p-n-threads 40 \
  --o-representative-sequences rep-seq-dada2-3.qza \
  --o-table table-dada2-3.qza \
  --o-denoising-stats stats-dada2-3.qza
qiime metadata tabulate \
  --m-input-file stats-dada2-3.qza \
  --o-visualization stats-dada2-3.qzv

stats-dada2-3.qzv (1.2 MB)

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-trimmed.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 227 \
  --p-trunc-len-r 227 \
  --p-n-threads 40 \
  --o-representative-sequences rep-seq-dada2-4.qza \
  --o-table table-dada2-4.qza \
  --o-denoising-stats stats-dada2-4.qza
qiime metadata tabulate \
  --m-input-file stats-dada2-4.qza \
  --o-visualization stats-dada2-4.qzv

stats-dada2-4.qzv (1.2 MB)

So, what should I do?

ChrisKeefe · June 3, 2021, 7:07pm

@YuZhang, DADA2 only requires 12 bp of overlap (20 was a requirement a long time ago). Your target amplicon is 806 - 338 = 468 bp long. You'll need 12 bp of overlap, so 468 + 12 = 480. And most 16S amplicons naturally vary by a few bp in length, so it would probably be safest if the lengths of your truncated forward and reverse reads summed to 483.

The parameters you select will vary depending on your study. Ideally, I aim to truncate where the median quality score drops below 30. For high-quality, high-biomass samples, this is often possible, but your data may not allow that.
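As a rough illustration of that rule of thumb, here is a minimal shell sketch. It assumes you have copied the per-position median quality scores from the quality plot into a two-column listing (position, median). That layout and the function name `first_below` are made up for this example and are not a QIIME 2 format, and the scores in the usage line are invented.

```shell
#!/bin/sh
# Print the first position whose median quality falls below a threshold.
# Input on stdin: "position median" pairs, one per line (hypothetical layout).
first_below() {
  awk -v q="$1" '
    $2 < q { print $1; found = 1; exit }
    END    { if (!found) print "none" }'
}

# Invented scores, for illustration only.
printf '%s\n' '1 38' '2 38' '3 36' '4 29' '5 24' | first_below 30
```

If this reports position p, a trunc-len of p - 1 keeps only the positions before the drop. A single early dip can make this stricter than necessary, so treat the result as a starting point rather than a final answer.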

If you include positions with low quality scores, you will lose reads during the filtering step. If you truncate too much, you will lose reads during the merging step, because your reads won't join. If you're unclear on this, spend some time reading about choosing DADA2 parameters here on this forum, and look at the original paper preprint, linked in our previous discussion of DADA2.

YuZhang · June 4, 2021, 1:07am

Thank you for your patient answer! I have already read most of the posts related to this topic.
I still have some questions.
Isn't the target amplicon 806 - 338 = 468 bp, minus the forward and reverse primers (2*20 bp), which gives 428 bp?
The forward and reverse primers have already been cut from my sequences. If only 12 bp of overlap is needed, should the forward trunc len + reverse trunc len be greater than 440?

Please see these commands. They also end up with few sequences.
So what should I do?

ChrisKeefe · June 4, 2021, 3:31pm

Based on your DADA2 stats, and what you have learned about setting DADA2 parameters, why do you think you are losing so many sequences?

YuZhang · June 5, 2021, 12:58am

Sorry, I don't know why this is happening. I thought a combined truncation length of 440 bp would be enough to merge my sequences: 806 - 338 = 468 bp, minus the forward and reverse primers (2*20 bp), gives 428 bp. Is that right? Maybe the data quality is poor? But the quality scores reach 30. I tried many times to find appropriate parameters, but it didn't work. Could you give me some advice on what to try?

ChrisKeefe · June 5, 2021, 3:34am

Do you understand how to interpret the DADA2 stats you shared? They will tell you where in the denoising process you are losing reads, and help you find a solution.

YuZhang · June 6, 2021, 12:20am

I think I probably do.

For example, the first sample, TS2, loses about 17.63% of its sequences (82.37% left) in the "filter" stage, due to the removal of sequences with Ns or more than two expected errors; it then loses about 24% in the "merged" step, due to low quality or poor matching.
Is that right?
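To make this kind of per-step reading repeatable, here is a minimal sketch that computes the loss at each stage as a percentage of the input reads. It assumes the stats artifact has been exported to a tab-separated file (for example with `qiime tools export`) and that the header uses the column names `input`, `filtered`, `merged` and `non-chimeric`. The function name `step_loss` and the counts in the usage example are invented for illustration and are not taken from the files in this thread.

```shell
#!/bin/sh
# Per-sample read loss at each DADA2 stage, as a percentage of input reads.
# Reads a tab-separated stats table on stdin; columns are looked up by name.
step_loss() {
  awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    /^#/    { next }   # skip the "#q2:types" directive row
    {
      input   = $(col["input"]);  filt    = $(col["filtered"])
      merged  = $(col["merged"]); nonchim = $(col["non-chimeric"])
      if (input <= 0) next
      printf "%s\tfilter:%.1f%%\tdenoise+merge:%.1f%%\tchimera:%.1f%%\n", $1,
        100 * (input - filt) / input,
        100 * (filt - merged) / input,
        100 * (merged - nonchim) / input
    }'
}

# Made-up counts for a single sample.
{
  printf 'sample-id\tinput\tfiltered\tpercentage of input passed filter\tdenoised\tmerged\tpercentage of input merged\tnon-chimeric\tpercentage of input non-chimeric\n'
  printf '#q2:types\tnumeric\tnumeric\tnumeric\tnumeric\tnumeric\tnumeric\tnumeric\tnumeric\n'
  printf 'S1\t1000\t800\t80.0\t780\t600\t60.0\t500\t50.0\n'
} | step_loss
```

A large "filter" loss points to low-quality positions being kept, while a large "denoise+merge" loss points to reads that are too short to overlap.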

ChrisKeefe · June 7, 2021, 10:57pm

Thanks for explaining, and thank you for your patience, @YuZhang. I think you have a solid understanding of what's going on here. Depending on your study, you may be suffering from a desire to optimize unnecessarily.

Results so far

Of the examples you show above, your best parameters look like the ones shown in stats-dada2-2.qzv, where you capture almost 50% of your sequences. Why? You retain the most data, and probably don't bias your data in the merge process.

Always pad your estimated amplicon length

One important insight from this table: when your sequence length drops to 440, you lose many more sequences in merging.

Your math here looks correct to me (sorry for any confusion I may have caused above - I overlooked that you had pre-trimmed primers). However, many of your sequences are probably slightly longer than 428 bp. If you truncate to a total length of 428 + 12 = 440, these sequences will fail to merge.
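The arithmetic can be written out as a small shell sketch. The positions (338, 806), the 20 bp primer lengths and the 12 bp overlap come from this thread; the function name `min_trunc_sum` and the idea of passing padding as a separate argument are just conveniences for the example.

```shell
#!/bin/sh
# Smallest trunc-len-f + trunc-len-r that still leaves the required overlap,
# for reads whose primers were already removed.
# Arguments: fwd_pos rev_pos fwd_primer_len rev_primer_len overlap padding
min_trunc_sum() {
  amplicon=$(( $2 - $1 - $3 - $4 ))
  echo $(( amplicon + $5 + $6 ))
}

needed=$(min_trunc_sum 338 806 20 20 12 0)   # 428 + 12 = 440
echo "bare minimum: $needed"

# Truncation pairs tried in this thread (forward reverse).
for pair in '269 189' '264 200' '269 176' '264 176' '227 227'; do
  set -- $pair
  echo "$1 + $2 = $(( $1 + $2 )), margin over minimum: $(( $1 + $2 - needed ))"
done
```

The 264/176 pair lands exactly on the minimum, with no room for sequences longer than 428 bp. How much padding is enough depends on the natural length variation of your amplicon, which this sketch does not estimate.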

Failing to account for sequence length variation can bias your data by dropping any sequences that are naturally a little longer, potentially disproportionately impacting certain taxa. A literature search may be necessary to figure out how much variation in length to expect in between 338 and 806. If in doubt, it's probably better to lose sequences to filtering than to merge failure.

Is this good enough?

With your best parameter set above, you have over 18,000 sequences in the shallowest of your samples. For many studies, that's more than enough depth for successful analysis.

Sequence attrition from denoising can be stressful, but it's important to remember what we're doing during this process. Quality filtering drops untrustworthy reads. Denoising with DADA2 corrects badly-read positions, and chimera-checking removes sequencing chimeras. This conservative approach helps remove artifactual taxa from your data, and generally leaves you with a better representation of the biological community you're studying at the "cost" of a bunch of untrustworthy reads.

Can you do better?

Maybe! If your study requires greater sequencing depth, you can continue to tweak these parameters until you get a better retention rate, at the cost of work and compute time. Focus on truncating the lowest-quality positions while preserving an adequate overall length and you'll be fine.

You could also denoise only your forward reads, and trade read length for some additional sequencing depth by skipping the merge process. If you're losing 25% in merging, that could be a significant improvement, but only if you don't need your full target amplicon.
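For reference, a forward-only run might look like the sketch below. As far as I know, `qiime dada2 denoise-single` accepts paired-end input and uses only the forward reads. The truncation length of 257 and the output file names are placeholders, not recommendations. The wrapper function `denoise_forward_only` and the `DRY_RUN` switch are invented for this example so the command can be inspected before it is run.

```shell
#!/bin/sh
# Forward-only denoising sketch. By default this only prints the command;
# set DRY_RUN=0 to execute it (requires an activated QIIME 2 environment).
DRY_RUN=${DRY_RUN:-1}

denoise_forward_only() {
  trunc_len=$1   # placeholder value; choose it from your own quality plot
  set -- qiime dada2 denoise-single \
    --i-demultiplexed-seqs demux-trimmed.qza \
    --p-trim-left 0 \
    --p-trunc-len "$trunc_len" \
    --p-n-threads 40 \
    --o-representative-sequences rep-seq-dada2-single.qza \
    --o-table table-dada2-single.qza \
    --o-denoising-stats stats-dada2-single.qza
  if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

denoise_forward_only 257
```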

Though I don't recommend it in the general case, it is possible to make DADA2 run more permissively if your data requires it. The q2-dada2 and DADA2 docs will be your best guides, along with forum posts here. Again, DADA2 has sensible defaults for most 16S work, so consider whether your study needs this.

Good luck, and let us know how everything goes!
Chris

YuZhang · June 11, 2021, 1:55am

Thanks! Thank you for your patience.

This is only my second time analyzing 16S data in QIIME 2.
I always worry that I will overlook some essential step because of my lack of knowledge, so I tried many times.
I finally chose the following parameters because they gave the highest percentage of input non-chimeric.
stats-dada2-6.qzv (1.2 MB)

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-trimmed.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 257 \
  --p-trunc-len-r 187 \
  --p-n-threads 40 \
  --o-representative-sequences rep-seq-dada2-6.qza \
  --o-table table-dada2-6.qza \
  --o-denoising-stats stats-dada2-6.qza

In addition, I'm not sure which columns the "filter", "merge" and "chime" values in this table are based on, because the numbers are not quite the same as the ones I calculated. Could you explain this in detail?

ChrisKeefe · June 11, 2021, 4:10pm

Glad you've found some parameters that are working well for you!

Those columns are just the mean of the values in the "percentage of input passed filter" etc columns of your dada2-stats. I checked median for a couple of them too, and they were pretty close.
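For anyone who wants to reproduce that kind of summary, here is a minimal sketch that computes the mean and median of one percentage column across samples. It assumes the stats have been exported to a tab-separated file whose header contains the exact column name passed in. The function name `column_mean_median` and the three sample values are invented for illustration.

```shell
#!/bin/sh
# Mean and median of one named column in a tab-separated stats table (stdin).
column_mean_median() {
  awk -F'\t' -v name="$1" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) c = i; next }
    /^#/ || !c { next }   # skip directive rows, or everything if no such column
    { print $c }' | sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) exit 1
      median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
      printf "mean=%.2f median=%.2f\n", sum / NR, median
    }'
}

# Made-up percentages for three samples.
printf 'sample-id\tpercentage of input non-chimeric\n#q2:types\tnumeric\nA\t40\nB\t90\nC\t50\n' |
  column_mean_median 'percentage of input non-chimeric'
```

Comparing mean and median is a quick way to see whether a few unusual samples are pulling the average up or down.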

YuZhang · June 13, 2021, 1:04am

Thanks! I always check the maximum. Is that also a reasonable approach?

ChrisKeefe · June 14, 2021, 5:25pm

Depends on what you're going for, I guess? I checked mean and median because I wanted a representation of the impact on all of your samples, but I am not generally that systematic about it.

My goal in setting these parameters isn't to achieve the "best possible" retention, it's to achieve "good enough" retention quickly and then move on with my analysis.

YuZhang · June 15, 2021, 12:24am

Thanks sir, I got it.