Dada2 denoising clarifications

Hi, good day. I am currently analysing 6 samples using QIIME 2. I previously analysed my data using QIIME 1.9.1; however, due to several papers discussing the advantages of using amplicon sequence variants (ASVs) instead of OTUs, I re-ran my analysis as I would like to see the difference for myself. My data were sequenced on an Illumina MiSeq (300 bp paired-end) with the initial counts below.
Sample 1 220,051
Sample 2 213,886
Sample 3 175,420
Sample 4 143,264
Sample 5 115,665
Sample 6 96,526

After importing my paired-end sequences, I used dada2 denoise-paired with the following parameters:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 280 \
  --p-trunc-len-r 280

However, after viewing the summaries of the feature table frequency (OTU table / BIOM) and the feature data sequences (rep sequences), it gives me the following counts:

Number of features: 365
Total frequency: 17,333

Sample 1 - 3,868
Sample 2 - 3,622
Sample 3 - 2,918
Sample 4 - 2,400
Sample 5 - 2,319
Sample 6 - 2,206

Is the 17,333 the total number of ASVs or sequences generated by DADA2? I found it really small compared with my initial analysis with QIIME 1, where I got 623,435 sequences after quality filtering.

Or is the 17,333 just the total number of unique ASVs that have been grouped together as similar to each other? If so, is there a way I can find the total number of sequences generated by DADA2?

Thank you so much, and I hope someone can enlighten me.

Hi @Mike26!

The sequence variants are all of your representative sequences, after quality control. The reduction in sequence counts you have shared looks pretty similar to what we have seen when using DADA2! Generally speaking, we see fewer sequence variants (you can think of these as 100% OTUs) because the quality control in DADA2 does a great job at sequence error correction. You can learn more about how DADA2 works here! The great thing about this is we are generally seeing higher-resolution data, and less of it, so it makes downstream processing even faster and easier!

I will defer to @benjjneb, @gregcaporaso, or @jairideout to fill in any gaps in this statement. Thanks!


As @thermokarst said, it is expected that DADA2 will produce fewer ASVs than many previous methods produced OTUs. However, I’m not sure that’s exactly what’s going on here; can you clarify what command produced the following output (or what it means exactly):

If you are getting a very low percentage of your input reads through to the end (i.e. you have 600k reads in and only 17k out) then there is a problem, probably related to your filtering/trimming parameters.

Right now the dada2 command is performing an entire workflow, including filtering, and you may be losing most of your reads at the filtering stage. Could you post the quality profile plot of your data (obtained from demux summarize)?
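For reference, the quality-profile visualization can be generated with something like the sketch below (the input filename is a placeholder; use the artifact name from your own import step):

```shell
# Sketch only: summarize the demultiplexed reads to get the per-position
# quality plots. "demux.qza" is an assumed placeholder filename.
qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv
```

The resulting demux.qzv can be opened at view.qiime2.org to inspect the interactive quality plots.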

Hi @benjjneb @thermokarst, yes. I realized that the primer sequences were still present at the start of my reads, hence I re-ran denoising using the command below. However, the reverse sequences are of really low quality, and I want to truncate my sequences at 250-260 bp. Sequences between 250-260 bp are still low quality, but I am afraid that if I cut a little bit more, my sequences will not be long enough for merging. Do you think it is okay to cut my sequences a little bit more? I post here the quality profile plot of my data.

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs importPE.qza \
  --o-table denoisetable.qza \
  --o-representative-sequences denoiserepseq.qza \
  --p-trim-left-f 17 \
  --p-trim-left-r 21 \
  --p-trunc-len-f 260 \
  --p-trunc-len-r 260 \
  --p-n-threads 0

After denoising, I ran the following commands, and it seems I got higher counts (324,637) compared to previously (17,333).

qiime feature-table summarize \
  --i-table denoisetable.qza \
  --o-visualization denoisetable.qzv \
  --m-sample-metadata-file metadatamappingfile.tsv

qiime feature-table tabulate-seqs \
  --i-data denoiserepseq.qza \
  --o-visualization denoiserepseq.qzv

Number of features: 1,228
Total frequency: 324,637

Thank you.

That looks much better. Based on what you posted, now that ~50% of reads are making it through the full pipeline (filtering + denoising + merging + chimera removal), I think the results are likely reasonable. You can try to relax parameters more to get more reads through, but that is quite likely to be counter-productive, as the additional data will be of lower quality. (One tiny thing: you don’t have to use the same trunc-len for the F and R reads.)

Since I imagine a number of people will run into the same initial issue you hit, it’s worth reiterating the two key points that caused your initial denoising to fail to get most reads through.

  1. For ASV methods, removing the primers is critical. The ambiguous nucleotides in primer regions are seen as real variation by ASV methods (whereas you could mostly get away with leaving them in when using fuzzier OTUs). Failure to remove primers causes many, even most, reads to be lost at the chimera-removal step, due to apparent chimeras formed between alternate primer versions and the actual sequence.

  2. For ASV methods, it is almost always advised to trim off the sequence after quality scores crash. This was a good idea with OTU methods, but it is even more critical for ASV methods, as these rely on repeated observations of the complete error-free sequence. The more post-quality-crash tail that is included, the lower the error-free-read fraction gets, which in turn hurts sensitivity to lower frequency variants.
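On point 1, a sequence-aware alternative to fixed-length --p-trim-left is the q2-cutadapt plugin, if your QIIME 2 release includes it. A minimal sketch, assuming the common 341F/805R V3V4 primer pair (an assumption; substitute your actual primer sequences and artifact names):

```shell
# Sketch only: remove primers by matching their sequences rather than by
# trimming a fixed number of bases. The primer sequences below are the
# standard 341F/805R V3V4 pair -- not necessarily this dataset's primers.
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f CCTACGGGNGGCWGCAG \
  --p-front-r GACTACHVGGGTATCTAATCC \
  --o-trimmed-sequences trimmed.qza
```

The trimmed.qza output can then be passed to denoise-paired with --p-trim-left-f/-r set to 0.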


Hi all, thanks for the replies. I am currently out of the lab for fieldwork, hence I was not able to reply sooner. I just have another clarification.

If I trim the primer sequences (~20 bp) and truncate the forward reads at 290 bp and the reverse reads at 250 bp, that would leave me with 270 bp forward and 230 bp reverse sequences, right? However, I read on the QIIME 2 forum somewhere that DADA2 requires a minimum of 20 nt of overlap on top of the natural length of the V3V4 region (~460 bp). Following this requirement of DADA2, I imagine my sequences would look like the diagram below.

Forward  -------------------------------- (~250 bp)
Reverse                     ----------------------------- (~210 bp)
                            |~20 bp overlap|

This would give me ~20 nt of overlap across the ~460 bp length of the V3V4 region. Do I imagine this correctly? Or would you suggest keeping longer sequences to make sure that I will have enough overlap?

Thank you all in advance.


You have the right idea. I would give it a shot and see how it goes.
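As a rough sanity check on the arithmetic (a sketch; the ~20 bp primers and ~460 bp V3V4 amplicon length are the estimates from this thread, not measured values):

```shell
# Back-of-the-envelope estimate of read overlap after primer trimming
# and truncation. All lengths are approximations from the thread.
trim=20        # approx. primer length removed from each read
trunc_f=290    # forward truncation position
trunc_r=250    # reverse truncation position
amplicon=460   # approx. V3V4 amplicon length

fwd=$((trunc_f - trim))            # usable forward bases
rev=$((trunc_r - trim))            # usable reverse bases
overlap=$((fwd + rev - amplicon))  # bases left for read merging

echo "forward=${fwd}bp reverse=${rev}bp overlap=${overlap}bp"
# prints: forward=270bp reverse=230bp overlap=40bp
```

By this estimate the reads would overlap by roughly 40 bp, comfortably above the ~20 nt minimum mentioned above, with some margin for slightly longer amplicons.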

If you want to be really sure of things, you could also compare the feature frequencies produced by running denoise-single on your forward reads to your merged feature frequencies from denoise-paired to get a feel for how many of your reads are getting merged successfully.
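A sketch of that comparison, reusing the artifact and parameter names from earlier in the thread (denoise-single on a paired-end import uses only the forward reads):

```shell
# Sketch only: denoise just the forward reads, then summarize the result so
# its per-sample frequencies can be compared against the paired-end table.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs importPE.qza \
  --p-trim-left 17 \
  --p-trunc-len 260 \
  --o-table fwd-table.qza \
  --o-representative-sequences fwd-repseq.qza

qiime feature-table summarize \
  --i-table fwd-table.qza \
  --o-visualization fwd-table.qzv
```

If the forward-only totals are much higher than the paired-end totals, reads are being lost at the merge step, which would suggest truncating less aggressively.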

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.