How can DADA2 denoising steps give the same result as the QIIME 1 split_libraries tool?


(Shuqi Li) #1

Hi All,

I am trying to reproduce a published study using QIIME 2.

I have been trying for weeks with little progress. I would really appreciate it if someone could kindly guide me a little.

The paper says: "Raw forward and reverse reads were aligned using fastq-join (32) and combined into a single fastq file using the split_libraries tool, which truncates reads with three consecutive base calls that exhibit a Phred score below 19. In total, 8,839,451 sequences (61.57% of total) were assembled and deemed passable following quality filtering."

The commands I have used are:

$ qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path original-data-set \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux.qza
$ qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv
$ qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 0 \
  --p-trunc-q 19 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

My questions are:

  1. How do I reproduce the step that “truncates reads with 3 consecutive base calls”?

  2. How can I see what percentage of sequences were assembled after the dada2 quality filtering, like the 61.57% in the example?

  3. I also found the --p-max-ee parameter in DADA2. What does this parameter do? I have read the explanation in the user documentation but still cannot understand it. Could someone give an example as an illustration?

  4. I have also run the dada2 denoising step several times, and the results are not always the same: sometimes it completes successfully, and other times it takes much longer and still fails. Is there a reason for this, and is there a better solution?

Thank you for your help!

Many thanks again!


(Nicholas Bokulich) #3

Hi @Shuqi,

I will point out that that paper used QIIME 1 for analysis. QIIME 2 pipelines will yield different results, particularly if you use denoising methods like dada2 instead of OTU clustering. Of course that is the whole point of doing a re-analysis with QIIME 2 — to get a more “nuanced” look (e.g., if you hypothesize that denoising methods will give a clearer view, eliminate some noise) — but from your questions it sounds like you may be attempting to replicate the results. If you want absolute replication, use the same methods the authors did.

To truncate reads at runs of consecutive low-quality base calls, see the q2-quality-filter plugin.

For the percentage of sequences retained: dada2 outputs a stats file containing the number of merged sequences per sample. Alternatively, use qiime feature-table summarize to get a total count. Compare that to the sequence count from the qiime demux summarize visualization.
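
As a rough illustration of that comparison, here is a minimal Python sketch. The sample IDs and counts are invented, and the column names (`input`, `merged`) are an assumption about how an exported stats table is laid out, so check them against your own file. `percent_retained` is a hypothetical helper, not part of QIIME 2.

```python
# Hypothetical per-sample counts, as they might appear in an exported
# dada2 stats table. All sample IDs and numbers are invented.
stats = {
    "sample-A": {"input": 10000, "merged": 6500},
    "sample-B": {"input": 8000, "merged": 4000},
    "sample-C": {"input": 2000, "merged": 0},
}


def percent_retained(stats, kept_column="merged", total_column="input"):
    """Return the percentage of input reads that survive to kept_column."""
    total = sum(row[total_column] for row in stats.values())
    kept = sum(row[kept_column] for row in stats.values())
    if total == 0:
        return 0.0
    return 100.0 * kept / total


overall = percent_retained(stats)
print(f"{overall:.2f}% of input reads were merged")

# Per-sample percentages make it easy to spot samples that lost everything.
for sample_id, row in stats.items():
    pct = 100.0 * row["merged"] / row["input"] if row["input"] else 0.0
    print(f"{sample_id}: {pct:.2f}%")
```

The overall figure is the one to compare against a number like the 61.57% quoted from the paper; the per-sample figures show whether the loss is spread evenly or concentrated in a few samples.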

There is also a dada2 tutorial that explains --p-max-ee and other parameters.
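
For a concrete feel for the "expected errors" idea: as far as I understand the DADA2 documentation, the expected errors of a read are the sum of the per-base error probabilities implied by its Phred scores, EE = sum(10^(-Q/10)), and reads whose EE exceeds the threshold are discarded. A minimal Python sketch of that arithmetic follows. The quality scores and the threshold of 2.0 are example values, and `expected_errors` and `passes_max_ee` are hypothetical helper names, not functions from DADA2 or QIIME 2.

```python
def expected_errors(quals):
    """Sum of the per-base error probabilities implied by Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def passes_max_ee(quals, max_ee):
    """True if a read would be kept under a maximum expected errors cutoff."""
    return expected_errors(quals) <= max_ee


# Every base at Q40: 250 * 0.0001 = 0.025 expected errors.
good_read = [40] * 250

# 200 bases at Q40 plus 50 bases at Q10: 0.02 + 5.0 = 5.02 expected errors.
noisy_read = [40] * 200 + [10] * 50

for name, read in [("good_read", good_read), ("noisy_read", noisy_read)]:
    ee = expected_errors(read)
    verdict = "kept" if passes_max_ee(read, max_ee=2.0) else "discarded"
    print(f"{name}: EE = {ee:.3f}, {verdict}")
```

The point of the example is that a handful of low-quality bases dominates the sum, so a read with a poor tail can fail the filter even when most of its bases are excellent.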

By "failed", do you mean that it is raising an error? If so, please report that error in a separate topic to get support for it (that is a technical support question, not a user support question).

I hope that helps!

(Shuqi Li) #5

Hi Nicholas,

Thank you for your kind response!

I’ve looked into the q2-quality-filter plugin. Should I use “--p-max-ambiguous 3” or “--p-quality-window 3” to accomplish this?

I also tried other ways to denoise the raw reads and generated a FeatureTable by following the steps below:

Using the q2-quality-filter plugin:

$ qiime quality-filter q-score \
  --i-demux demux.qza \
  --p-min-quality 19 \
  --p-quality-window 3 \
  --output-dir qscore-plugin
$ qiime vsearch dereplicate-sequences \
  --i-sequences filtered_sequences.qza \
  --o-dereplicated-table qscore-table.qza
$ qiime feature-table summarize \
  --i-table qscore-plugin/qscore-table.qza \
  --o-visualization qscore-plugin/qscore-table.qzv \
  --m-sample-metadata-file metadata.tsv

The [FeatureTable] visualization qscore-table.qzv showed that 11,600,048 sequences were retained (60.46% of total), and all 4 categories in the ‘location’ column of the metadata retained a portion of their samples. This was the highest proportion among the methods I tried (quality-filter, dada2 and deblur).

Using the q2-dada2 plugin:

$ qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trunc-len-f 0 \
  --p-trunc-len-r 0 \
  --p-trunc-q 19 \
  --o-denoising-stats q19-stats.qza \
  --o-representative-sequences q19-rep-seqs.qza \
  --o-table q19-table.qza
$ qiime feature-table summarize \
  --i-table q19-table.qza \
  --o-visualization q19-table.qzv \
  --m-sample-metadata-file metadata.tsv

The [FeatureTable] visualization q19-table.qzv showed that 22,515 sequences were retained (0.12% of total). Although all 4 categories in the ‘location’ column of the metadata still retained a portion of their samples, some samples have a frequency count of zero, which seems very strange to me.

My question is: what is the difference between ‘--p-min-quality’ in the q2-quality-filter plugin and ‘--p-trunc-q’ in the q2-dada2 plugin? Why do they give such different FeatureTables?

In addition, I tried denoising demux.qza by following the steps shown in the QIIME 2 2018.8 tutorial.
In the demux.qzv quality plots, the quality of the initial bases seems to be high after position 13, so I trimmed 13 bases from the beginning of the sequences. The quality seems to drop off around position 256, so I truncated the sequences at 256 bases.

$ qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 13 \
  --p-trim-left-r 13 \
  --p-trunc-len-f 256 \
  --p-trunc-len-r 256 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza
$ qiime feature-table summarize \
  --i-table table-dada2.qza \
  --o-visualization table-dada2.qzv \
  --m-sample-metadata-file sample-metadata.tsv

The [FeatureTable] visualization table-dada2.qzv showed that 2,914,846 sequences were retained (15.19% of total). This time, not all 4 categories in the ‘location’ column of the metadata were retained; only 2 remained.

Could you explain why this is, please?

Looking forward to your reply.

Many thanks,


(Nicholas Bokulich) #6


The qscore method is not a denoising method, and is NOT a replacement for dada2, deblur, or OTU picking. You should use one of those methods after the qscore method.

You will need to review the dada2 stats file output by that command, but it sounds like your paired-end reads are not long enough to join. Check the stats file to confirm. If that is the case, you have two options:

  1. Use a lower setting for trunc-q than 19.
  2. Use only the forward reads and proceed as if they were single-end reads.
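
To see why truncation can stop reads from joining, it can help to do the overlap arithmetic by hand. Here is a rough Python sketch under stated assumptions: the amplicon length and read lengths are invented, and the minimum overlap of 12 bases is my assumption for the merging step, so check the documentation for the version you are running. Both function names are hypothetical.

```python
def overlap_after_truncation(fwd_len, rev_len, amplicon_len):
    """Bases shared by the forward and reverse reads after truncation.

    A negative value means there is a gap between the two reads.
    """
    return fwd_len + rev_len - amplicon_len


def can_join(fwd_len, rev_len, amplicon_len, min_overlap=12):
    """True if the reads overlap by at least min_overlap bases (assumed value)."""
    overlap = overlap_after_truncation(fwd_len, rev_len, amplicon_len)
    return overlap >= min_overlap


AMPLICON_LEN = 450  # hypothetical amplicon length

# Reads kept at 256 bases each: 256 + 256 - 450 = 62 bases of overlap.
print(can_join(256, 256, AMPLICON_LEN))

# Reads cut short by quality truncation: 200 + 180 - 450 = -70, so no overlap.
print(can_join(200, 180, AMPLICON_LEN))
```

A strict quality cutoff shortens reads unpredictably, which is why lowering it (option 1) or dropping the reverse reads (option 2) can rescue the data.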

--p-min-quality and --p-trunc-q are effectively the same parameter, but dada2 truncates at the first base < Q, whereas qscore truncates wherever there are 3 low-quality bases in a row. In addition, dada2 performs further denoising/filtering that may remove more sequences.
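
The difference between the two truncation rules is easier to see on a toy read. The Python sketch below is a simplified illustration only: both functions are hypothetical, the quality scores are invented, and the real tools may differ in details such as whether the comparison is strict or inclusive and whether the run must reach or exceed the window length.

```python
def truncate_at_first_low(quals, threshold):
    """Length kept when cutting at the first score below threshold."""
    for i, q in enumerate(quals):
        if q < threshold:
            return i
    return len(quals)


def truncate_at_low_run(quals, threshold, run_length=3):
    """Length kept when cutting at the start of the first run of
    run_length consecutive scores below threshold."""
    run = 0
    for i, q in enumerate(quals):
        if q < threshold:
            run += 1
            if run == run_length:
                return i - run_length + 1
        else:
            run = 0  # an acceptable base interrupts the run
    return len(quals)


# Invented quality scores: one isolated low base, then three in a row.
quals = [38, 37, 12, 36, 35, 15, 14, 13, 30]

print(truncate_at_first_low(quals, threshold=19))
print(truncate_at_low_run(quals, threshold=19, run_length=3))
```

On this read the first rule stops at the isolated low base, while the run-based rule ignores it and only cuts where the three consecutive low bases begin, so the same threshold leaves a noticeably shorter read under the first rule.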

The other categories must have failed to join and there are 0 sequences remaining for those categories.

I hope that helps.

(Shuqi Li) #7

Thank you so much, Nicholas! It took me a long time to respond because I am so far behind in this area, but your suggestions definitely helped a lot! Thanks again!