Several questions about qiime dada2 denoise-paired (about 700 words)

Dear all,

I am using QIIME 2 to analyze 16S data from intestinal flora. The primers used are 341F (ACTCCTACGGGAGGCAGCA) and 806R (GGACTACHVGGGTWTCTAAT). I have run into some problems while processing the data, mainly at the denoising step. The relevant code is as follows:

import data

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest.txt \
  --output-path demux.qza \
  --input-format PairedEndFastqManifestPhred33

visualization

qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv

cutadapt

time qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-cores 10 \
  --p-front-f 'ACTCCTACGGGAGGCAGCA' \
  --p-front-r 'GGACTACHVGGGTWTCTAAT' \
  --p-discard-untrimmed \
  --o-trimmed-sequences trimmed-demux.qza

visualization

qiime demux summarize \
  --i-data trimmed-demux.qza \
  --o-visualization trimmed-demux.qzv

denoise

time qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed-demux.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 220 \
  --p-trunc-len-r 220 \
  --p-min-fold-parent-over-abundance 8 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza \
  --p-n-threads 0

visualization for feature-table

qiime feature-table summarize \
  --i-table table.qza \
  --o-visualization table.qzv

non-singleton

qiime feature-table filter-features \
  --i-table table.qza \
  --p-min-frequency 2 \
  --o-filtered-table table_filtered.qza

visualization

qiime feature-table summarize \
  --i-table table_filtered.qza \
  --o-visualization table_filtered.qzv

Questions are as follows:
First, I want to know how to choose the best cut-off position when using “qiime dada2 denoise-paired”, especially when the quality of the sequence is generally good at all truncation positions. It seems that the parameters --p-trunc-len-f and --p-trunc-len-r have a strong influence on the final number of features and their frequency.

Details about my test for this question are as follows:
The quality plot in trimmed-demux.qzv shows that the quality of the sequence is generally good at all truncation positions.


So when denoising the data with “qiime dada2 denoise-paired”, I set --p-trunc-len-f and --p-trunc-len-r to 200 and 200, respectively. I then obtained “table.qzv” after visualization, and “table_filtered.qzv” after running “qiime feature-table filter-features”. The results are as follows:


I lost half of the total number of features after running “qiime feature-table filter-features” with --p-min-frequency set to 2. Besides, I am not sure whether 66567 features is enough. So, I tried many parameters by changing the value of --p-trunc-len-f and --p-trunc-len-r. The results are as follows:

It seems that the number of features and the frequency are well balanced only when the forward and reverse truncation lengths sum to 442, with a small --p-trunc-len-f and a large --p-trunc-len-r. The results also show that even when the overall quality of the sequence is good, different truncation lengths lead to large differences, and that keeping longer reads does not necessarily give more features. How could this happen? And how to ensure the best cut-off position?

Second, I want to know the difference between handling all samples at once and only one sample at a time, and which method should I use in different situations. The following details may help you better understand my question.

For the former method, since I have 84 samples, I create a single manifest.txt containing the information for all samples when importing the data, and the rest of the process runs as shown above. For the latter method, I create 84 manifest.txt files, each containing one sample’s information. Running the code above on each one gives me 84 table.qza files, which I then merge with “qiime feature-table merge” and visualize.

I then found a large difference between these two methods. When handling one sample at a time, the final number of features and the total frequency are much larger than when handling all samples at once. I know that “qiime dada2 denoise-paired” builds an error model from all the samples it is given, which leads to the difference, but which method is better? How should I evaluate them?

Third, when I ran “qiime dada2 denoise-paired --help”, I found two interesting parameters: “--p-pooling-method” and “--p-chimera-method”.

I wonder how much the choice of method (independent or pseudo, consensus or pooled) affects the results?

I would appreciate it if anyone could help me with these questions. We can discuss them together!

Hello Jed,

Welcome to the forums!

Before I dive in, I want to suggest that specific, short questions usually get answered faster, because more people read them and it's easier to answer one question than many.

I really liked this part of your analysis:

So, I tried many parameters by changing the value of --p-trunc-len-f and --p-trunc-len-r. The results are as follows:
...
And how to ensure the best cut-off position?

Your method is perfect: try a parameter sweep and pick the setting that works best!
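A sweep like this can be scripted rather than run by hand. The following is only a sketch: it reuses the flags and the trimmed-demux.qza input from the original post, while the `run`, `denoise_one` and `sweep` helpers, the `sweep/` output layout and the candidate lengths are made up for illustration. By default it only prints the commands; set DRY_RUN=0 to actually run them, and add any other options you normally use (for example --p-min-fold-parent-over-abundance).

```shell
#!/bin/sh
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Denoise once for a given forward ($1) and reverse ($2) truncation length.
# The sweep/ output layout is hypothetical.
denoise_one() {
  prefix="sweep/f$1-r$2"
  run qiime dada2 denoise-paired \
    --i-demultiplexed-seqs trimmed-demux.qza \
    --p-trunc-len-f "$1" \
    --p-trunc-len-r "$2" \
    --p-n-threads 0 \
    --o-table "$prefix-table.qza" \
    --o-representative-sequences "$prefix-rep-seqs.qza" \
    --o-denoising-stats "$prefix-stats.qza"
}

# Try every combination of two space-separated lists of candidate lengths.
sweep() {
  run mkdir -p sweep
  for f in $1; do
    for r in $2; do
      denoise_one "$f" "$r"
    done
  done
}

# Example candidate lengths (placeholders, adjust to your data).
sweep "200 220 240" "200 220 240"
```

Each run writes its own denoising stats, so the settings can be compared on how many reads survive filtering, merging and chimera removal, not only on the feature count.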

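DADA2 first truncates the reads, then denoises the forward and reverse reads across the full run, then merges each pair, so the choice of truncation settings is very powerful, as you have seen. Pairs that no longer overlap enough after truncation cannot be merged and are dropped.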
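One reason truncation lengths matter so much is that the forward and reverse reads must still overlap after truncation in order to be merged. A rough check is: overlap = forward length + reverse length - amplicon length. The sketch below assumes a primer-trimmed amplicon of 430 nt, which is a placeholder and not a measured value for this dataset, and uses the 12 nt minimum overlap that DADA2's mergePairs applies by default. The `overlap` and `can_merge` helpers are hypothetical names.

```shell
#!/bin/sh
# Overlap left for merging: forward length + reverse length - amplicon length.
overlap() {
  echo $(( $1 + $2 - $3 ))
}

# Succeeds if the overlap reaches the minimum ($4, default 12 nt).
can_merge() {
  [ "$(overlap "$1" "$2" "$3")" -ge "${4:-12}" ]
}

AMPLICON_LEN=430   # assumption: replace with the length observed in your own data

for pair in "200 200" "220 220" "200 242"; do
  set -- $pair     # split "f r" into $1 and $2
  if can_merge "$1" "$2" "$AMPLICON_LEN"; then
    echo "f=$1 r=$2: overlap $(overlap "$1" "$2" "$AMPLICON_LEN") nt, should merge"
  else
    echo "f=$1 r=$2: overlap $(overlap "$1" "$2" "$AMPLICON_LEN") nt, too short to merge"
  fi
done
```

Amplicon length varies between taxa, so in practice it is safer to leave some margin above the minimum than to aim for it exactly.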

Second, I want to know the difference between handling all samples at once and only one sample at a time, and which method should I use in different situations.

It is recommended to run DADA2 per sequencing run, though results should be stable as long as the exact same settings are used. See DADA2 workflow for big data and this answer by Justine.
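As a sketch of the per-run approach: denoise each sequencing run with identical settings, then merge the tables and representative sequences. The run1/ and run2/ directories, the helper names and the 220/220 truncation values are placeholders; the merge commands are “qiime feature-table merge” and “qiime feature-table merge-seqs”. By default the script only prints the commands; set DRY_RUN=0 to run them.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# The same settings must be used for every run (values here are placeholders).
TRUNC_F=220
TRUNC_R=220

# Denoise one sequencing run stored in directory $1 (hypothetical layout).
denoise_run() {
  run qiime dada2 denoise-paired \
    --i-demultiplexed-seqs "$1/trimmed-demux.qza" \
    --p-trunc-len-f "$TRUNC_F" \
    --p-trunc-len-r "$TRUNC_R" \
    --o-table "$1/table.qza" \
    --o-representative-sequences "$1/rep-seqs.qza" \
    --o-denoising-stats "$1/denoising-stats.qza"
}

# Merge the per-run outputs. Directory names must not contain spaces,
# because the argument lists are built as plain strings.
merge_runs() {
  tables=""
  seqs=""
  for d in "$@"; do
    tables="$tables --i-tables $d/table.qza"
    seqs="$seqs --i-data $d/rep-seqs.qza"
  done
  run qiime feature-table merge $tables --o-merged-table merged-table.qza
  run qiime feature-table merge-seqs $seqs --o-merged-data merged-rep-seqs.qza
}

for d in run1 run2; do
  denoise_run "$d"
done
merge_runs run1 run2
```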

This would be a little surprising if all the same settings were used! If I were investigating this, I would disable chimera checking (--p-chimera-method 'none') and see if the results are consistent before this step.
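A possible way to run that check, as a sketch only: the truncation lengths are copied from the command in the original post, and the helper name and output prefix are made up. It turns chimera checking off and tabulates the denoising stats so the two workflows can be compared step by step. By default it only prints the commands; set DRY_RUN=0 to run them.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Denoise input $1 with chimera checking off; outputs use prefix $2 (hypothetical).
denoise_without_chimera_check() {
  run qiime dada2 denoise-paired \
    --i-demultiplexed-seqs "$1" \
    --p-trunc-len-f 220 \
    --p-trunc-len-r 220 \
    --p-chimera-method none \
    --p-n-threads 0 \
    --o-table "$2-table.qza" \
    --o-representative-sequences "$2-rep-seqs.qza" \
    --o-denoising-stats "$2-stats.qza"
  # Turn the per-sample read counts at each step into a viewable table.
  run qiime metadata tabulate \
    --m-input-file "$2-stats.qza" \
    --o-visualization "$2-stats.qzv"
}

denoise_without_chimera_check trimmed-demux.qza all-samples-nochim
```

Running the same function on a single-sample artifact and comparing the two stats tables should show whether the difference appears before or after chimera removal.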

Let us know what you try next!
