Dear all,
I am using QIIME 2 to analyze 16S data from intestinal flora. The primers used are 341F (ACTCCTACGGGAGGCAGCA) and 806R (GGACTACHVGGGTWTCTAAT). I have run into some problems while processing the data, mainly in the denoising step. The relevant code is as follows:
# import data
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path manifest.txt \
  --output-path demux.qza \
  --input-format PairedEndFastqManifestPhred33
# visualization
qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux.qzv
# cutadapt (remove primers; discard read pairs in which no primer was found)
time qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-cores 10 \
  --p-front-f 'ACTCCTACGGGAGGCAGCA' \
  --p-front-r 'GGACTACHVGGGTWTCTAAT' \
  --p-discard-untrimmed \
  --o-trimmed-sequences trimmed-demux.qza
# visualization
qiime demux summarize \
  --i-data trimmed-demux.qza \
  --o-visualization trimmed-demux.qzv
# denoise
time qiime dada2 denoise-paired \
  --i-demultiplexed-seqs trimmed-demux.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-f 220 \
  --p-trunc-len-r 220 \
  --p-min-fold-parent-over-abundance 8 \
  --p-n-threads 0 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza
# visualization for feature-table
qiime feature-table summarize \
  --i-table table.qza \
  --o-visualization table.qzv
# non-singleton (keep only features with a total frequency of at least 2)
qiime feature-table filter-features \
  --i-table table.qza \
  --p-min-frequency 2 \
  --o-filtered-table table_filtered.qza
# visualization
qiime feature-table summarize \
  --i-table table_filtered.qza \
  --o-visualization table_filtered.qzv
My questions are as follows:
First, I want to know how to choose the best truncation positions for “qiime dada2 denoise-paired”, especially when the sequence quality is generally good at every candidate position. The parameters --p-trunc-len-f and --p-trunc-len-r seem to have a strong influence on the final number of features and on the total frequency.
Details about my test for this question are as follows:
The quality plot in trimmed-demux.qzv shows that the sequence quality is generally good at all truncation positions.
So when denoising the data with “qiime dada2 denoise-paired”, I set --p-trunc-len-f and --p-trunc-len-r to 200 and 200, respectively. I then visualized the output as “table.qzv” and, after running “qiime feature-table filter-features”, as “table_filtered.qzv”. The results are as follows:
I lost half of the total number of features after running “qiime feature-table filter-features” with --p-min-frequency set to 2. I also doubt whether 66567 features is enough. So I tried many combinations of --p-trunc-len-f and --p-trunc-len-r. The results are as follows:
It seems that the number of features and the frequency are well balanced only when the truncated forward and reverse read lengths sum to 442, with a small --p-trunc-len-f and a large --p-trunc-len-r. The results also show that, even when the overall sequence quality is good, different truncation lengths lead to large differences, and that retaining longer reads does not necessarily yield more features. Why does this happen? And how should I choose the best truncation positions?
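One thing I plan to check is whether the truncated reads can still overlap at all. Below is a minimal sketch of that arithmetic, under stated assumptions: reads can only be merged if the two truncation lengths together cover the primer-trimmed amplicon plus a minimum overlap. The 12-nt minimum overlap and the 426-nt amplicon length are illustrative assumptions, not values measured from my data, and both helper functions are hypothetical names I made up for this example.

```python
def overlap_after_truncation(trunc_len_f, trunc_len_r, amplicon_len):
    """Return how many bases the truncated forward and reverse reads share.

    A negative value means the two reads do not reach each other.
    """
    return trunc_len_f + trunc_len_r - amplicon_len


def can_merge(trunc_len_f, trunc_len_r, amplicon_len, min_overlap=12):
    """True if the truncated reads overlap by at least `min_overlap` bases.

    `min_overlap=12` is an assumed default for illustration.
    """
    overlap = overlap_after_truncation(trunc_len_f, trunc_len_r, amplicon_len)
    return overlap >= min_overlap


# Assumed primer-trimmed amplicon length, for illustration only.
ASSUMED_AMPLICON_LEN = 426

for f_len, r_len in [(200, 200), (220, 220), (200, 242)]:
    overlap = overlap_after_truncation(f_len, r_len, ASSUMED_AMPLICON_LEN)
    ok = can_merge(f_len, r_len, ASSUMED_AMPLICON_LEN)
    print(f"f={f_len} r={r_len} overlap={overlap} mergeable={ok}")
```

If this reasoning is right, amplicons of varying length would be affected differently by the same truncation pair, which might be part of what I am seeing, but I have not confirmed this.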
Second, I want to know the difference between processing all samples at once and processing one sample at a time, and which approach I should use in which situation. The following details may help explain my question.
For the former approach, since I have 84 samples, I create a single manifest.txt containing the information for all samples when importing the data, and the rest of the workflow runs as shown above. For the latter approach, I create 84 manifest.txt files at the beginning, each containing the information for one sample. Running the code above on each of them gives me 84 table.qza files, which I then merge with “qiime feature-table merge” and visualize.
I got a large difference between these two approaches. When processing one sample at a time, the final number of features and the frequency are much larger than when processing all samples at once. I know that “qiime dada2 denoise-paired” builds an error model from all of the samples it is given, which leads to the difference, but which approach is better? How should I evaluate them?
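For reference, the per-sample splitting step can be sketched roughly as follows. This is only an illustration under assumptions: it assumes manifest.txt is comma-separated with a header row containing a `sample-id` column and has no comment lines; `split_manifest` and the output file naming are hypothetical and not part of QIIME 2.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_manifest(manifest_path, out_dir):
    """Write one manifest per sample and return {sample_id: output_path}.

    Assumes a comma-separated manifest whose header includes 'sample-id'.
    """
    manifest_path = Path(manifest_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group the manifest rows (forward and reverse entries) by sample.
    rows_by_sample = defaultdict(list)
    with manifest_path.open(newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        for row in reader:
            rows_by_sample[row["sample-id"]].append(row)

    # Write each group to its own file, repeating the original header.
    written = {}
    for sample_id, rows in rows_by_sample.items():
        out_path = out_dir / f"{sample_id}_manifest.txt"
        with out_path.open("w", newline="") as handle:
            writer = csv.DictWriter(
                handle, fieldnames=fieldnames, lineterminator="\n"
            )
            writer.writeheader()
            writer.writerows(rows)
        written[sample_id] = out_path
    return written
```

Each resulting file can then be passed to the import command in place of manifest.txt.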
Third, when I ran “qiime dada2 denoise-paired --help”, I found two interesting parameters: “--p-pooling-method” and “--p-chimera-method”.
I wonder how much the choice of method influences the results (independent or pseudo for pooling, consensus or pooled for chimera detection).
I would appreciate it if anyone could help me with these questions. I am happy to discuss them together!