I used dada2 to denoise my paired-end reads in qiime2-2019.7 and in R (dada2 version 1.12.1), and got 3533 and 2535 features in the raw ASV tables, respectively. The parameters I changed in R were pool=TRUE for sample inference and method="pooled" for chimera removal. I was expecting to see more features in the ASV table denoised with pool=TRUE, since singleton reads are retained when pooling. While there's a big difference in the number of features, the total numbers of sequences after denoising are quite similar: 7,717,544 and 7,540,610, respectively.
I noticed that dada2 in R uses 1e8 bases for error learning, whereas qiime2-2019.7 uses 1 million reads, but I doubt this is the cause. Could someone explain why I'm seeing such a big difference in the number of unique features between qiime2 and R?
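For reference, the pooled workflow in R was along these lines (a simplified sketch with placeholder file/object names, not my exact script):

library(dada2)

# filtFs/filtRs are the quality-filtered forward/reverse fastq files
# Error learning: learnErrors() samples up to 1e8 bases by default in R,
# while q2-dada2 learns from 1 million reads (--p-n-reads-learn)
errF <- learnErrors(filtFs, nbases = 1e8, multithread = TRUE)
errR <- learnErrors(filtRs, nbases = 1e8, multithread = TRUE)

derepFs <- derepFastq(filtFs)
derepRs <- derepFastq(filtRs)

# Pooled sample inference retains singletons that per-sample inference discards
dadaFs <- dada(derepFs, err = errF, pool = TRUE, multithread = TRUE)
dadaRs <- dada(derepRs, err = errR, pool = TRUE, multithread = TRUE)

mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs)
seqtab <- makeSequenceTable(mergers)

# Chimera removal across the pooled table
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "pooled", multithread = TRUE)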
This is a great question, and one I’m not super confident in answering, but something that might be informative is the respective denoising stats from both analyses. This will tell us if many features are removed at chimera checking or somewhere else.
A naive guess would be that by pooling your data, the more uncertain and “unique” ASVs were collapsed together, since they were identified as part of the same expected distribution across samples. But it does seem very odd that the process removed 1000 features (it may just be that chimera checking works better after pooling, but we’d need the denoising report/table to know more).
#Summarize sequence data and visualize read quality
qiime demux summarize \
--i-data data/qiime2/demux.qza \
--o-visualization data/qiime2/demux.qzv
#Sequence quality control and feature table construction based on sequence quality
qiime dada2 denoise-paired \
--i-demultiplexed-seqs data/qiime2/demux.qza \
--p-trim-left-f 20 \
--p-trim-left-r 18 \
--p-trunc-len-f 290 \
--p-trunc-len-r 248 \
--p-n-threads 16 \
--o-table data/qiime2/table.qza \
--o-representative-sequences data/qiime2/rep-seqs.qza \
--o-denoising-stats data/qiime2/stats.qza \
--verbose
#Quality check: number of sequences kept at each step of the dada2 pipeline
qiime metadata tabulate \
--m-input-file data/qiime2/stats.qza \
--o-visualization data/qiime2/stats.qzv
#Overview of raw feature-table
qiime feature-table summarize \
--i-table data/qiime2/table.qza \
--m-sample-metadata-file data/metadata.tsv \
--o-visualization data/qiime2/table.qzv
The RMarkdown report is wonderful. Looking at the specifics, I don’t see anything terribly different between the two report tables; we’re up or down a few hundred or thousand reads, and neither analysis stands out overall.
One thing that does catch my eye:
In the R report, I see that the dimensions of the table are as follows:
## [1] 80 7590
Which is pretty different from the 2,535 we see after importing. Is it at all possible that something has gone wrong in between these steps? Not that the difference between 3,533 and 7,590 is any easier to explain, but something seems strange here.
Hi @ebolyen, I also noted the exceptionally high number of unique features before chimera removal. I reran the analyses and found that the number of unique features dropped from 7590 to 2535 during chimera removal. However, I didn't see a significant drop in the number of sequences per sample (5% on average).
dim(seqtab)
[1] 80 7590
dim(seqtab.nochim)
[1] 80 2535
For the amplicon sizes, I did have the same doubt. The observed amplicon sizes are within the expected size ranges based on the in silico evaluation.
Hi @yanxianl,
I just wanted to throw in 2 quick comments. 1) I really enjoyed your rmarkdown file, I wish everyone used and published these! I am working on a comprehensive version of one atm which I think you might enjoy, and I’ll share it when it is complete. 2) If I had to guess, I would think the main reason for the unexpectedly low number of features is that with pooled chimera removal DADA2 can detect chimeras more accurately, and so is appropriately removing more features that really are chimeras, compared to the q2-dada2 default, which does not pool. If you have doubts, you can always BLAST the removed sequences and see what they actually were, or try increasing minFoldParentOverAbundance to higher values, starting at, say, 8. That will likely retain more of those features.
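To make the second suggestion concrete, something along these lines should work (just a sketch; seqtab/seqtab.nochim refer to the objects in your output, and 8 is only a starting point):

# Relax the parent-abundance criterion so fewer borderline ASVs are called chimeric
seqtab.nochim8 <- removeBimeraDenovo(seqtab, method = "pooled",
                                     minFoldParentOverAbundance = 8,
                                     multithread = TRUE)

# Pull out the sequences that were flagged as chimeric so they can be BLASTed
flagged <- setdiff(colnames(seqtab), colnames(seqtab.nochim))
uniquesToFasta(getUniques(seqtab)[flagged], fout = "flagged_chimeras.fasta")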
I agree with @Mehrbod_Estaki that the difference here is probably driven largely by the chimera removal method, and is probably playing out mostly amongst low frequency sequences.
Pooled sample inference and pooled chimera removal versus independent sample inference and consensus chimera removal is a significant shift. It won’t (in my testing) change much in terms of gross community composition, but it has differences especially amongst rare and infrequent taxa.
I tried to replicate the qiime2-dada2 analyses in R by setting dada(pool=FALSE, selfConsist = TRUE, ...) and removeBimeraDenovo(method="consensus", ...), and I got a similar number of unique features (3642).
dim(seqtab)
[1] 80 6470
dim(seqtab.nochim)
[1] 80 3642
sum(seqtab.nochim)/sum(seqtab)
[1] 0.975
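The calls I used to mimic the q2-dada2 defaults looked roughly like this (a sketch; errF/errR and the derep objects are the same as in the pooled run above):

# Independent (per-sample) inference with self-consistent error refinement
dadaFs.ind <- dada(derepFs, err = errF, selfConsist = TRUE, pool = FALSE, multithread = TRUE)
dadaRs.ind <- dada(derepRs, err = errR, selfConsist = TRUE, pool = FALSE, multithread = TRUE)

mergers.ind <- mergePairs(dadaFs.ind, derepFs, dadaRs.ind, derepRs)
seqtab <- makeSequenceTable(mergers.ind)

# Consensus chimera removal, as in the q2-dada2 default
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE)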
Indeed, setting pool=TRUE resulted in more features, but a much higher proportion of these features was removed during chimera detection under the pooled mode. That's why I ended up with fewer unique features.