Reproduce QIIME2 DADA2 results in DADA2 R

Lou · November 5, 2020, 10:54pm

Hi everyone,

I am trying to reproduce a result in R that I received from running DADA2 in QIIME2. These are the differences in the number of sequences that I observe (in one sample as an example):

R vs. QIIME2
After quality filtering: 675425 vs. 675425 reads (identical)
After denoising: 674298 vs. 674029 reads
After merging: 665461 vs. 665128 reads

I know that these differences are really small, but I am interested in the mechanism behind these differences. I already tried to adjust nbases in R learnErrors() according to QIIME2 default (1e+06) and added set.seed(100) before learnErrors().

I was wondering if DADA2 in QIIME2 uses something like set.seed() and if yes, with which number?
Can these differences be due to different DADA2 versions? I am using QIIME2 2020.2 (via VirtualBox on a Windows machine) and DADA2 1.16.0 in R 4.0.2.
I read here (DADA2 denoise-single chimera removal inconsistence - #21 by benjjneb) that there may be differences in DADA2 results depending on sample order, but the sequences I am using in QIIME2 and R are coming from the same demultiplexed folder, so I don't see how this could be a problem here.

Does anyone have any other idea where these small differences are coming from and how to get rid of them?

This is the DADA2 command I used in QIIME2 if it's helping:
qiime dada2 denoise-paired
--i-demultiplexed-seqs demux-full.qza
--p-trim-left-f 0
--p-trim-left-r 1
--p-trunc-len-f 150
--p-trunc-len-r 150
--p-chimera-method none
--o-table table.qza
--o-representative-sequences rep-seqs.qza
--o-denoising-stats denoising-stats.qza

Thank you!

benjjneb · November 5, 2020, 11:16pm

IMO the most likely cause of this is the difference in package versions.

Although there have been no major changes in the normal operation of the core denoising algorithm on Illumina data between 1.10 and 1.16, there have been some changes on the margins (e.g. fixing an edge-case kmer counting bug) that can lead to the kind of small differences you are seeing.

One thing to try, if you are comfortable working in R, is to do the R analysis with version 1.10 as is used by Q2. This can be straightforwardly installed in R with

library(devtools)
devtools::install_github("benjjneb/dada2", ref="v1.10")

Lou · November 6, 2020, 1:09pm

Thanks a lot for your quick reply @benjjneb !

I tried your suggestion and used DADA2 1.10 in R, and now I am getting a third set of values:

R DADA 1.10 vs. R DADA 1.16 vs. QIIME2 2020.2
After quality filtering: 675425 vs. 675425 vs. 675425 reads (still identical)
After denoising: 673919 vs. 674298 vs. 674029 reads
After merging: 665028 vs. 665461 vs. 665128 reads

I also tried to run the exact same thing in QIIME2 again, and received the exact same results as before in QIIME2, so the DADA2 implementation in QIIME2 is totally reproducible.

Is it possible that there is still some randomness in the DADA2 process in R that I am not covering with set.seed()? Is QIIME2 using a specific seed in DADA2 that is different from set.seed(100), which you are using in your tutorials?

thermokarst · November 6, 2020, 2:23pm

Perhaps for the sake of comparison you should do that "pure R" run in your QIIME 2 environment. This will ensure you are running the exact same dependency versions in the pure R DADA2 vs q2-dada2.

No need to install anything special in the QIIME 2 environment - everything is there already.

:qiime2:

Lou · November 6, 2020, 9:10pm

Thank you for suggesting this idea @thermokarst! I tried it and it works. So with DADA2 version 1.10 in R in the virtual machine I am receiving exactly the same results as with DADA2 in QIIME2 (2020.2) in the same virtual machine.
Thanks a lot to both of you!

system · December 8, 2020, 3:10am

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.