Long analysis time

Nir_Friedman · September 12, 2019, 9:51am

Hi,
I have analyzed 18 samples (36 fastq files) and used the following command:
qiime dada2 denoise-paired --i-demultiplexed-seqs /home/qiime2/Desktop/xxxx.qza --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 240 --p-trunc-len-r 240 --p-n-threads 0 --o-representative-sequences xxx_rep-seqs.qza --o-table 6_xxx_table.qza --o-denoising-stats 6_xxx_dada2.qza --verbose
It took 8 days to get the outputs.
I have 2 questions:

1. Why did it take so long, and how can I analyze future samples with dada2 faster?
2. I analyzed the same set of samples using the CLC platform and got very different results. The biggest difference was the number of reads: with CLC I got an average of more than 100,000 reads, and with dada2 about 10,000. (The sequencing depth was very high, almost 300k forward reads and 300k reverse reads.)
Thanks for your answer,
Nir

Nicholas_Bokulich · September 12, 2019, 2:22pm

dada2 runtime is very difficult to predict: it depends on the number and length of reads and especially on the complexity of those reads, so datasets with very high diversity will take longer to run.

To decrease runtime:

- run in parallel (I see you are already using the --p-n-threads parameter)
- use the latest release of QIIME 2, which has the latest and most efficient version of dada2
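
As a rough sketch of these two points, the command from the original post can be rebuilt with an explicit thread count instead of 0 (a value of 0 tells dada2 to use all available cores). The snippet only prints the command as a dry run. The `build_cmd` helper is hypothetical, and `nproc` is assumed to be available (GNU coreutils), with a fallback of 1 thread. `qiime info` is used to report the installed QIIME 2 version.

```shell
# Hypothetical helper: print the denoise-paired command from the original
# post with the thread count given as $1 (dry run, nothing is executed).
build_cmd() {
  printf '%s ' qiime dada2 denoise-paired \
    --i-demultiplexed-seqs /home/qiime2/Desktop/xxxx.qza \
    --p-trim-left-f 0 --p-trim-left-r 0 \
    --p-trunc-len-f 240 --p-trunc-len-r 240 \
    --p-n-threads "$1" \
    --o-representative-sequences xxx_rep-seqs.qza \
    --o-table 6_xxx_table.qza \
    --o-denoising-stats 6_xxx_dada2.qza \
    --verbose
  echo
}

# Number of cores on this machine; fall back to 1 if nproc is missing.
threads="$(nproc 2>/dev/null || echo 1)"

# Review the printed command, then paste it into the terminal to run it.
build_cmd "$threads"

# Show which QIIME 2 release is installed, if qiime is on the PATH.
if command -v qiime >/dev/null 2>&1; then
  qiime info
else
  echo "qiime not found on PATH"
fi
```

Setting the thread count explicitly changes nothing if 0 already resolved to every core, but it makes the value visible in the logs when comparing runs on different machines.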

Sounds like CLC uses OTU clustering, not denoising. I recommend reading more on this forum and especially in the original dada2 paper to understand the differences between OTU clustering and denoising, and why you are seeing much higher diversity with CLC. In a nutshell, dada2 does a much better job of weeding out and correcting sequencing errors. OTU clustering just lumps everything together, which a) leaves in spurious sequences, leading to a long tail of rare errors, and b) potentially reduces sensitivity to true biological variation.
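
To see at which stage reads are dropped, it may help to look at the denoising stats artifact written by the original command (6_xxx_dada2.qza), which reports per-sample read counts after each dada2 step. The sketch below first works out the fraction of reads retained from the figures quoted in the first post, then prints (as a dry run) a `qiime metadata tabulate` command for viewing the stats. The `pct_retained` helper and the output name 6_xxx_dada2.qzv are hypothetical.

```shell
# Hypothetical helper: percentage of reads retained, to one decimal place.
# $1 = reads going in, $2 = reads coming out
pct_retained() {
  awk -v in_reads="$1" -v out_reads="$2" \
    'BEGIN { printf "%.1f\n", 100 * out_reads / in_reads }'
}

# Figures from the first post: ~300k input reads, ~10k reads after dada2.
pct_retained 300000 10000

# Dry run: print the command that turns the stats artifact into a table
# that can be opened with "qiime tools view". The .qzv name is made up.
echo qiime metadata tabulate \
  --m-input-file 6_xxx_dada2.qza \
  --o-visualization 6_xxx_dada2.qzv
```

If most reads disappear at one particular step of that table, that step is the first place to look when comparing against the CLC read counts.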

Nir_Friedman · September 13, 2019, 9:33am

Dear Nicholas,
Thanks for your reply.
Regarding dada2, is there another way to run the analysis in parallel other than setting --p-n-threads 0?
Regarding OTU clustering using CLC or QIIME 1, my problem is not only the number of reads retained by dada2 (about 10% of the number retained by OTU clustering) but also their taxonomic assignments.
I also got huge differences at high taxonomic levels (phylum and class) between these methods while using the same files.
Is this common in your experience, or am I missing something?
P.S.
I just wanted to add that after "selfConsist step 10", for both forward and reverse reads, I got some kind of error message. I don't know if that helps.
Thanks,
Nir

colinbrislawn · September 13, 2019, 1:19pm

Hello Nir,

If you used the same input files but different taxonomy assignment methods or databases, this could dramatically change your results! However, if you use the same taxonomy assignment method and the same database with dada2 ASVs and QIIME 1 OTUs, the overall taxonomic composition should be very similar.

Every step in the pipeline matters, so keeping track of each step when comparing results will help you identify which step changes the results the most. Here, my intuition is that the taxonomy assignment in QIIME 1 and QIIME 2 differs more than OTUs differ from dada2 ASVs, but you would have to try it to see.
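
One way to try it, as a minimal sketch: classify both sets of representative sequences with the same classifier, so that only the OTU vs ASV step differs. The snippet prints the two `qiime feature-classifier classify-sklearn` commands as a dry run. All file names other than xxx_rep-seqs.qza are hypothetical: classifier.qza stands for a pre-trained classifier, and otu_rep-seqs.qza for OTU representative sequences already imported into QIIME 2. The `classify_cmd` helper is also made up for illustration.

```shell
# Hypothetical helper: print a classify-sklearn command (dry run).
# $1 = representative sequences artifact, $2 = output taxonomy artifact
classify_cmd() {
  printf '%s ' qiime feature-classifier classify-sklearn \
    --i-classifier classifier.qza \
    --i-reads "$1" \
    --o-classification "$2"
  echo
}

# Same classifier and database for both inputs: dada2 ASVs and OTUs.
for name in xxx_rep-seqs otu_rep-seqs; do
  classify_cmd "${name}.qza" "${name}_taxonomy.qza"
done
```

With the classifier held constant, any remaining phylum-level or class-level difference can be attributed to the clustering or denoising step rather than to taxonomy assignment.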

Colin