Dada2 very slow step

lmanchon · November 22, 2018, 6:13pm

Hi,

I started DADA2 two days ago on 40 cores and it is still running. The program appears to be frozen at this stage:
R version 3.4.1 (2017-06-30)
Loading required package: Rcpp
DADA2 R package version: 1.6.0

Filtering ................................................................................
Learning Error Rates
Not all sequences were the same length.
Not all sequences were the same length.
2a) Forward Reads
Initializing error rates to maximum possible estimate.
Sample 1 - 6451496 reads in 5919870 unique sequences.

I think I'll switch to Deblur: it's faster and, according to the posts I've read, the results are close.
What do you think?

thx

ebolyen · November 26, 2018, 10:12pm

Hi @lmanchon,

Did you adjust --p-n-reads-learn at all? The default is 1 million reads, which usually doesn't take that long, although the fact that your reads are variable length might have something to do with it.

Could you post the full command? How many samples and reads do you have?

lmanchon · November 27, 2018, 8:39am

Hi,

This is the full command I used:

qiime dada2 denoise-paired --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 0 --p-trunc-len-r 0 --i-demultiplexed-seqs demux-paired-end.qza --o-representative-sequences repset.qza --o-table table.qza --o-denoising-stats denoising-stats.qza --verbose --p-n-threads 40

I have 80 samples (paired-end 2x150 bp, trimmed). The libraries are not uniform after cleaning with cutadapt: I have between 5 and 60 million reads per library.

thanx

ebolyen · November 28, 2018, 6:46pm

Hi @lmanchon,

Thanks for the info. The parameters look good, since you've already cleaned the reads with cutadapt (I assume this means you've already trimmed the forward primers, since your trim-left params are 0?).

How many cores are actually available on the machine you are running this on? Do you have a 64-core machine? If you over-commit the number of threads, all you end up doing is slowing the process down, because the threads then have to compete for the same CPU cores.
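
A quick way to check this before launching is sketched below. It assumes a POSIX shell; `clamp_threads` is a hypothetical helper name, not part of QIIME 2, and 40 is simply the thread count from the command above.

```shell
# Hypothetical helper (not part of QIIME 2): cap a requested thread
# count at the number of processors the machine reports.
clamp_threads() {
  requested=$1
  available=$2
  if [ "$requested" -gt "$available" ]; then
    echo "$available"
  else
    echo "$requested"
  fi
}

# nproc ships with GNU coreutils; getconf is the fallback (e.g. on macOS).
# Both report logical processors, which can exceed the physical core count.
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
threads=$(clamp_threads 40 "$cores")
echo "processors available: $cores; value to pass to --p-n-threads: $threads"
```

If the reported count is at or above the requested 40, over-commitment is probably not the cause of the slowdown.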

lmanchon · November 29, 2018, 8:42am

Hi,

I have 72 cores (2 CPU sockets, 36 cores each).
Maybe I need to set --p-n-reads-learn (200000?)
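
As a sketch, that change applied to the command posted earlier would look like the following. The value 200000 is only the number floated here, not a tested recommendation, and `denoise_cmd` is a hypothetical wrapper that prints the command instead of running it, so it can be reviewed first.

```shell
# Hypothetical wrapper: print the denoise-paired command from earlier in
# the thread with --p-n-reads-learn added. It only echoes; nothing runs.
denoise_cmd() {
  n_reads_learn=$1
  echo qiime dada2 denoise-paired \
    --i-demultiplexed-seqs demux-paired-end.qza \
    --p-trim-left-f 0 --p-trim-left-r 0 \
    --p-trunc-len-f 0 --p-trunc-len-r 0 \
    --p-n-reads-learn "$n_reads_learn" \
    --p-n-threads 40 \
    --o-representative-sequences repset.qza \
    --o-table table.qza \
    --o-denoising-stats denoising-stats.qza \
    --verbose
}

denoise_cmd 200000
```

Piping the printed line to `sh` would launch the actual run.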

thanx

thermokarst · December 14, 2018, 2:30am

Hey there @lmanchon - did this ever finish?

lmanchon · December 14, 2018, 8:17am

Not yet, so I killed the process.
Would it be useful if I sent you a sample of my libraries so you can check what is happening? Maybe I need to adjust the p-trim and p-trunc parameters.
Tell me if I can give you a link to download my sample.

Thank you

Nicholas_Bokulich · December 18, 2018, 1:29am

Sorry for the delay; the QIIME 2 team is at back-to-back workshops these two weeks.

You can just post the QZV directly here — that will be the best way for us to assess your trimming parameters.

DADA2 is a little faster when run directly in R (though it should still have the same memory requirements). You could try that to speed things up if you need to.

lmanchon · December 18, 2018, 8:52am

Hi,

I know DADA2 is faster when run directly in R, but from R I can't generate the qza files I need to continue with the QIIME pipeline.
You can download one sample of my raw data from this link: FileSender
These files are not trimmed (150 bp paired-end, using Nextera XT transposase adapters). This sample is the smallest, with only 2 million reads. The biggest one has 80 million reads; see the attached qzv file.

Thank you

demux.qzv (284.8 KB)

Nicholas_Bokulich · December 18, 2018, 2:14pm

The quality looks good to me. You could probably get away with only trimming the final base off of each read.
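
In terms of the denoise-paired flags, and assuming every read is the full 150 bp, trimming one base corresponds to a truncation length of 149. This is only a sketch for untrimmed, fixed-length input: DADA2 discards reads shorter than the truncation length, so it would not suit reads already shortened by cutadapt.

```shell
read_len=150                 # read length reported in the thread
trunc_len=$((read_len - 1))  # drop only the final base
echo "--p-trunc-len-f $trunc_len --p-trunc-len-r $trunc_len"
```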

Check out the source code from q2-dada2... you can generate QZAs if you follow their complete pipeline.

Otherwise I think you just need to wait — you are denoising multiple libraries and it will take time.

Good luck!

mudbugecology · December 18, 2018, 3:31pm

I am wondering whether I am experiencing the same problem. When I ran the denoising step on my PC virtual machine at the workshop last week, Evan thought that running it in that machine caused it to crash (we increased the memory and CPUs for the virtual machine and it still crashed on this step) and suggested running it in the Windows Subsystem for Linux or on my Mac. I was able to install QIIME 2 on my Mac and went through the steps up to the following dada2 denoise-paired step, which has now been running for about 48 hours without finishing.

(qiime2-2018.11) Kyles-MacBook-Pro:crayfishmicrobiome2018 kyleharris$ qiime dada2 denoise-paired --i-demultiplexed-seqs demux.qza --p-trim-left-f 0 --p-trim-left-r 0 --p-trunc-len-f 250 --p-trunc-len-r 250 --o-table table.qza --o-representative-sequences rep-seqs.qza --o-denoising-stats denoising-stats.qza

[Uploading: demux.qza...]

Any advice, or should I simply let it keep running?

Thank you, Kyle

mudbugecology · December 18, 2018, 8:42pm

Uploading: demux.qza...

Nicholas_Bokulich · December 18, 2018, 8:44pm

Hi @mudbugecology,
It looks like your demux.qza file did not upload correctly — it is probably too large for the site to handle!

In any case, no news is good news as far as dada2 error messages go. Just leave it running — for large datasets using a single CPU it is not unusual to wait a few days...
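
If waiting is not an option, the command above could be rerun with more threads. A sketch, relying on the `--p-n-threads` option of `denoise-paired` (it defaults to 1, and 0 means all available cores); `with_threads` is a hypothetical helper that only prints the command, and 4 is an arbitrary illustrative value.

```shell
# Hypothetical helper: print the denoise-paired command from this post
# with --p-n-threads added. It only echoes; nothing runs.
with_threads() {
  n_threads=$1
  echo qiime dada2 denoise-paired \
    --i-demultiplexed-seqs demux.qza \
    --p-trim-left-f 0 --p-trim-left-r 0 \
    --p-trunc-len-f 250 --p-trunc-len-r 250 \
    --p-n-threads "$n_threads" \
    --o-table table.qza \
    --o-representative-sequences rep-seqs.qza \
    --o-denoising-stats denoising-stats.qza
}

with_threads 4
```

Restarting discards the progress of the current run, so this only pays off if the remaining single-threaded time is expected to be long.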

mudbugecology · December 18, 2018, 9:58pm

thanks @Nicholas_Bokulich! It is still running ...