DADA2 - very slow at learning error rates with 18S fungal data, need some advice

Hey everyone, new to the forums here. I'm using QIIME 2 to assign taxonomy to both 16S and 18S data I have. I've successfully gotten taxonomy for the 16S data, but now that I'm trying the same for my 18S data I'm running into an issue with DADA2: it takes around a day just to process one of my samples.

The data consists of Illumina paired-end 18S reads with the adapters trimmed off, 150 MB each for the forward and reverse reads. The command I'm running is:
qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --p-trunc-len-f 250 --p-trunc-len-r 160 --p-trim-left-r 0 --o-table {output.table} --o-representative-sequences {output.seqs} --o-denoising-stats {output.stats} --p-n-threads 0

Denoising only the forward reads leads to the same result.

I've loaded a sample in R to see where the problem was, and it gets stuck on learning the error rates:
errF <- learnErrors(derepFs, nbases=2e6, multithread=TRUE)
errR <- learnErrors(derepRs, nbases=2e6, multithread=TRUE)

53884560 total bases in 224519 reads from 1 samples will be used for learning the error rates.

I've found similar topics on the matter, such as:

Based on these, I've upgraded to the latest versions and made sure all cores are being used. I've checked with the source of the samples, and they should contain only fungal sequences without much contamination.

I've tried Deblur as well, against the SILVA 132 18S 99% reference set, but I killed that process too after 2 hours. The exact command was:
qiime deblur denoise-other --i-demultiplexed-seqs paired-end-demux.qza --i-reference-seqs ref-18S_SILVA_132_99.qza --o-table deblur-table.qza --o-representative-sequences rep-seqs.qza --o-stats stats.qza --p-trim-length -1 --p-jobs-to-start 4

I know it can simply take several hours or days if I have a lot of unique sequences. If that is the case, are there any alternatives to the QIIME 2 denoisers that would work at the scale of this project? I'm looking to process 20 of these samples in one day.

I'm running QIIME 2 version 2019.7 in a conda environment on an Ubuntu VM in VirtualBox.

Thanks in advance!

Good afternoon @wjschuiten

Welcome to the QIIME 2 forums!

I'm glad you have done your research into this issue and looked up related issues on the forums. I think you have correctly identified the issue.

This step needs a lot of RAM, but you can decrease the RAM requirement by passing a lower number to --p-n-reads-learn, say
--p-n-reads-learn 10000 (10k)
instead of the default of
--p-n-reads-learn 1000000 (1 million)

How much RAM does your VM have? You should try to give it as much RAM as possible.
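
As a rough sketch, your original command with the lower --p-n-reads-learn value added could look like the snippet below. The output file names (table.qza, rep-seqs.qza, denoising-stats.qza) are placeholders I made up in place of your {output.*} variables, and the snippet only prints the command so you can check it before running anything.

```shell
# Sketch only: the denoise-paired command from the first post with
# --p-n-reads-learn lowered from its default of 1000000 to 10000.
# Output file names are placeholders, not taken from the original post.
N_READS_LEARN=10000

CMD="qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trunc-len-f 250 \
  --p-trunc-len-r 160 \
  --p-trim-left-r 0 \
  --p-n-reads-learn $N_READS_LEARN \
  --p-n-threads 0 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza"

# Print the command for review instead of running it.
echo "$CMD"
```

Once the printed command looks right, paste it into the terminal where your QIIME 2 conda environment is active.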

Colin

Hey @colinbrislawn, thanks for your reply

I’ve run DADA2 outside of my VirtualBox VM in R. I have 8 cores and 16 GB of RAM (the VM gets 10 GB; I can push that to 14 GB, but it doesn’t seem to make much of a difference). While learning error rates it uses about 3 GB of RAM with nbases set to 1 million, but after 1 hour it’s still not done with the forward reads of one sample (~200 MB). Lowering nbases doesn’t seem to significantly affect the process other than lowering RAM usage.

So from what I’ve understood, it’s possible that DADA2 may just need this long for these samples, in which case I’m curious what alternatives I have to get a decent taxonomy within a manageable time frame.

Thank you,

Wouter

Hello Wouter,

OK, it sounds like you do have enough RAM, so that’s good.

This process does take some time, and I think that’s normal. As long as your RAM is not fully used, this process should be able to finish.

What is the total file size of your data set?

Colin

Hey Colin,

The current dataset is 8 samples, with the forward and reverse read files being 200 MB each, so 3.2 GB in total. Depending on the soil, we can get up to 20 sequenced samples, so up to 8 GB of data.

If I run a single-read sample, I can get DADA2 to finish in 2 hours per read file. So that would translate to 10 hours per gigabyte? We’re looking to get taxonomy for 20 samples per day (if possible).

As far as I can see, this is normal. Would you have any suggestions or alternative routes to speed up the process?
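
Spelled out as a quick calculation, assuming 2 hours per 200 MB read file and that the time scales linearly when every file is run on its own (which may well not hold for a combined run):

```shell
# Back-of-the-envelope runtime estimate from the numbers in this thread.
# Assumption: each read file is processed separately and time scales linearly.
HOURS_PER_FILE=2      # observed: ~2 hours per read file
MB_PER_FILE=200       # ~200 MB per forward or reverse read file
SAMPLES=20            # target number of samples per day
FILES_PER_SAMPLE=2    # forward + reverse

HOURS_PER_GB=$(( 1000 / MB_PER_FILE * HOURS_PER_FILE ))
TOTAL_GB=$(( SAMPLES * FILES_PER_SAMPLE * MB_PER_FILE / 1000 ))
TOTAL_HOURS=$(( SAMPLES * FILES_PER_SAMPLE * HOURS_PER_FILE ))

echo "Hours per GB: $HOURS_PER_GB"
echo "Total data for $SAMPLES samples: $TOTAL_GB GB"
echo "Estimated total hours: $TOTAL_HOURS"
```

At that rate, 20 samples would need far more than the 24 hours we have.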

Thank you,
Wouter

Hello Wouter,

I think DADA2 does a training phase where it learns the error rates of the run, then it denoises the full run. So the hours per gigabyte should be lower when processing 2000 MB than when processing 200 MB, because only one training stage is needed. You said you have 8 cores, so try running with all 8 threads and see if that keeps the RAM from maxing out.

I heard from a friend that running DADA2 directly in R is faster than running it through the QIIME 2 plugin. I’m not sure if this is still the case, but if you are comfortable using R, you could try the DADA2 big data tutorial and see how it works for your data.
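
Since you are running inside VirtualBox, it may be worth checking how many cores and how much RAM the VM actually sees, because a VM is often given fewer cores than the host has. A minimal check, assuming a Linux guest where /proc/meminfo is available:

```shell
# Report the CPU cores and total RAM visible from inside the current environment.
# Run this inside the VM, in the same shell where QIIME 2 is used.
CORES=$(getconf _NPROCESSORS_ONLN)
echo "CPU cores visible here: $CORES"

# /proc/meminfo exists on Linux; skip the RAM check elsewhere.
if [ -r /proc/meminfo ]; then
  MEM_KB=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  echo "Total RAM: $(( MEM_KB / 1024 / 1024 )) GiB"
fi
```

If this reports fewer than 8 cores, the thread setting in QIIME 2 can only use the cores the VM has been assigned.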
https://benjjneb.github.io/dada2/bigdata.html

Colin