Dada2 - very slow on error rate with 18S fungal data, need some advice

wjschuiten · October 16, 2019, 11:35am

Hey everyone, new to the forums here. I'm using QIIME 2 to assign taxonomy to both 16S and 18S data I have. I've successfully gotten taxonomy with 16S, but now that I'm trying to do the same for my 18S data I'm running into an issue with DADA2: it takes around a day just to process one of my samples.

The data contains Illumina paired-end 18S reads with the adapters trimmed off, 150 MB each for the forward and reverse reads. The command I'm running is:
qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --p-trunc-len-f 250 --p-trunc-len-r 160 --p-trim-left-r 0 --o-table {output.table} --o-representative-sequences {output.seqs} --o-denoising-stats {output.stats} --p-n-threads 0

Denoising only the forward read leads to the same result.

I've loaded a sample in R to see where the problem was, and it gets stuck on learning the error rates:
errF <- learnErrors(derepFs, nbases=2e6, multithread=TRUE)
errR <- learnErrors(derepRs, nbases=2e6, multithread=TRUE)

53884560 total bases in 224519 reads from 1 samples will be used for learning the error rates.

I've found similar topics on the matter, such as:

github.com/benjjneb/dada2

learn Error issue

opened 02:16PM - 23 Jan 19 UTC

closed 08:23PM - 13 Mar 19 UTC

FloraVincent

Hi everyone, I have paired end data from illumina Miseq, 150 bp (115 after trimming). I just started with one sample to check all was good (not my first time with DADA2). I use dada2 to trim the primers using TrimLeft with the following command:
out <- filterAndTrim(fnFs[9], filtFs[9], fnRs[9], filtRs[9], truncLen=c(145,145), maxN=0, maxEE=c(2,2), truncQ=2, trimLeft = c(30,29), rm.phix=TRUE, compress=TRUE, multithread=TRUE, matchIDs = TRUE)
Below is the output of "errF <- learnErrors(filtFs[9], multithread=TRUE)" (filtFs[9] is just one of my fastQ R1):
Initializing error rates to maximum possible estimate.
Sample 1 - 154185 reads in 79013 unique sequences.
selfConsist step 2
selfConsist step 3
selfConsist step 4
It is taking a LOT of time compared to my previous runs, around 3h just to output the previous line. It's running on a big server, so I wonder if something is wrong, either in the data or in the commands. Several questions:
- Is it expected to take so much time? In which case I just let it run
- I suspect human contamination in the data set (that will reasonably increase the fastq file); would it help to get rid of the contaminants directly in the fastq files before all the dada processing?
Thanks for help and technical support on dada2! Flora

Based on these, I've upgraded to the latest versions and made sure all cores are being used. I've also checked with the source of the samples, and they should contain only fungal material without a lot of contamination.

I've tried deblur as well, against the SILVA 132 18S 99% reference set, but after 2 hours I killed that process too. The exact command was:
qiime deblur denoise-other --i-demultiplexed-seqs paired-end-demux.qza --i-reference-seqs ref-18S_SILVA_132_99.qza --o-table deblur-table.qza --o-representative-sequences rep-seqs.qza --o-stats stats.qza --p-trim-length -1 --p-jobs-to-start 4

I know it can simply take several hours or days if I have a lot of unique sequences. If that is the case, are there any alternatives to the QIIME 2 denoisers that would work at the scale of this project? I'm looking to process 20 of these samples in one day.

I'm running QIIME 2 version 2019.7 in a conda environment on Ubuntu inside VirtualBox.

Thanks in advance!

colinbrislawn · October 16, 2019, 6:01pm

Good afternoon @wjschuiten

Welcome to the QIIME 2 forums!

I'm glad you have done your research into this issue and looked up related issues on the forums. I think you have correctly identified the issue.

This step needs a lot of RAM, but you can decrease the RAM needed by passing a lower number to --p-n-reads-learn, say
--p-n-reads-learn 10000 (10k)
instead of the default of
--p-n-reads-learn 1000000 (1 million)
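
As a rough sketch of that suggestion, the denoise-paired call from the first post could be rerun with a smaller training set as shown here. The truncation settings and input name are copied from that post; the output file names and the DRY_RUN switch are assumptions made only for this illustration, and 10000 is just the example value mentioned above, not a tuned recommendation.

```shell
# Sketch only: the denoise-paired call from the first post with a smaller
# training set. Output file names are placeholders (assumptions).
N_READS_LEARN=10000   # example value from the post; the default is 1000000

set -- qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trunc-len-f 250 \
  --p-trunc-len-r 160 \
  --p-trim-left-r 0 \
  --p-n-reads-learn "$N_READS_LEARN" \
  --p-n-threads 0 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

planned_cmd="$*"

# DRY_RUN is a hypothetical switch for this sketch: by default the command is
# only printed; set DRY_RUN=0 to run it inside the QIIME 2 conda environment.
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "$planned_cmd"
else
  "$@"
fi
```

Printing the command first makes it easy to confirm the flags before committing to a run that may take hours.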

How much RAM does your VM have? You should try to give it as much RAM as possible.

Colin

wjschuiten · October 21, 2019, 10:58am

Hey @colinbrislawn, thanks for your reply

I've run DADA2 in R outside of my VirtualBox VM. I have 8 cores and 16 GB of RAM (the VM takes 10 GB; I can push it to 14, but that doesn't seem to make much of a difference). While learning error rates it uses about 3 GB of RAM with nbases at 1 million, but after 1 hour it's still not done with the forward read of one sample (~200 MB). Lowering nbases doesn't seem to significantly affect the process other than lowering RAM usage.

So from what I've understood, it's possible that DADA2 may just need this long for these samples, in which case I'm curious what alternatives I have to get a decent taxonomy within a manageable time frame.

Thank you,

Wouter

colinbrislawn · October 21, 2019, 1:21pm

Hello Wouter,

OK, it sounds like you do have enough RAM, so that's good.

This process does take some time, and I think that's normal. As long as your RAM is not fully used, this process should be able to finish.

What is the total file size of your data set?

Colin

wjschuiten · October 21, 2019, 2:57pm

Hey Colin,

The current dataset is 8 samples, with the forward and reverse reads being 200 MB each, so 3.2 GB. Depending on the soil, we can get up to 20 sequenced samples, so up to 8 GB of data.

If I run a sample as single reads, I can get DADA2 to finish in 2 hours per read file. So that would translate to 10 hours per gigabyte? We're looking to get taxonomy for 20 samples per day (if possible).

As far as I can see, this is normal. Would you have any suggestions or alternative routes to speed up the process?

Thank you,
Wouter
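
The arithmetic behind those figures can be written out as a small sketch. It assumes runtime scales linearly with file size and that every read file costs the same 2 hours, which is only a naive extrapolation from the numbers in the post; a full run that learns error rates once may well do better per gigabyte.

```shell
# Naive linear extrapolation from the figures quoted in the post.
# Assumption: runtime grows linearly with data size (likely pessimistic).
mb_per_read_file=200     # size of one forward or reverse read file
hours_per_read_file=2    # reported DADA2 runtime for one read file
samples=20               # target batch size per day
files_per_sample=2       # forward + reverse

hours_per_gb=$(( hours_per_read_file * 1000 / mb_per_read_file ))
total_mb=$(( samples * files_per_sample * mb_per_read_file ))
total_hours=$(( samples * files_per_sample * hours_per_read_file ))

echo "hours per GB:   $hours_per_gb"
echo "batch size:     $total_mb MB"
echo "naive estimate: $total_hours hours for $samples samples"
```

Under these assumptions the sketch gives 10 hours per GB and 80 hours for a 20-sample, 8 GB batch, well over the one-day target.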

colinbrislawn · October 21, 2019, 3:04pm

Hello Wouter,

I think DADA2 does a training phase where it learns the error rate of the run, then it denoises the full run. So hours per gigabyte should be lower when processing 2000 MB vs 200 MB, because only one training stage is needed. You said you have 8 threads, so try running on all 8 threads and see if that keeps the RAM from maxing out.

I heard from a friend that running DADA2 directly in R is faster than running it through the QIIME 2 plugin. I'm not sure if this is still the case, but if you are comfortable using R, you could try the DADA2 big data tutorial and see how it works for your data.
https://benjjneb.github.io/dada2/bigdata.html

Colin