Time of running DADA2

24 hr with 50 threads does sound quite long for a normal-sized dataset. Are you running this on a compute cluster that can allocate that many threads to you? You may want to check on the job's resource use to make sure everything is okay.
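As a rough way to check this on a Linux server, here is a minimal sketch that reads the "Threads:" field from /proc/<pid>/status. The function names are hypothetical, and the field only reports how many threads the process has, not how many are busy, so it is best read alongside top or htop.

```python
import os


def thread_count(status_text):
    """Return the 'Threads:' value from /proc/<pid>/status text, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    return None


def threads_of(pid):
    """Read the current thread count of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return thread_count(handle.read())


if __name__ == "__main__":
    # Demonstrate on this Python process; substitute the PID of the running job.
    pid = os.getpid()
    if os.path.exists(f"/proc/{pid}/status"):
        print("threads:", threads_of(pid))
```

Running it a few times during the job would show whether the thread count drops after the first stage.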

Other than that, no news is good news as far as dada2 is concerned. It can take time to run, and multiple days is not unusual.

Good luck!

Thanks for your reply. I am running this step on a server (max worker threads is 80), but I find that only 1 thread is being used now. Is there any other command I should add to make sure multiple threads are used?

No, there is not another command that is needed. This is probably a server-side issue; e.g., you may need to request an appropriate amount of resources when running the job — you should consult your server admins to discuss this.

Another possibility is that the dada2 pipeline is currently at a step that cannot run on multiple threads, and it was using multithreading at an earlier step that is now finished.

Even if your job is only running on a single thread, it should probably finish running in the next day or so...

I hope that helps!

The server is in our lab, so I am sure I can use multiple threads. I reran the command and found that multiple threads were used only for a very short time at the beginning. I wonder whether that is reasonable. I hope my job will finish without error soon~

Hi ucassee,

Out of curiosity, can I ask whether your sequences are from Illumina? If so, 2x300bp?
The reason is that I cannot figure out how you can use '--p-trunc-len 380': isn't this going to discard all the sequences? If not, is it because you merged the sequences elsewhere? If so, you should use deblur instead of dada2.
Luca

Thanks for your reply. I used MiSeq (2x300bp) to sequence my data, and I used PEAR to merge my paired-end sequences before QIIME 2. After demultiplexing my data, I found that "--p-trunc-len 380" is a suitable parameter to retain most of my data. Why can't I use DADA2 to denoise my data in this situation?

Hi,
because any merging tool changes the quality profile in the merged region, and therefore the underlying error profile; @Nicholas_Bokulich may be able to be more precise on this. The recommendation is to use the deblur plugin in this case: https://docs.qiime2.org/2018.11/tutorials/read-joining/#deblur .
Whether this explains why the process is taking so long, I cannot tell.

Luca

Hi Luca,
I tried to analyze my merged data sequenced on MiSeq (2x300bp) with Deblur, but I found that it removes nearly 70% of my data. That is why I use dada2.

I got it wrong: I used HiSeq (2x250bp) to sequence my data this time, and it has been running for over 48 hours at the dada2 step and still has not finished.

Hi,

Overall, 48 hrs is still not an unreasonable length of time. Myself, I don't expect anything less than that for denoising with dada2 (I use '--p-n-threads 0' so any available processor will be used), but it then depends on the complexity of the samples.

Luca

Hi,

Do you remember how many threads dada2 was using while it ran? Even when I use "--p-n-threads 50", it runs with only one thread most of the time, except at the beginning of the job. I don't know whether there is some mistake.

Yingli

@llenzi is correct — dada2 should not be used on pre-joined reads, since joining disrupts the error profile that dada2 uses to predict sequencing errors. This is only an issue for dada2; deblur can operate on joined reads.

70% loss is not unusual, and dada2 may well yield similar results. You should examine the output file to determine where these reads are being lost. It could indicate that many of your reads are shorter than the truncation length you chose; are chimeric; or are non-target DNA.
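For the first of those possibilities, a quick check is to count how many reads fall below the truncation length before denoising. The following is only a sketch: it assumes an uncompressed FASTQ with exactly four lines per record, and the function names are made up for illustration.

```python
import io


def read_lengths(handle):
    """Yield the sequence length of every record in a four-line-per-record FASTQ."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record holds the sequence
            yield len(line.strip())


def fraction_shorter(handle, trunc_len):
    """Return the fraction of reads whose length is below trunc_len."""
    lengths = list(read_lengths(handle))
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n < trunc_len) / len(lengths)


if __name__ == "__main__":
    # Two toy reads of length 4 and 8; with a truncation length of 6, half are too short.
    toy = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n")
    print(fraction_shorter(toy, 6))  # prints 0.5
```

On real data, pass an open file handle for the exported FASTQ and the truncation length you intend to use, e.g. 380.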

Yes, something does not sound right here. As far as I can tell, multithreading is used at most stages of the dada2 pipeline, so your job status is not consistent with expectations. Can you give us more information about your server? (number of cores, operating system, RAM, etc?)

For the same demux.qza file, I tried both Deblur and dada2. The result is 30% loss with dada2 but 70% loss with deblur.

My server has 80 cores, runs Linux version 3.10.0-327.el7.x86_64, and has 2 TB of RAM. When I use the same command on other, smaller data, multithreading is used at most stages and the job finishes in a shorter time. The subsequent workflow is also normal.

Hi,
I found that a few sequences in this data contain N. I wonder whether N could cause the abnormal multithreading behavior in dada2?
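To put a number on "a few", a small script like the one below can count the affected reads. It is a minimal sketch that assumes an uncompressed FASTQ with four lines per record; the function name is hypothetical.

```python
import io


def count_reads_with_n(handle):
    """Return (reads containing at least one N, total reads) for a four-line FASTQ."""
    total = 0
    with_n = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # sequence line
            total += 1
            if "N" in line.upper():
                with_n += 1
    return with_n, total


if __name__ == "__main__":
    # Three toy reads, one of which contains an ambiguous base.
    toy = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACNT\n+\nIIII\n@r3\nGGCC\n+\nIIII\n")
    print(count_reads_with_n(toy))  # prints (1, 3)
```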

So dada2 was successful in the end? That is good to hear.

I suspect your multithreading issue may have more to do with how your server is configured and/or if you were requesting more threads than were currently available (e.g., if other processes were running), causing your dada2 run to lag.

I doubt it, though I am not sure what effect this will have on dada2.

The filtering results I mentioned came from processing my other data. I have found the reason for the multithreading issue in this data: some sequences in this data are duplicated, so even while dada2 is running, it cannot use multiple threads and does not report an error. When I deleted the duplicated sequences, dada2 finished in nearly 5 hours using 50 threads. So I suggest you add some checking mechanism before dada2.

Hi @ucassee,
Interesting report. Could you please provide a minimum working example (e.g., make a small dataset containing only the sequences that cause problems, and confirm that that filtered dataset still produces the same error) and the commands used to reproduce that error? This will allow us to test locally.
Thanks!

That is okay~ I still don't understand why dada2 can't be used to denoise joined reads. Could you give me some more detailed information about that?
By the way, I found that in your tutorial deblur also removes many more reads than dada2. That is why I always use dada2 to denoise my data. But it seems unreasonable.
sample-metadata.tsv (250 Bytes)
test.fastq (397.9 KB)

Thanks @ucassee. Could you please also provide me with the command that generates this error with these sequences? Please also send the QZA.

Hi Nicholas,
I found that if I make a small dataset, multithreading seems to work normally and dada2 finishes in a short time, but the duplicated reads in the data are still not checked out.
sample-metadata.tsv (337 Bytes)

Could you please clarify? I am not sure what you mean by checked out. Also, do these duplicate reads have duplicated sequence IDs? Or only the sequence is duplicated?
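One way to answer that question is to scan the FASTQ and report duplicated IDs and duplicated sequences separately. Identical sequences under different IDs are common in amplicon data, so the two cases are worth telling apart. This is only a sketch with a hypothetical function name, and it assumes an uncompressed FASTQ with four lines per record.

```python
import io
from collections import Counter


def duplicate_report(handle):
    """Return (sorted duplicated IDs, sorted duplicated sequences) from a four-line FASTQ."""
    ids = Counter()
    seqs = Counter()
    for i, line in enumerate(handle):
        if i % 4 == 0:
            # Header line: drop the leading '@' and keep the ID before any whitespace.
            fields = line[1:].split()
            ids[fields[0] if fields else ""] += 1
        elif i % 4 == 1:
            seqs[line.strip()] += 1
    dup_ids = sorted(k for k, n in ids.items() if n > 1)
    dup_seqs = sorted(k for k, n in seqs.items() if n > 1)
    return dup_ids, dup_seqs


if __name__ == "__main__":
    # Toy data: the ID r1 appears twice, and the sequence ACGT appears twice.
    toy = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n@r1\nGGCC\n+\nIIII\n")
    print(duplicate_report(toy))  # prints (['r1'], ['ACGT'])
```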

If you can't generate a minimum working example and command, I will not be able to troubleshoot locally.