Long-running dada2 denoise

fvnieuwe · December 8, 2016, 5:39pm

Hi Thermokarst,

Thanks! I used your suggestions, and I was indeed able to import the per-sample paired-end data using:
qiime tools import --type 'SampleData[SequencesWithQuality]' --input-path ./../raw_data_renamedforqiime_R2removed/ --output-path ./raw-sequences.qza --source-format CasavaOneEightSingleLanePerSampleDirFmt
However, the denoise step appears to hang on this data (one core has been at 100% for a day) when I run:
qiime dada2 denoise --i-demultiplexed-seqs raw-sequences.qza --p-trim-left 0 --p-trunc-len 290 --o-representative-sequences rep-seqs --o-table table
What could be the problem?

I also merged my paired-end data using PEAR. I renamed and imported as suggested. I am able to do the denoise step on this. Is this a valid approach?

Kind regards,
Filip

jairideout · December 8, 2016, 6:21pm

Hi @fvnieuwe: I suspect that your command isn't hanging so much as taking a really, really long time to complete. Part of the reason it is so slow is that QIIME 2 currently runs dada2 in single-threaded mode. We are working on some upstream changes to support multi-threaded dada2 runs within QIIME 2, and we hope to have that ready by the next release (currently scheduled for Q1 2017).

If you want to confirm that dada2 denoise isn't actually hanging, you might try running it on a smaller dataset to confirm that it completes successfully. For example, you could import a couple of your samples into a new .qza file and try denoising that.
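
A rough sketch of that small-dataset check, reusing the import and denoise commands from earlier in this thread. The `make_subset` helper, the `subset_fastq` directory and the `subset-*` output names are made up for illustration, and the sketch assumes the source directory holds one `.fastq.gz` file per sample:

```shell
#!/bin/sh
# Copy at most MAX .fastq.gz files from SRC_DIR into DST_DIR.
# (Hypothetical helper, not part of QIIME 2.)
make_subset() {
    src_dir=$1
    dst_dir=$2
    max=$3
    mkdir -p "$dst_dir"
    count=0
    for f in "$src_dir"/*.fastq.gz; do
        [ -e "$f" ] || continue            # the glob matched nothing
        [ "$count" -lt "$max" ] || break   # stop once MAX files are copied
        cp "$f" "$dst_dir"/
        count=$((count + 1))
    done
}

SRC=./../raw_data_renamedforqiime_R2removed   # directory from the original import command
SUBSET=./subset_fastq                         # assumed name for the small test set

# Only run the QIIME 2 steps when the data and the qiime command are present.
if [ -d "$SRC" ] && command -v qiime >/dev/null 2>&1; then
    make_subset "$SRC" "$SUBSET" 2
    qiime tools import --type 'SampleData[SequencesWithQuality]' --input-path "$SUBSET" --output-path ./subset-sequences.qza --source-format CasavaOneEightSingleLanePerSampleDirFmt
    qiime dada2 denoise --i-demultiplexed-seqs subset-sequences.qza --p-trim-left 0 --p-trunc-len 290 --o-representative-sequences subset-rep-seqs --o-table subset-table
fi
```

If the two-sample run finishes, the full run is most likely just slow rather than stuck.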

@gregcaporaso posted some options for working around the (currently very slow) dada2 denoise step. In addition to his suggestions, you could try running the underlying dada2 R tool directly, and then import your denoised data to continue analysis with QIIME 2.

Note that if your dataset was generated from multiple MiSeq runs, you'll want to use @gregcaporaso's third suggestion in that linked post. dada2 works best when each MiSeq run is denoised separately, so you'll get better results, and in less time, because the runs can be denoised in parallel and the results merged afterwards.
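
One way to run the per-run denoising in parallel from a single shell is sketched below. It assumes each MiSeq run has already been imported into its own file (`run1.qza`, `run2.qza`, ...); those file names and both helper functions are hypothetical, and the denoise parameters are simply copied from the command earlier in this thread. The merge step is left out because the exact merge command depends on your QIIME 2 release.

```shell
#!/bin/sh
# Denoise a single run. Argument: run name, e.g. "run1" for run1.qza.
# (Hypothetical wrapper around the denoise command used in this thread.)
denoise_one_run() {
    qiime dada2 denoise --i-demultiplexed-seqs "$1.qza" --p-trim-left 0 --p-trunc-len 290 --o-representative-sequences "$1-rep-seqs" --o-table "$1-table"
}

# Start CMD once per remaining argument as a background job, then wait for
# all of them. Returns non-zero if any job failed.
run_in_parallel() {
    cmd=$1
    shift
    pids=""
    for item in "$@"; do
        "$cmd" "$item" &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return "$status"
}
```

With those defined, `run_in_parallel denoise_one_run run1 run2` would start one single-threaded denoise job per run, so it only helps if you have at least as many free cores as runs.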

> I also merged my paired-end data using PEAR. I renamed and imported as suggested. I am able to do the denoise step on this. Is this a valid approach?

This approach is not recommended because dada2 needs the unjoined reads in order to produce the best results. Unfortunately we don’t have support for that hooked up yet; we expect paired-end dada2 support in the next release in addition to multithreading support.

For now, you’ll need to pick either R1 or R2 reads, import them into a .qza file, and denoise that. Another option is to join your reads (e.g. with PEAR, QIIME 1's join_paired_ends.py, or some other read-joining approach) and cluster/denoise the joined reads with a different tool that supports sequence data that has already been joined. For example, you could do all these steps in QIIME 1 (e.g. join_paired_ends.py, pick_open_reference_otus.py) and then import the resulting .biom file into QIIME 2.

Apologies that this process is a pain to work around right now -- the next QIIME 2 release should make this much easier. Let us know how it goes!

fvnieuwe · December 14, 2016, 5:05pm

Hello Jairideout,

The process was indeed not hanging. I ran the analysis on only one of my 63 samples, and it finished without errors after several hours. The full 63-sample dataset is still running (for 5 days now).

I used your suggestion to generate the .biom file in QIIME 1 and import the table into QIIME 2. This seems to work.

Kind regards,
Filip

jairideout · January 19, 2017, 9:06pm

A post was split to a new topic: Does QIIME 2 support demultiplexing with usearch?

thermokarst · August 3, 2017, 1:07pm

An off-topic reply has been split into a new topic: DADA2: Anticipated runtime?

Please keep replies on-topic in the future.

cdevera · September 1, 2017, 2:08am

Hi all,

If you are wondering whether or not your dada2 denoise is running, you can open up another Terminal window, type 'top' and hit enter. This will show you what commands are being run. Look for 'R' in the 'Command' column and you'll be able to see the %CPU usage. You can also look down the "State" column to see which commands are "running" if you are having a difficult time finding the specific command.
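
If you would rather get a one-off listing than watch 'top', something like the sketch below should work. It assumes the dada2 step shows up as a process whose command name is `R` (as described above); `filter_r_processes` is a made-up helper name, and the `ps` column names may need adjusting on your system.

```shell
#!/bin/sh
# Keep the header line plus any row whose last field is "R" or a path ending in "/R".
# (Hypothetical helper; reads ps-style output on stdin.)
filter_r_processes() {
    awk 'NR == 1 || $NF ~ /(^|\/)R$/'
}

# List all processes with PID, CPU usage, elapsed time and command name,
# then keep only the R ones.
ps -A -o pid,pcpu,etime,comm | filter_r_processes
```

An R row with high CPU usage and a growing elapsed time suggests the denoise step is still working.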

Hope this helps!

thermokarst · May 14, 2018, 3:26pm

An off-topic reply has been split into a new topic: Long-running DADA2 issue

Please keep replies on-topic in the future.