Dada2 Denoise for Large Data Set - Running 7 days - Too Long?

reige012 · March 14, 2019, 2:53pm

Hello,

I've been running a HiSeq data set (2x250 bp) with forward and reverse read files of ~138 GB each. I know this is a huge chunk of data, and I can't find much info on QIIME 2 runs with this type of data set. I'm currently on the dada2 denoise step, and it has been running on our supercomputer cluster with 16 threads for just shy of 7 days.
Command used:
qiime dada2 denoise-paired --i-demultiplexed-seqs demux.qza --p-trim-left-f 0 --p-trunc-len-f 250 --p-trim-left-r 0 --p-trunc-len-r 250 --p-chimera-method consensus --o-representative-sequences rep-seqs-dada2.250.qza --o-table table-dada2.250.qza --p-n-threads 16 --o-denoising-stats denoising-stats-dada2-250.qza

I can't really figure out how to tell exactly where the command is in its progress, and I stupidly did not pass the --verbose flag. I did find a temp file that says the following:

Filtering The filter removed all reads: /ddnB/work/areige1/Ch1Qiime/temp/tmpwqvw9sli/filt_f/G33MB.B_455_L001_R1_001.fastq.gz and /ddnB/work/areige1/Ch1Qiime/temp/tmpwqvw9sli/filt_r/G33MB.B_455_L001_R2_001.fastq.gz not written.
Some input samples had no reads pass the filter.
..........................................................................................................................................................................................................................................................x............................................................................................................................................................................................................
Learning Error Rates
2a) Forward Reads
Initializing error rates to maximum possible estimate.
Sample 1 - 237205 reads in 56239 unique sequences.
Sample 2 - 287159 reads in 70043 unique sequences.
Sample 3 - 50801 reads in 21595 unique sequences.
Sample 4 - 164914 reads in 32588 unique sequences.
Sample 5 - 329611 reads in 48024 unique sequences.
selfConsist step 2
selfConsist step 3
selfConsist step 4
selfConsist step 5
Convergence after 5 rounds.
2b) Reverse Reads
Initializing error rates to maximum possible estimate.
Sample 1 - 237205 reads in 51137 unique sequences.
Sample 2 - 287159 reads in 64957 unique sequences.
Sample 3 - 50801 reads in 20618 unique sequences.
Sample 4 - 164914 reads in 32320 unique sequences.
Sample 5 - 329611 reads in 54306 unique sequences.
selfConsist step 2
selfConsist step 3
selfConsist step 4
Convergence after 4 rounds.
Denoise remaining samples .............................................................................................................................................................................................................................................................................................................................................................................................................................

Does this mean that it's still denoising? I can see that the average load per thread changes every so often, so I'm assuming that the command is working, but is it possible it's stuck in some type of loop and just lingering forever?

I have limited time available on the supercomputer, so I'm trying to figure out what my options are if I can't get this job to finish in the next 96 hrs. Does this seem like an excessive amount of time to be running dada2 on this much data?

Thank you!
Alicia Reigel

Mehrbod_Estaki · March 14, 2019, 8:17pm

Hi @reige012,
It looks like DADA2 is carrying on as expected. It has created the error models and is now denoising your samples. The remaining steps would be merging and chimera removal. Unfortunately, there's no way to know how much longer is left, and given that you have a massive dataset, I would say this timeline is totally expected with DADA2.
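It won't give a time estimate, but counting the progress marks in the temp log gives a rough idea of how far along the denoising is. The sketch below assumes the log looks like the excerpt in the question: a line made up only of "." and "x" after filtering (one mark per sample, "x" for a sample with no reads left) and one "." per finished sample after "Denoise remaining samples". That reading of the marks is an assumption, and the function names are made up for illustration. The total may also be slightly off if the samples used for learning the error rates are not counted again.

```shell
# dots_done LOGFILE: number of "." printed after "Denoise remaining samples".
dots_done() {
  grep 'Denoise remaining samples' "$1" \
    | sed 's/.*Denoise remaining samples//' \
    | tr -cd '.' | wc -c | tr -d ' '
}

# samples_kept LOGFILE: number of "." on the filtering progress line,
# i.e. the first line that consists only of "." and "x" characters.
samples_kept() {
  grep -E '^[.x]+$' "$1" | head -n 1 | tr -cd '.' | wc -c | tr -d ' '
}

# progress_report LOGFILE: one-line summary built from the two counts above.
progress_report() {
  echo "$(dots_done "$1") of $(samples_kept "$1") samples denoised"
}
```

Calling `progress_report` with the path to the temp log prints the two counts side by side. If the first number is close to the second, the per-sample denoising is probably nearly done.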
At this point I would just let it carry on as long as you can and hope for the best. In the future, it might be worth having a look at the DADA2 tutorial for big data. I know that in previous QIIME 2 versions, DADA2 ran much slower than its native counterpart in R. I'm not sure whether that is still the case, but if you can't finish the run in the time you have been allocated, you could try again in R directly. As a heads up, multithreading in the DADA2 pipeline is currently only supported on Mac/Linux computers.
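If the job does have to be rerun, adding --verbose and sending the output to a file keeps the progress messages somewhere easy to watch. A minimal sketch, reusing the arguments from the original command; the log file name and the `run_denoise` wrapper are placeholders, not anything QIIME 2 requires:

```shell
# Sketch only: same arguments as the original command, plus --verbose.
# LOG and run_denoise are hypothetical names used for illustration.
LOG="dada2-denoise.log"

run_denoise() {
  qiime dada2 denoise-paired \
    --i-demultiplexed-seqs demux.qza \
    --p-trim-left-f 0 --p-trunc-len-f 250 \
    --p-trim-left-r 0 --p-trunc-len-r 250 \
    --p-chimera-method consensus \
    --p-n-threads 16 \
    --o-representative-sequences rep-seqs-dada2.250.qza \
    --o-table table-dada2.250.qza \
    --o-denoising-stats denoising-stats-dada2-250.qza \
    --verbose
}

# Only launch when QIIME 2 is on the PATH; stdout and stderr both go to the log.
if command -v qiime >/dev/null 2>&1; then
  run_denoise > "$LOG" 2>&1
fi
```

While it runs, `tail -f dada2-denoise.log` shows the messages as they arrive.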
