I'm a new QIIME 2 user working on a dataset with 22 million paired-end reads across 3 samples. I've been running the whole QIIME 2 pipeline on 16 nodes with 128 GB RAM. After almost a week the analysis still hadn't finished, and when I looked at the log file it was still at the dada2 step. Should I keep waiting? I've seen in the dada2 documentation that millions of reads across few samples usually take a long time to process. I wanted to subsample my dataset, but my reads are non-demultiplexed, so I'm not sure how to get ~3 million reads without much variation in distribution across samples, or ~1 million reads per sample. Could you please suggest a quick solution to this problem? Thanks in advance.
Can you clarify what you mean by subsampling your dataset? And are your reads demultiplexed at the dada2 step? In other words, how did you import and demultiplex your reads prior to dada2? I ask because dada2 requires samples to be demultiplexed before denoising.
Are you setting the --p-n-threads parameter in your dada2 command to enable multithreading? This is internal to q2-dada2 and separate from what you request from your grid. If it isn't set, the default is to use 1 core no matter how many are available, which would take a very long time with such a big dataset.
If you need further troubleshooting, please also copy and paste the exact commands you are running plus your qiime2 version. In most cases adding --verbose to your commands helps by giving more detailed error messages (though you aren't receiving any errors per se here; just for future reference).
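For reference, here is a minimal sketch of what a multithreaded dada2 call could look like, written as a small Python helper that assembles the command so the exact invocation can be logged and pasted into a post. The helper name, the file names (demux.qza, table.qza, and so on) and the truncation lengths of 0 (no truncation) are placeholders, not values from this thread; pick truncation lengths from your own quality plots. As far as I know the threads all run on one machine, so requesting more nodes does not speed up this step by itself.

```python
import shlex


def build_denoise_command(demux_qza, trunc_len_f, trunc_len_r, n_threads=0):
    """Assemble a `qiime dada2 denoise-paired` call with multithreading on.

    Hypothetical helper for illustration. n_threads=0 asks q2-dada2 to use
    all available cores; leaving the option out falls back to a single core.
    """
    return [
        "qiime", "dada2", "denoise-paired",
        "--i-demultiplexed-seqs", demux_qza,      # placeholder input artifact
        "--p-trunc-len-f", str(trunc_len_f),      # 0 = no truncation
        "--p-trunc-len-r", str(trunc_len_r),
        "--p-n-threads", str(n_threads),
        "--o-table", "table.qza",                 # placeholder output names
        "--o-representative-sequences", "rep-seqs.qza",
        "--o-denoising-stats", "denoising-stats.qza",
        "--verbose",                              # more detailed log output
    ]


cmd = build_denoise_command("demux.qza", 0, 0, n_threads=0)

# Print the exact command so it can be logged or shared when asking for help.
print(" ".join(shlex.quote(arg) for arg in cmd))

# To launch it from Python: import subprocess; subprocess.run(cmd, check=True)
```

The printed line can also be run directly in a shell, and `qiime info` reports the installed version to include alongside it.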
Since it's taking so long with 22 million reads, I wanted to run the analysis on a smaller subset of ~2-3 million reads. That's why I wanted to subsample the reads from the initial multiplexed data that I give as input to the qiime2 pipeline.
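If subsampling the multiplexed files directly is still wanted, one possible approach is sketched below. It assumes the multiplexed data are ordinary FASTQ files (for example forward, reverse and barcode files) with records in the same order in each; the function name and file names are hypothetical. Each record is kept with the same probability, and the same keep/drop decision is applied to every file so mates and barcodes stay in sync. Because every read has the same chance of being kept, each sample should retain roughly its original share of reads, but per-sample counts are not equalized.

```python
import gzip
import random
from itertools import islice


def _open(path, mode):
    """Open a plain or gzip-compressed file in text mode, based on extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)


def subsample_in_sync(in_paths, out_paths, fraction, seed=42):
    """Keep each FASTQ record with probability `fraction`.

    The same decision is applied to every input file, so the i-th record of
    each file is either kept everywhere or dropped everywhere.
    Returns (records_kept, records_seen).
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if len(in_paths) != len(out_paths):
        raise ValueError("need exactly one output path per input path")

    rng = random.Random(seed)
    kept = total = 0
    ins = [_open(p, "r") for p in in_paths]
    outs = [_open(p, "w") for p in out_paths]
    try:
        while True:
            # Read one 4-line FASTQ record from every file in lockstep.
            records = [list(islice(fh, 4)) for fh in ins]
            if not any(records):
                break  # all files ended together
            if any(len(rec) != 4 for rec in records):
                raise ValueError("truncated or unequal-length FASTQ input")
            total += 1
            if rng.random() < fraction:
                kept += 1
                for fh, rec in zip(outs, records):
                    fh.writelines(rec)
    finally:
        for fh in ins + outs:
            fh.close()
    return kept, total


# Example call (hypothetical file names), keeping about 3 of 22 million reads:
# subsample_in_sync(
#     ["forward.fastq.gz", "reverse.fastq.gz", "barcodes.fastq.gz"],
#     ["sub/forward.fastq.gz", "sub/reverse.fastq.gz", "sub/barcodes.fastq.gz"],
#     fraction=3 / 22,
# )
```

The output directory has to exist before the call, and the subsampled files can then be imported in place of the originals.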
To be honest, I'm not sure why it would take this long. In my experience dada2 shouldn't take that long for 22 million reads, so there may be something else going on here; perhaps one of the devs can comment. Which version of qiime2 is installed? I know that the most recent version, 2019.4, had huge performance enhancements in the q2-dada2 plugin, making it orders of magnitude faster.
As for testing on a smaller subsample, that's not a bad idea. A simpler approach might be to use the demux subsample-paired action after you import your demultiplexed reads, instead of subsampling the multiplexed reads.
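As a rough sketch of that route: the subsampling action takes a fraction rather than a read count, so the target of ~3 million out of 22 million reads first has to be converted. The artifact names below are placeholders and the helper function is hypothetical.

```python
def subsample_fraction(total_reads, target_reads):
    """Fraction of reads to keep in order to end up near `target_reads`."""
    if total_reads <= 0 or target_reads <= 0:
        raise ValueError("read counts must be positive")
    return min(1.0, target_reads / total_reads)


# 22 million reads in total, aiming for roughly 3 million.
fraction = subsample_fraction(22_000_000, 3_000_000)

cmd = [
    "qiime", "demux", "subsample-paired",
    "--i-sequences", "demux.qza",                      # placeholder input
    "--p-fraction", f"{fraction:.3f}",
    "--o-subsampled-sequences", "demux-subsampled.qza",  # placeholder output
]

# Print the command so it can be copied into a shell or a job script.
print(" ".join(cmd))
```

The subsampled artifact can then be passed to dada2 in place of the full one. Since the fraction applies to reads in every sample alike, the relative read distribution across the 3 samples should stay about the same.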