I am dealing with a dataset that, after importing with the fastq manifest option, results in a 17 GB .qza file. To process it I am using qiime vsearch join-pairs and qiime quality-filter q-score-joined, both of which are ideally suited to multi-threading but currently can't take advantage of multiple threads. Even on a high-compute node, those two steps take hours.
I am a little surprised by the lack of emphasis in QIIME 2 on efficient processing (e.g., parallelization or multi-threading), given the ever-increasing size of the datasets researchers are producing and want to process with tools like QIIME 2.
Sorry for the rant.
I’m sorry it’s taking so long; there are a lot of things in QIIME 2 that still need to be optimized. We’ve been very focused on making the functionality accessible first.
As some background:
We’re a pretty small team of developers and we can only do a handful of things at once. QIIME 1 was a massive codebase replete with features, knobs, and dials. Reproducing those steps (while adding modern additions) in a completely new architecture with new interfaces and project structure has been a lot of work. This necessarily comes at the cost of other nonfunctional requirements like performance.
One of the things we realized early on is that with modern ASV processing pipelines, there are significantly fewer features/OTUs than in a typical clustering pipeline in QIIME 1. This means that even if you have lots of reads, the downstream statistics and visualizations are usually quite fast. The only real bottleneck is the ASV method itself, and we’ve found its runtime is usually really good (@benjjneb has recently done some really awesome work to speed this up even more).
This meant that we knew we could defer some of the performance optimizations until later on, so while performance was very important it wasn’t the biggest fish in the pond.
Unfortunately, that only applies to downstream methods and visualizations. Methods like join-pairs and q-score-joined are both very recent additions, and happen before the ASV methods. (I suspect you are planning on using
This means performance was overlooked there and should be improved. Thank you so much for letting us know! Now we know where to direct our optimization efforts. We really rely on this forum to understand what needs improvement and what our users need.
It does look like we should be passing --threads along to vsearch (I’ve opened a new issue here).
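For reference, forwarding a threads option from plugin code down to the vsearch subprocess might look roughly like this. `--fastq_mergepairs`, `--reverse`, `--fastqout`, and `--threads` are real vsearch options; the wrapper name and signature here are hypothetical, for illustration only.

```python
def join_pairs_cmd(forward_fp, reverse_fp, output_fp, threads=1):
    """Build a vsearch read-merging command, forwarding the caller's
    thread count. A sketch: a real wrapper would then execute it with
    subprocess.run(cmd, check=True)."""
    return [
        "vsearch",
        "--fastq_mergepairs", forward_fp,
        "--reverse", reverse_fp,
        "--fastqout", output_fp,
        "--threads", str(threads),  # the option QIIME 2 should pass through
    ]
```

The point is simply that the plugin already shells out to vsearch, so exposing a `threads` parameter is mostly a matter of plumbing it through to the command line.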
As for q-score-joined, that may take some investigation: there is actually very little CPU work needed for that step, so more often than not you will be IO-bound, which means no amount of threading will improve the situation. We can’t be sure without some benchmarks, so I’ve opened an issue to explore that.
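A quick way to check whether a step is IO-bound is to compare its CPU time against its wall time: if the process accumulated little CPU time while the clock kept running, it spent most of its time waiting on IO, and adding threads won't help. A minimal sketch of that heuristic (the 50% threshold is an arbitrary assumption):

```python
import time

def classify_bound(func, *args):
    """Rough heuristic, not a rigorous benchmark: run func and compare
    CPU time to wall time. Mostly idle CPU suggests an IO-bound task."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    func(*args)
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    return "io-bound" if cpu < 0.5 * wall else "cpu-bound"

# Sleeping stands in for waiting on disk reads.
print(classify_bound(time.sleep, 0.2))  # → io-bound
```

Running something like this against q-score-joined on a large input would tell us whether the bottleneck is parsing/decompression (CPU) or disk throughput (IO) before anyone invests in threading it.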
Thanks, and Happy New Year!
I understand that the development team is rather small, but, as I’ve learned from experience, faster processes lead to faster testing, which in turn yields faster development progress (and, of course, happier users).
If you are open to pull requests, I’d be more than willing to help implement multiprocessing (MP) acceleration where it is critically needed.
We are always open to pull requests (of all kinds)!
I think one of the first steps is figuring out where MP can be effectively applied. I don’t think we have anything tracking this right now. So I would suggest we use this thread until a better scheme can be found. (Open to suggestions!)
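As a starting point, samples are independent in steps like read joining and quality filtering, so per-sample work is a natural unit for MP. A minimal sketch of that pattern, assuming the per-sample function is a placeholder and not real plugin code:

```python
from multiprocessing import Pool

def process_sample(sample):
    """Stand-in for a per-sample step (e.g., joining or filtering one
    sample's reads). Hypothetical placeholder computation."""
    name, reads = sample
    return name, reads * 2

def process_all(samples, processes=4):
    # Each sample is independent, so a Pool can fan them out across cores
    # and collect the results in input order.
    with Pool(processes=processes) as pool:
        return dict(pool.map(process_sample, samples))
```

The same structure would apply to any plugin action that iterates over samples serially today; the open question is which actions are CPU-bound enough for this to pay off.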