I’ve been trying to run feature-classifier extract-reads so that I can train my classifier on some COI reference sequences. The problem is that it ran for over a week without finishing. I noticed that with previous versions of QIIME 2, some people could add an extra parameter (--p-n-threads) to specify how many threads should be used. However, when I try that with my current version (qiime2-2019.1), I get an error stating: no such option: --p-n-threads.
Is this feature no longer available in the version I have? How can I work around this?
Maybe you could make this multithreaded yourself, by dividing your FeatureData[Sequence] into several parts and then running extract-reads on all of those parts at the same time?
That said, I suspect the reason this doesn’t have an n-threads option is that it’s mostly IO-bound: adding CPUs doesn’t make it read from the hard drive any faster. So I don’t know if splitting and merging is worth the effort here.
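For illustration, here is a minimal sketch of the splitting step only. It assumes the reference sequences have already been exported to a plain FASTA file; the function names and the `part_N.fasta` output naming are hypothetical and not part of QIIME 2. Each part would still need to be imported again, run through extract-reads as its own process, and merged afterwards.

```python
def split_fasta(lines, n_parts):
    """Distribute FASTA records round-robin across n_parts lists of lines."""
    parts = [[] for _ in range(n_parts)]
    record_index = -1
    for line in lines:
        if line.startswith(">"):
            # A header line starts a new record.
            record_index += 1
        if record_index < 0:
            # Ignore anything that appears before the first header.
            continue
        parts[record_index % n_parts].append(line)
    return parts


def split_fasta_file(path, n_parts, prefix="part"):
    """Write the parts to prefix_0.fasta, prefix_1.fasta, ... and return the paths.

    Note: this reads the whole file into memory, which is an assumption that
    may not hold for very large reference databases.
    """
    with open(path) as handle:
        parts = split_fasta(handle, n_parts)
    out_paths = []
    for i, part in enumerate(parts):
        out_path = f"{prefix}_{i}.fasta"
        with open(out_path, "w") as out:
            out.writelines(part)
        out_paths.append(out_path)
    return out_paths
```

Round-robin assignment keeps the parts roughly equal in record count, so no single part should dominate the total run time unless sequence lengths vary widely.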
I did something to confirm whether or not this operation is IO-bound (to the best of my ability).
Using the command:
> sudo iotop
I don’t see any significant IO activity, whereas running
> top
shows that nearly 100% of the CPU is being used. I’m taking this as an indication that it is CPU-bound. Is there somewhere I can confirm this in the documentation before I try splitting and then later merging my files?
I think this is solid evidence that this step is CPU-bound. Good detective work!
Estimating bottlenecks is hard, which is why we don't usually mention how much RAM, CPU, or IO is needed for a specific step. So your first-hand observation is better than the best documentation.
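The iotop/top check described above can also be approximated with a number, by comparing the CPU time a command consumed with the wall-clock time it took. The following is a rough sketch, assuming a Unix-like system (Python's `resource` module is not available on Windows); the helper name `cpu_to_wall_ratio` is hypothetical. For a single-threaded command, a ratio near 1 suggests it is CPU-bound, while a ratio well below 1 suggests it spends most of its time waiting, for example on disk.

```python
import resource
import subprocess
import sys
import time


def cpu_to_wall_ratio(command):
    """Run command and return (cpu_seconds, wall_seconds, cpu/wall ratio)."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(command, check=True)
    wall = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # User plus system CPU time consumed by the child process.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    ratio = cpu / wall if wall > 0 else 0.0
    return cpu, wall, ratio


if __name__ == "__main__":
    # Example: a command that mostly sleeps should give a low ratio.
    cpu, wall, ratio = cpu_to_wall_ratio(
        [sys.executable, "-c", "import time; time.sleep(0.5)"]
    )
    print(f"cpu={cpu:.2f}s wall={wall:.2f}s ratio={ratio:.2f}")
```

This is only a coarse indicator: system time includes some of the kernel work done for IO, and a multithreaded command can show a ratio above 1.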