Optimizing `qiime feature-classifier classify-sklearn`

Hi @nick-youngblut,

Thanks for posting! Sorry to hear that the classifier has been giving you some grief. It sounds like you have a large number of query sequences — are these OTUs or sequence variants? What reference database are you using? Out of curiosity, what type of samples are you analyzing?

This is what is proposed in this forum post to achieve meaningful parallelism.

@BenKaehler may have some insight into the rationale behind this setting.

Short answer: 40k is a very large number of features and will take a long time to run with any method.

Long answer: it is true that the naive Bayes classifier implemented by RDP is faster (though the gap narrows as more reference sequences are added), because it is written in Java. We have prioritized accuracy over speed because methods like dada2/deblur tend to greatly reduce feature counts by weeding out spurious observations, so the runtime of downstream steps (including sequence classification) drops to the point that parallelization isn't even necessary. Query sequence counts are only likely to run into the tens of thousands with 1) very large studies or 2) OTU picking without aggressive quality control, so classification runtime hasn't been much of an issue for most users.

Is your concern that parallelization is not reducing runtime to the expected degree?
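
In case it helps frame the question, here is a rough sketch of how batch-wise parallel classification generally behaves. This is not the q2-feature-classifier implementation: `make_batches`, `toy_classify`, `classify_in_batches`, and `busy_workers` are hypothetical names, the "classifier" is a stand-in that labels a sequence by its length, and the batch sizes are made-up numbers. The point is only that if all queries fit into fewer batches than there are workers, the extra workers have nothing to do.

```python
from concurrent.futures import ThreadPoolExecutor
from math import ceil


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def toy_classify(batch):
    # Hypothetical stand-in for a real classifier: "labels" each sequence by length.
    return [(seq_id, f"len_{len(seq)}") for seq_id, seq in batch]


def classify_in_batches(queries, batch_size, n_workers):
    """Classify batches concurrently and return results in input order."""
    batches = make_batches(queries, batch_size)
    # Threads keep the sketch simple; a CPU-bound classifier would need processes.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(toy_classify, batches))  # map preserves batch order
    return [hit for batch_result in results for hit in batch_result]


def busy_workers(n_queries, batch_size, n_workers):
    """Number of workers that receive at least one batch."""
    return min(n_workers, ceil(n_queries / batch_size))


if __name__ == "__main__":
    queries = [(f"q{i}", "ACGT" * (i % 3 + 1)) for i in range(10)]
    print(classify_in_batches(queries, batch_size=4, n_workers=3)[:2])
    # 40k queries, illustrative batch sizes only
    print(busy_workers(40_000, batch_size=100_000, n_workers=8))
    print(busy_workers(40_000, batch_size=5_000, n_workers=8))
```

With 40k queries and a batch size larger than the query count, only one of eight workers would get any work in this sketch; with a batch size of 5,000, all eight would. Whether that applies to your run depends on the batch size and job count you actually used.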
