I’m running q2cli version 2017.10.0 on Ubuntu 16.04.3. When I run qiime feature-classifier classify-sklearn on a dataset of ~40k sequence variants with silva-119-99-515-806-nb-classifier.qza as the classifier, the job takes many hours even though I’m using --p-n-jobs 20. I’m wondering why the default --p-reads-per-batch is 262144. With such a high reads-per-batch, multi-threading (--p-n-jobs) won’t actually be used unless --p-reads-per-batch is reduced to something reasonable (e.g., 1000 reads per batch). For instance, my dataset of ~40k sequence variants would fit in a single batch, and indeed only one thread is actually used even though I pass --p-n-jobs 20. I’ve tried --p-reads-per-batch 1000 --p-n-jobs 20, which does keep all 20 threads busy, but processing ~40k reads still takes hours.
My questions are: 1) why is the default --p-reads-per-batch so high? 2) why does qiime feature-classifier classify-sklearn take so long, even when using many threads?
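To illustrate the batching arithmetic behind question 1 (a plain-Python sketch of the idea, not actual q2cli code):

```python
import math

def n_batches(n_reads, reads_per_batch):
    # Number of chunks handed to the worker pool; workers beyond
    # this count sit idle no matter what --p-n-jobs is set to.
    return math.ceil(n_reads / reads_per_batch)

# ~40k sequence variants with the 262144 default: a single batch,
# so only one of the 20 requested jobs does any work.
print(n_batches(40_000, 262_144))  # 1
print(n_batches(40_000, 1_000))    # 40 batches -> all 20 jobs stay busy
```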
Thanks for posting! Sorry to hear that the classifier has been giving you some grief. It sounds like you have a large number of query sequences — are these OTUs or sequence variants? What reference database are you using? Out of curiosity, what type of samples are you analyzing?
Reducing --p-reads-per-batch is what this forum post proposes in order to achieve meaningful parallelism.
@BenKaehler may have some insight into the rationale behind this setting.
Short answer: 40k is a very large number of features and will take a long time to run with any method.
Long answer: it is true that the naive Bayes classifier implemented by RDP is faster (though the gap narrows as more reference sequences are added), because it is written in Java. We have chosen to optimize accuracy over speed, because methods like dada2/deblur tend to greatly reduce feature counts by weeding out spurious observations, which in turn shrinks the runtime of downstream steps (including sequence classification) to the point that parallelization isn’t even necessary. Query sequences only tend to run into the tens of thousands with 1) very large studies or 2) OTU picking without aggressive quality control, so classification runtime hasn’t been much of an issue for most users.
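For readers unfamiliar with where a naive Bayes sequence classifier spends its time, here is a minimal pure-Python sketch (illustrative only; the actual scikit-learn implementation is vectorized and differs in detail):

```python
import math
from collections import Counter

def kmer_counts(seq, k=8):
    # Overlapping k-mers are the features a multinomial naive Bayes
    # sequence classifier extracts from each query sequence.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def nb_score(counts, log_prior, log_probs, unseen=math.log(1e-9)):
    # Unnormalized log-posterior of one taxon for one query: every
    # k-mer is scored against that taxon's feature probabilities.
    return log_prior + sum(n * log_probs.get(kmer, unseen)
                           for kmer, n in counts.items())
```

Classifying a query means running nb_score once per reference taxon and taking the argmax, so runtime grows with both the number of queries and the size of the reference database, which is why a larger reference like SILVA is slower.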
Is your concern that parallelization is not reducing runtime to the expected degree?
Thanks for the quick response! The classifier database that I’m using is silva-119-99-515-806-nb-classifier.qza. My dataset consists of only 2 MiSeq runs (2x250) of human feces microbiome data. I generated sequence variants (SV) with dada2 on each MiSeq run independently, then merged the sequence variant tables. The resulting SV table contains ~40k features.
I’m soon going to process a feces microbiome dataset consisting of >3200 samples spanning dozens of MiSeq runs, and I’m worried that taxonomic classification is just not going to scale well (at least when using qiime feature-classifier classify-sklearn). Of course, the number of SVs shouldn’t grow exponentially, or even linearly, relative to my current 2-MiSeq-run dataset, but it will still be a good number more than 40k.
I’ve been doing a lot more metagenomics than 16S work lately, but given that classifiers like centrifuge can process millions of reads very rapidly, I’m surprised that qiime feature-classifier classify-sklearn takes so long. I don’t remember the RDP Classifier taking so long, but it’s been a while since I’ve used it.
I just wanted to make sure that I was running qiime feature-classifier classify-sklearn with the optimal settings for --p-reads-per-batch. The same goes for --p-pre-dispatch, although I don’t really understand that option.
Have you considered using one of the other feature classifiers in QIIME 2? There is the vsearch classifier, which does a search followed by last common ancestor (LCA) inference. It’s fully multithreaded and may be much faster.
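Conceptually, the LCA step collapses the taxonomies of the top vsearch hits to their deepest shared prefix. A hypothetical sketch of that idea (not the actual QIIME 2 code):

```python
def lca(taxonomies):
    # Last common ancestor of semicolon-delimited taxonomy strings:
    # keep ranks from the root down until the hits disagree.
    split = [t.split(';') for t in taxonomies]
    consensus = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break
        consensus.append(ranks[0])
    return ';'.join(consensus)

print(lca(["k__Bacteria;p__Firmicutes;g__A",
           "k__Bacteria;p__Firmicutes;g__B"]))
# k__Bacteria;p__Firmicutes
```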
--p-reads-per-batch was set very high to effectively turn batching off unless you opt into it. At the moment, load balancing is the responsibility of the user. There is an open ticket to at least warn users when --p-reads-per-batch is too high for their selected number of processes. We could perhaps extend this to make --p-reads-per-batch automatic by default, as it does seem to cause a lot of confusion.
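An automatic default could be as simple as splitting the queries evenly across the requested jobs (a sketch of one possible heuristic, not the actual implementation):

```python
import math

def auto_reads_per_batch(n_reads, n_jobs):
    # Split the queries evenly so every requested job
    # receives exactly one batch.
    return math.ceil(n_reads / n_jobs)

print(auto_reads_per_batch(40_000, 20))  # 2000
```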
Your run times do seem very long though. Does your machine have enough memory to run 20 processes concurrently?
Sadly, our testing (which will be published soon) indicates that classify-consensus-vsearch scales in the same way as classify-sklearn with the number of query sequences, so making that switch may not be a quick fix.
I speculate that classify-consensus-vsearch may be less memory-hungry, though, so if memory is the bottleneck it may help. I haven’t checked that, though.
I can’t really comment on whether vsearch is easier to explain to any individual than naive Bayes, but I should correct one point in your post: the naive Bayes classifier is not a “Bayesian” technique, at least not in the sense of the Bayesians vs. Frequentists feud. It is a machine learning technique. Sorry for being pedantic, but people get funny about these sorts of things and misconceptions have a way of sticking.
I’m using a server with 1TB of memory, and classify-sklearn seems to be using 10s of GB for the jobs I’m running, so memory doesn’t seem to be the issue.
Thanks for clarifying the “Bayes” versus “Bayesian”!
A member of my lab is working on a sequence classifier based on the Fourier transform. Based on his testing, it’s ~100% accurate down to the genus level and can process thousands of 16S V4 sequences in a couple of minutes (if not faster). Hopefully he will publish it within the next few weeks. Maybe it could be added as an alternative to the naive Bayes classifier (or the vsearch method).
Thanks @nick-youngblut! Your runtimes still sound off, and I wonder if parallelization is not functioning correctly. Our benchmarks found that the naive Bayes classifier could classify 2611 seqs/min, though this was with a smaller reference database (Greengenes rather than SILVA; SILVA takes a good deal more time and memory, though I have not benchmarked it, so I don’t have solid runtime estimates). Even if we assume that SILVA takes 10X the runtime, that comes out to roughly 150 min in a single job. Do you have the exact runtimes for 20 jobs vs. 1 job with your 40k seqs?
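The extrapolation above is straightforward arithmetic:

```python
# Extrapolating the benchmark: 2611 seqs/min on Greengenes, with an
# assumed (not measured) 10X penalty for the larger SILVA database.
seqs_per_min_gg = 2611
n_seqs = 40_000
silva_penalty = 10

single_job_minutes = n_seqs / seqs_per_min_gg * silva_penalty
print(round(single_job_minutes))  # 153 minutes, i.e. roughly 2.5 hours
```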
Certainly: we aim to implement an array of the best open-source methods in QIIME 2, so if this classifier is as good as you say, it fits the bill. We would need it benchmarked against the methods currently in QIIME 2 (this would be very straightforward if your colleague uses our test datasets and evaluation framework). Have your colleague contact me directly and we can discuss further.
Here’s my runtime, memory, and CPU usage for a classify-sklearn job on 33,591 sequence variants (SVs generated with dada2):
Job 731519 (taxonomy) Complete
User = nyoungblut
Queue = [email protected]
Host = node443
Start Time = 12/05/2017 09:29:24
End Time = 12/05/2017 13:06:03
User Time = 2:01:55:44
System Time = 00:14:55
Wallclock Time = 03:36:39
CPU = 2:02:10:40
Max vmem = 49.206G
Exit Status = 0
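From those numbers, the effective parallelism is just the CPU-to-wallclock ratio (plain Python, with the timestamps copied from the accounting above):

```python
def to_seconds(stamp):
    # SGE reports [days:]hh:mm:ss; normalize to seconds.
    parts = [int(p) for p in stamp.split(':')]
    days = parts[0] if len(parts) == 4 else 0
    h, m, s = parts[-3:]
    return days * 86400 + h * 3600 + m * 60 + s

cpu = to_seconds('2:02:10:40')   # 180,640 s of CPU time
wall = to_seconds('03:36:39')    # 12,999 s of wallclock
print(round(cpu / wall, 1))      # 13.9 effective cores of the 20 requested
```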
The job was an SGE job run on a compute cluster with Ubuntu 16.04.3. The qsub job script was:
@nick-youngblut, I have benchmarked runtime performance with SILVA vs. Greengenes, and it seems that your long runtimes are almost entirely due to the size of SILVA. The SILVA classifier takes as much as 30X longer to classify than the Greengenes classifier! (I only tested on a small test set because SILVA takes too long, but we can extrapolate that the difference will be even more pronounced on a dataset of your size.)
A rough comparison of classification accuracy between SILVA and Greengenes on mock community data indicates that Greengenes actually provides better recall and precision than SILVA at the species level, though SILVA does better at the genus level. (That comparison is a bit old, uses the prior SILVA release, and is in need of updating, but as a rough guide it indicates that Greengenes and SILVA perform similarly.)
This is in no way an endorsement of one reference database over another; I am merely pointing out that if you have a very large set of query sequences and runtime is a priority, Greengenes will provide similar accuracy with much less runtime. On small datasets, the extra runtime of SILVA matters much less.
I thought Greengenes wasn’t maintained anymore. According to Balvočiūtė and Huson 2017, the latest release of Greengenes was in 2013, while the latest SILVA release is from 2016 (or maybe there’s been a release this year?). Will Greengenes ever be updated again, or is the 2013 release the final release?