Optimizing `qiime feature-classifier classify-sklearn`

nick-youngblut · November 30, 2017, 3:49pm

I'm running q2cli version 2017.10.0 on Ubuntu 16.04.3. When I run qiime feature-classifier classify-sklearn on a dataset consisting of ~40k sequence variants with silva-119-99-515-806-nb-classifier.qza for the classifier, the job takes many hours even though I'm using --p-n-jobs 20. I'm wondering why the default --p-reads-per-batch is 262144. Given the high reads-per-batch, it seems that multi-threading (--p-n-jobs) won't actually be used unless --p-reads-per-batch is reduced to something reasonable (e.g., 1000 reads per batch). For instance, with my dataset of ~40k sequence variants, there would only be one batch. When I run qiime feature-classifier classify-sklearn on my dataset, only one thread is really used even though I use --p-n-jobs 20. I've tried using --p-reads-per-batch 1000 --p-n-jobs 20, which results in all 20 threads being used, but processing ~40k reads still takes hours.
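The batch arithmetic behind that observation can be sketched roughly as below. It is only an illustration, assuming the reads are cut into fixed-size batches and each batch is handled by at most one job at a time; `n_batches` and `busy_jobs` are hypothetical helper names, not part of q2cli.

```shell
# Hypothetical helpers for illustration only; not part of q2cli.

# Number of batches = ceil(n_reads / reads_per_batch), in integer arithmetic.
# $1 = number of reads, $2 = reads per batch
n_batches() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Jobs that can be busy at the same time = min(n_jobs, number of batches).
# $1 = number of reads, $2 = reads per batch, $3 = n_jobs
busy_jobs() {
  batches=$(n_batches "$1" "$2")
  if [ "$batches" -lt "$3" ]; then
    echo "$batches"
  else
    echo "$3"
  fi
}

busy_jobs 40000 262144 20   # default batch size: a single batch, so one busy job
busy_jobs 40000 1000 20     # 40 batches, enough to keep all 20 jobs busy
```

Under these assumptions, the default batch size leaves 19 of the 20 requested jobs idle for a ~40k-read input.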

My questions are: 1) why is the default --p-reads-per-batch so high? 2) why does qiime feature-classifier classify-sklearn take so long, even when using many threads?

Nicholas_Bokulich · November 30, 2017, 5:24pm

Hi @nick-youngblut,

Thanks for posting! Sorry to hear that the classifier has been giving you some grief. It sounds like you have a large number of query sequences — are these OTUs or sequence variants? What reference database are you using? Out of curiosity, what type of samples are you analyzing?

This is what is proposed in this forum post to achieve meaningful parallelism.

@BenKaehler may have some insight into the rationale behind this setting.

Short answer: 40k is a very large number of features and will take a long time to run with any method.

Long answer: it is true that the naive Bayes classifier implemented by RDP is faster (though the gap narrows as more reference sequences are added) because it is written in Java. We have worked to optimize accuracy over speed, because we find that methods like dada2/deblur tend to greatly reduce feature counts by weeding out spurious observations, and thus the runtimes of downstream steps (including sequence classification) are greatly reduced, to the extent that parallelization isn't even necessary. Query sequences are only likely to run into the tens of thousands with 1) very large studies or 2) OTU picking without aggressive quality control, so classification runtime hasn't been much of an issue for most users.

Is your concern that parallelization is not reducing runtime to the expected degree?

nick-youngblut · November 30, 2017, 6:30pm

Thanks for the quick response! The classifier database that I'm using is silva-119-99-515-806-nb-classifier.qza. My dataset consists of only 2 MiSeq runs (2x250) of human feces microbiome data. I generated sequence variants (SV) with dada2 on each MiSeq run independently, then merged the sequence variant tables. The resulting SV table contains ~40k features.

I'm going to soon process a feces microbiome dataset consisting of >3200 samples spanning dozens of MiSeq runs, and I'm worried that taxonomic classification is just not going to scale well (at least, when using qiime feature-classifier classify-sklearn). Of course, the number of SVs shouldn't be exponentially or even linearly expanded from my current 2-MiSeq run dataset, but still, it will be a good number more SVs than 40k.

I've been doing a lot more metagenomics than 16S work lately, but given that classifiers like centrifuge can process millions of reads very rapidly, I'm surprised that qiime feature-classifier classify-sklearn takes so long. I don't remember the RDP Classifier taking so long, but it's been a while since I've used it.

I just wanted to make sure that I was running qiime feature-classifier classify-sklearn with the optimal settings for --p-reads-per-batch. The same goes for --p-pre-dispatch, although I don't really understand that option.

colinbrislawn · November 30, 2017, 8:06pm

Have you considered using one of the other feature classifiers in QIIME 2? There is the vsearch classifier, which does a search followed by Last Common Ancestor (LCA) inference. It's fully multithreaded and may be much faster.

https://docs.qiime2.org/2017.11/plugins/available/feature-classifier/classify-consensus-vsearch/

I also think that search + LCA is much easier to explain and defend to traditional biologists than the Bayesian classifier, but that's just me.

BenKaehler · December 2, 2017, 7:53pm

@nick-youngblut

--p-reads-per-batch was set very high to effectively turn that feature off unless you want to use it. At the moment load balancing is the responsibility of the user. There is an open ticket to at least warn users if --p-reads-per-batch is too high for their selected number of processes. We could extend this perhaps to make --p-reads-per-batch automatic by default, as it does seem to cause a lot of confusion.
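As a rough sketch of what doing that load balancing by hand could look like: one simple choice is to give every process about the same number of reads, i.e. ceil(n_reads / n_jobs). This is an assumption for illustration, not a description of what an automatic default would do, and `balanced_batch_size` is a hypothetical helper name.

```shell
# Hypothetical helper for illustration only; not part of q2cli.
# Batch size that gives each process one roughly equal share of the reads:
# ceil(n_reads / n_jobs), in integer arithmetic.
# $1 = number of reads, $2 = n_jobs
balanced_batch_size() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

balanced_batch_size 40000 20   # 2000 reads per batch, one batch per process
balanced_batch_size 33591 20   # 1680 reads per batch

# The result could then be passed on the command line, for example:
#   --p-n-jobs 20 --p-reads-per-batch "$(balanced_batch_size 40000 20)"
```

Smaller batches than this (such as 1000) also keep every process busy; the trade-off between per-batch overhead and balance is not something this sketch measures.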

Your run times do seem very long though. Does your machine have enough memory to run 20 processes concurrently?

@colinbrislawn

Sadly, our testing (which will be published soon) indicates that classify-consensus-vsearch scales in the same way as classify-sklearn with the number of query sequences, so making that switch may not be a quick fix.

I speculate that classify-consensus-vsearch may be less hungry for memory, though, so if memory is the bottleneck it may help. I haven't checked that though.

I can't really comment on whether vsearch is easier to explain to any individual than naive Bayes, but I should correct your post in that the naive Bayes classifier is not a "Bayesian" technique, at least not in the sense of the Bayesians vs Frequentists feud. It is a machine learning technique. Sorry for being pedantic but people get funny about these sorts of things and misconceptions have a way of sticking.

colinbrislawn · December 2, 2017, 8:58pm

Thanks for the feedback Ben. I'm familiar with how the RDP classifier works, but I have to admit that I have not read up on classify-sklearn.

Who knew!

nick-youngblut · December 3, 2017, 12:20am

I'm using a server with 1TB of memory, and classify-sklearn seems to be using 10s of GB for the jobs I'm running, so memory doesn't seem to be the issue.

Thanks for clarifying the "Bayes" versus "Bayesian"!

A member of my lab is working on a sequence classifier based on Fourier transformation. Based on his testing, it's ~100% accurate down to the genus level and can process 1000s of 16S V4 sequences in a couple of minutes (if not faster). Hopefully, he will publish it within the next few weeks. Maybe it could be added as an alternative to the naive Bayes classifier method (or vsearch method).

Nicholas_Bokulich · December 3, 2017, 12:31am

Thanks @nick-youngblut! Your runtimes are still sounding off; I wonder if parallelization is not functioning correctly. Our benchmarks found that the naive Bayes classifier could classify 2611 seqs/min, though this was with a smaller reference database (Greengenes rather than SILVA; SILVA takes a good deal more time and memory, though I have not benchmarked with it and so don't have solid runtime estimates). Even if we assumed that SILVA would take 10X the runtime, that would come out to about 150 min on a single job for 40k seqs. Do you have the exact runtimes for 20 jobs vs. 1 job with your 40k seqs?
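For reference, the back-of-the-envelope estimate above can be reproduced as follows. The 2611 seqs/min rate and the 10X factor are the figures quoted in this post (the 10X is a guess, not a measurement), and `est_minutes` is a hypothetical helper name.

```shell
# Hypothetical helper for illustration only.
# Estimated single-job runtime in minutes, rounded down:
#   n_seqs * slowdown / seqs_per_min
# $1 = number of query sequences, $2 = benchmark rate (seqs/min), $3 = slowdown factor
est_minutes() {
  echo $(( $1 * $3 / $2 ))
}

est_minutes 40000 2611 1    # 15  (Greengenes benchmark rate)
est_minutes 40000 2611 10   # 153 (assuming SILVA is 10X slower)
```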

Certainly! We aim to implement an array of the best open-source methods in QIIME 2, so if this classifier is as good as you say, it fits the bill. We would need it benchmarked against the methods currently in QIIME 2 (this would be very straightforward if your colleague uses our test datasets and evaluation framework). You can have your colleague contact me directly and we can discuss further.

nick-youngblut · December 5, 2017, 12:24pm

"Your runtimes are still sounding off"

Here's my runtime, memory, and CPU usage for a classify-sklearn job on 33,591 sequence variants (SVs generated with dada2):

Job 731519 (taxonomy) Complete
User             = nyoungblut
Queue            = long.q@node443
Host             = node443
Start Time       = 12/05/2017 09:29:24
End Time         = 12/05/2017 13:06:03
User Time        = 2:01:55:44
System Time      = 00:14:55
Wallclock Time   = 03:36:39
CPU              = 2:02:10:40
Max vmem         = 49.206G
Exit Status      = 0
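A quick way to read that report is sketched below. It assumes the times are formatted as D:HH:MM:SS (or HH:MM:SS) and that the job had 20 slots, matching --p-n-jobs 20; `to_seconds` is a hypothetical helper name.

```shell
# Hypothetical helper for illustration only.
# Convert "D:HH:MM:SS" or "HH:MM:SS" to seconds. awk is used so that
# zero-padded fields such as "09" are read as decimal numbers.
to_seconds() {
  echo "$1" | awk -F: '
    NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }
    NF == 3 { print $1 * 3600 + $2 * 60 + $3 }'
}

cpu=$(to_seconds "2:02:10:40")   # CPU time from the report
wall=$(to_seconds "03:36:39")    # wallclock time from the report
slots=20                         # assumed number of slots for the job
n_seqs=33591                     # sequence variants classified

# Average number of busy cores, multiplied by 10 to keep one decimal place.
echo "busy cores x10:      $(( cpu * 10 / wall ))"
# Share of the requested slots that was actually used, in percent.
echo "slot utilization %:  $(( cpu * 100 / (wall * slots) ))"
# Wallclock throughput across all jobs, in sequences per minute.
echo "seqs per minute:     $(( n_seqs * 60 / wall ))"
```

With these numbers the job kept roughly 14 of 20 cores busy on average (about 69% utilization) and classified about 155 sequences per wallclock minute.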

The job was an SGE job run on a compute cluster with Ubuntu 16.04.3. The qsub job script was:

#!/bin/bash
#$ -N taxonomy
#$ -pe parallel 20
#$ -l h_vmem=5G
#$ -l h_rt=24:0:0
#$ -o /ebio/abt3/nyoungblut/SGE_out
#$ -j y
#$ -cwd

CONDA_INSTALLATION="/ebio/abt3_projects/software/miniconda3"
QIIME2_ENV="qiime2"

export PATH="$CONDA_INSTALLATION/bin":$PATH
export PATH="$CONDA_INSTALLATION/envs/$QIIME2_ENV/bin":$PATH
export LC_ALL=C.UTF-8
export LANG=C.UTF-8

qiime feature-classifier classify-sklearn \
  --i-classifier /ebio/abt3_projects/databases/leylab16s/classifiers/silva-119-99-515-806-nb-classifier.qza \
  --i-reads rep-seqs_merged_r5k.qza \
  --o-classification taxonomy_r5k.qza \
  --p-n-jobs 20 \
  --p-reads-per-batch 1000

I haven't done a direct comparison between silva-119-99-515-806-nb-classifier.qza and gg-13-8-99-515-806-nb-classifier.qza, so I'm not sure if Greengenes would be much faster.

Nicholas_Bokulich · December 5, 2017, 2:18pm

Thanks @nick-youngblut! Yeah those runtimes are a couple orders of magnitude slower than my benchmarks (single job, greengenes). I've opened this issue to sort this out. Thanks for reporting!

Greengenes is something like a quarter the size of SILVA, so would be a good bit faster but that's probably not the only cause of this runtime disparity.

Nicholas_Bokulich · December 6, 2017, 5:05pm

@nick-youngblut, I have benchmarked runtime performance with SILVA vs. greengenes and it seems that your long runtimes are almost entirely due to the size of SILVA. The SILVA classifier takes as much as 30X longer to classify as the Greengenes classifier! (I only tested on a small test set because SILVA is taking too long, but we can extrapolate that this will be even more pronounced in a dataset of your size)

A rough comparison of classification accuracy between SILVA and Greengenes on mock community data indicates that Greengenes actually provides better recall and precision than SILVA at species level, though SILVA does better at genus level. (That comparison is a bit old, uses the prior SILVA release, and is in need of updating/improvement but at least as a rough comparison indicates that greengenes and SILVA perform similarly).

This is in no way an endorsement of one reference database over another, but merely pointing out that if you have a very large set of query sequences and runtime is a priority, then greengenes will provide similar accuracy with much less runtime. On small datasets, runtime with SILVA will be much less important.

I hope that helps!

nick-youngblut · December 6, 2017, 8:25pm

I thought Greengenes wasn't maintained anymore. According to Balvočiūtė and Huson 2017, the latest release of Greengenes was back in 2013, while SILVA is 2016 (or maybe there's been a release this year?). Will Greengenes ever be updated again, or is the 2013 release the final release?

wasade · December 6, 2017, 8:32pm

Hi @nick-youngblut, it is accurate that we have not released an update of Greengenes in a few years. We are still intending to update it. I do not have a time estimate on hand unfortunately.

thermokarst · December 22, 2017, 5:54pm

QIIME 2 2017.12 is now out, and includes an autotuner for reads-per-batch.