Performance and running time of classify-consensus-blast+

steff1088 · November 27, 2017, 2:19pm

Hi everybody,

I was trying to use the BLAST+ consensus taxonomy classifier to run six 16S datasets and I was unsure about a couple of things.

First, the running time seems to be quite high with the configurations I am using. Here my inputs: My query sequences are quite big with ~60 MB - I am not sure if that might pose a problem?
The reference is the 97_otus file from SILVA and I am applying the consensus_taxonomy_all_levels set from SILVA. I have run the command and it has not concluded after 3 days yet. Second, and that might be related to the long running time, the configurations I am using differ from a recommendation deemed to be optimal I found in this great source: Optimizing taxonomic classification of marker gene amplicon sequences [PeerJ Preprints]

I am using:
--p-perc-identity 0.97
--p-min-consensus 0.99
--p-maxaccepts 1 \

But the authors recommend for optimal performance of the classifier:
--p-perc-identity 0.8
--p-min-consensus 0.51
--p-maxaccepts 10 \

How do these configuration parameters influence the running times individually? Unfortunately, the rate given in the paper does not apply for my configuration.
There is no option of multi-threading with this classifier and I really like the solid performance evaluated in the study for consensus-blast+. Also, I would like to keep the 97% identity threshold.. Any helpful comments would be more than great!

Cheers,
steffen

Nicholas_Bokulich · November 27, 2017, 2:32pm

Hi @steff1088,
Thanks for posting! (and thanks for checking out that pre-print first to compare runtime results and parameters)

I have not actually benchmarked the effect of individual parameters on runtime, but anecdotally speaking this should not significantly impact runtime.

A large number of sequences will result in a linear increase in runtime, so this of course may be part of the delay. The bigger issue (and what I suspect may be the problem here) is that a larger reference database (e.g., SILVA) will consume much more memory and increase runtime. How much memory is available on your machine, and can you check current memory usage?

Have you considered the classify-consensus-vsearch classifier? As you will know from reading that preprint, its behavior is a little bit different from blast+ but achieves fairly similar performance under most conditions. More importantly here, it does support multi-threading if you think that you are being CPU-bound rather than memory-bound.

That benchmark was based on greengenes, not SILVA, so the slope and intercept should be quite different (especially if you are using longer querysequences). You could perform a quick benchmark, though, to extrapolate your expected runtime (if this is not a memory issue): measure classification runtime with 1, 10, 1000, 10000 query sequences. The slope should be linear, in which case you can predict the runtime for your full set of query sequences.

I hope that helps!

steff1088 · November 29, 2017, 8:46pm

Hi @Nicholas_Bokulich,

thanks for coming back to me with these very helpful comments. I basically spent the last days following up on your suggestions. The vsearch classifier seems to be a good option and I am running it currently on 5 threads (also using SILVA's 7 level taxonomy instead of all). Though, it has not concluded yet after 2 days.. Now another post made me think that something else is wrong with my data: (Merged sequence 100%OTU picking - #3 by thermokarst). Our Illumina data was merged with Casper and demultiplexed, quality-checked outside of q2. So, in the post I learned that paired-end pre-joined sequences cannot be used in the qiime 2 workflow prior to 2017.11 is that right? Might that explain my large amounts of sequences and the inability of the classifier to conclude in reasonable time? Also, I am having trouble with alignment using MAFFT as well which is quickly running out of memory (just in case this might be related).

What would be the best way forward in order to account for the pre-merged status of my sequence data using the new 2017.11 release?

Thanks so much!
steffen

Nicholas_Bokulich · November 29, 2017, 9:18pm

They should not be passed to dada2, which needs to operate on quality scores from forward and reverse reads separately. deblur and OTU picking (with q2-vsearch) are fine, as far as I know, because they do not deal with the quality scores.

How many sequences do you have? That does explain everything (very long classifier runtime, MAFFT issues). Note that doing demultiplexing and dada2 in QIIME2 might result in fewer sequences, as many of the OTUs that you are detecting in your current pipeline may represent spurious sequences (even if you are using dada2, pre-merging is probably throwing off its error detection).

Keep an eye on the post you linked to, and any other related posts on the forum — that post is "pending development", meaning an update will be given once the new release is available.

I hope that helps!

steff1088 · November 29, 2017, 9:50pm

After dereplicating with vsearch dereplicate-sequences and closed-reference clustering by vsearch cluster-features-closed-reference (SILVA's 97_otus) at 97% identity I looked at the output tables of 3 sample sets (containing 10 samples respectively) which I am each using as input for the vsearch classifier. Here the sequence and feature numbers:

Sample set 1
5,795 features
559,580 sequences

Sample set 2
2,919 features
733,891 sequences

Sample set 3
7,338 features
299,312 sequences

The rep-seq.qza files coming out of the dereliction are 60-80 MB big. I dont have the individual quality scores for the samples and so I can't use DADA2 to reduce the sequence amount potentially. In the moving pictures tutorial I can read that Deblur does apply quality-filtering based on quality scores, which would disqualify my samples from using it. I will be looking out for the new release and follow the post as you suggested!

Thanks!

Nicholas_Bokulich · November 29, 2017, 10:13pm

Wow, I haven't seen so many features since the last time I used OTU picking . dada2/deblur will not necessarily reduce the number of features (some users see an increase because these methods essentially detect 100% OTUs, minus erroneous sequences), but with this many features my guess is that you would see a large reduction. But I could certainly be wrong — what kind of samples are you analyzing?

Or you could do things the way we used to and apply an abundance filter to remove rare OTUs, which are most likely to be spurious sequences. This can be done with filter-features. Even just removing singletons and doubletons would probably cut your sequence counts in half and save you heaps of trouble without really affecting beta diversity results! (I'd strongly recommend this approach for OTU picking)

I believe deblur uses a static error profile, not actually the Q scores from each sequence, so you should be able to use it with your merged reads — @wasade can you pls confirm? @steff1088 if this is true, would you mind reporting (in a new post) the tutorial sentence that says otherwise? And we will get that fixed.

Now I'm a little confused — and maybe I misinterpreted this — but it sounds like you have 3 different sequencing runs that you are analyzing together, but you are picking OTUs separately? Assuming there would be lots of shared features between these sample sets, clustering together would also reduce the total number of features.

I hope that helps!

wasade · November 29, 2017, 10:43pm

That's correct, Deblur does not use PHRED scores but it is assumed the data have been quality filtered (e.g., q2-quality-filter). Deblur processes samples independently when --p-min-reads is set to 1; combining sample sets or processing the sample sets separately will produce an identical result. The --p-min-reads parameter simply drops low abundance OTUs across the samples at the time of execution, and if setting it to 1 for combining sample sets, you may want to apply a low abundance filter as @Nicholas_Bokulich suggests after merging the processing runs.

Best,
Daniel

jairideout · November 30, 2017, 7:10pm

The QIIME 2 2017.11 release has expanded support for analyzing paired end reads! See the paired end reads community tutorial for more details.

steff1088 · December 2, 2017, 5:01am

Thank you very much for your support @Nicholas_Bokulich @wasade . The samples are 16S Illumina reads and I am running subsets to test the workflow first.

I used qiime feature-table filter-features to remove doubletons and singletons and my sequence numbers shrunk drastically. Now, before I would apply this command to all my samples I wanted to make sure my overall workflow seems right:

I imported already merged, pair-end, and quality-filtered data in .fna format.
Dereplication by qiime vsearch dereplicate-sequences. This resulted in a feature table and dereplicated sequences.
I would integrate qiime vsearch cluster-features-open-reference with q2-2017.11 which would generate a clustered table and clustered sequence file. Here a question: Could I theoretically use taxa barplot here already along with SILVA's otu.qza as taxonomy file?
To remove the singletons and doubletons, I applied qiime feature-table filter-features after which I downloaded the .csv file of the frequency per feature filtered table. Then, I reformatted this file (.tsv) and used it as input for the qiime feature-table filter-seqs command resulting in a filtered sequence file. Here I used for both commands the clustered table and sequence file from the open-reference clustering, is that right? I did this instead of running through deblur.
Using qiime feature-classifier classify-consensus-vsearch to classify against 16Sonly_consensus_taxonomy_7_levels.qza from SILVA. As reference read inputs I used the otu.qza file as in the cluster step. My input query was the dereplicated, OR-clustered, filtered sequence file, that decreased significantly in terms of sequence reads by now.

To move on, I would basically use the output taxonomy and the filtered feature table file to make taxa barplot, correct? I can run this pipeline and get some sequence/feature numbers to check, I just wanted to verify this workflow beforehand in order to avoid making a naive mistake somewhere.

Nicholas_Bokulich · December 4, 2017, 4:32pm

Hi @steff1088,
Your workflow looks correct to me; there is also a community tutorial on OTU picking here that may be useful in case you have not already seen it.

I think not, since this is open reference and many of the OTUs will have no associated SILVA ID/taxon. If you used closed reference, that should be possible.

You are using the currently recommended workaround but just FYI soon that feature table could be passed to filter-seqs directly. See this issue to track progress.

correct

Yes, as long as these OTUs correspond to the 16Sonly_consensus_taxonomy_7_levels.qza.

Correct! and all other downstream analyses.

I hope that helps! Let us know if you run into any trouble (probably in a separate thread).

Happy QIIMEing

thermokarst · December 22, 2017, 5:56pm

QIIME 2 2017.12 is now out and includes the ability to use a feature table for filtering with filter-seqs!