classify-consensus-blast VS classify-consensus-vsearch

MaestSi · January 21, 2020, 8:36am

Dear all,
I am comparing classification output of Nanopore reads from a mock community by classify-consensus-blast to classify-consensus-vsearch, as I read on the forum that they should provide similar results. I am running them using the most similar set of parameters, namely:

qiime feature-classifier classify-consensus-blast
--i-query rep-seqs.qza
--i-reference-reads $DB
--i-reference-taxonomy $TAXONOMY
--p-perc-identity 0.77
--p-query-cov 0.3
--p-maxaccepts 1
--o-classification taxonomy.qza

and

 qiime feature-classifier classify-consensus-vsearch \
    --i-query rep-seqs.qza  \
    --i-reference-reads $DB \
    --i-reference-taxonomy $TAXONOMY \
    --p-perc-identity 0.77 \
    --p-query-cov 0.3 \
    --p-top-hits-only \
    --p-maxaccepts 1 \
    --p-strand 'both' \
    --p-unassignable-label 'Unassigned' \
    --p-threads $THREADS \
    --o-classification taxonomy.qza

However, I found 9% (Blast) VS 56% (Vsearch) Unassigned rate, which is a huge difference. I know Vsearch should do a global alignment, while Blast is a local alignment tool, however, using 30% query coverage I would have expected to get similar results. Moreover, would it be possible in a future release to expose the -num_threads parameter for Blast?
Thanks in advance,
Simone

Nicholas_Bokulich · January 21, 2020, 5:25pm

Hi @MaestSi,
That is a bit surprising. What do the query seqs look like?

More importantly, what does the taxonomy look like? VSEARCH and BLAST behave a little bit differently — notably, maxaccepts=1 with VSEARCH will actually report 2 hits if strand=both, and if these two hits have totally different taxonomies (or the taxonomy is not multi-level), you will get "unassigned". The maxhits parameter will be added in the next release to control this behavior if you are trying to do a top-hit VSEARCH (which it looks like you are doing).

In the meantime if you think that's the issue you could run VSEARCH directly to diagnose.

I opened an issue here — contributions are very welcome!

MaestSi · January 22, 2020, 9:33am

I uploaded the taxonomy barplots I obtained with the 2 classifiers.
taxa-bar-plots_Blast.qzv (362.5 KB) taxa-bar-plots_Vsearch.qzv (332.7 KB)

So, possibly, that is the explanation for what I am finding. Basically, I am looking for 2 hits (one for the fw strand and one for the reverse strand) for each query, and if one of the two doesn't match the database, I get an Unassigned. This may be the reason for > 50% Unassigned, right? Good that maxhits parameter will be added in the next release!

Thank you also for this!
Simone

Nicholas_Bokulich · January 22, 2020, 3:21pm

Yes, this seems like a likely cause, especially since you are setting coverage and % identity so low it seems quite possible.

MaestSi · February 29, 2020, 10:17am

Hi @Nicholas_Bokulich, I tried out Vsearch and it seems to work much better now. For sure it is faster and the percentage of Unassigned reads is reduced below 50%, due to --p-maxhits 1.
I would like to ask a few questions. Since I set --p-maxhits to 1, is the consensus taxonomy assignment still performed among all --p-maxaccepts hits? I saw slightly different results when setting --p-maxaccepts to 1 or to 100, so I guess it is doing something, but I didn't see the reduction in species-level assignments that I saw with Blast after modification of the same parameter. Moreover, is the --p-top-hits-only parameter doing something when --p-maxhits 1 is set? I did a quick test and I saw no difference at all.
Thanks in advance,
Simone

Nicholas_Bokulich · February 29, 2020, 3:44pm

Hi @MaestSi, thanks for testing out the new params!

No, when setting maxhits to 1 q2-feature-classifier's consensus classification would not occur (it would just find the ID of that top hit but perform no consensus), because there are not multiple hits to find consensus from!

maxaccepts still operates in this case: maxaccepts determines how many seqs to accept before stopping the search, so effectively how many seqs go on to the alignment step... maxhits will then take the top N of these. From the VSEARCH manual:

Maximum number of hits to accept before stopping the search. The default value is 1. This option works in pair with --maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptation criteria, it is accepted as best hit and the search process stops for that query. If --maxaccepts is set to a higher value, more hits are accepted. If --maxaccepts and --maxrejects are both set to 0, the complete database is searched.

Yep, because BLAST does not have a separate maxhits parameter to adjust how many alignments to keep, so maxaccepts determines how many hits are returned.

No, if you have 1 hit and take only top hits then you only have 1 to choose from!

In practice these new parameters allow us to fine-tune operation of q2-feature-classifier, but in slightly different ways:

top-hits-only allows you to increase max-accepts to do search/alignment across a larger number of candidate seqs, but then only do consensus classification among those that tie for top hit. This is my favorite new option.
maxhits add some subtle functionality. While previously maxaccepts was the main way to control both how many searches were aligned and how many alignments were tossed to consensus, maxhits separates these operations so, e.g., you can accept more searches to check the alignment, then pass only the top N to q2-feature-classifier for consensus. ADDITIONALLY, maxhits fixes the issue I described above (that maxaccepts accepts N seqs in each direction when strand=both, so can cause trouble with mixed-orientation data) and allows unambiguous top-hit searching when the need arises.

I hope you enjoy the new params! Does this address your functionality needs? Or do you still need BLAST multithreading?

MaestSi · February 29, 2020, 5:53pm

I surely enjoy the new parameters, and they address most of my functionality needs! Still, it would be nice to enable consensus taxonomy assignment by Vsearch also in case of mixed orientation data. In fact, if I understood correctly, with mixed orientation data I am obliged to set maxhits=1, and this automatically excludes the possibility to do consensus taxonomy assignment. I understand Vsearch should work even better than BLAST, but still I think BLAST multi-threading would be an interesting feature for Nanopore users, as that is the aligner used by the "official" EPI2ME 16S workflow.
Thanks for all your kind explanations,
Simone

Nicholas_Bokulich · February 29, 2020, 9:25pm

You can! These improvements are in part to address this need.

Not at all! You have it the other way around; rather, the addition of maxhits to this method addresses the consensus classification pseudo-"bug" that was caused by having mixed-orientation seqs.

maxhits allows you to explicitly specify how many reads you want to use for consensus, vs. how many hits you want to consider (previously, maxaccepts did both but if strand=both, one hit was taken from each direction and that would spell trouble if those hits passed perc-identity and made it into the final hits but belonged to different taxonomic groups)

For example, say you want to consider the 10 best hits for consensus classification. With maxaccepts=10 that works fine if your seqs/ref are not in mixed orientations, but with mixed orientations you could end up with 20 hits. So in that case you can instead set maxaccepts=10 (or any number if you want to consider more!) and also maxhits=10... with strand=both you will pull 10 accepts from each direction = 20, but only the 10 best will be taken. This fixes the issue where, e.g., you get 10 accepts in each direction and perc-identity is set low enough that many of these pass but only the reads in one direction are actually good matches... only the best hits are taken.

The top-hits-only and weak-id options are other additional routes to control this behavior, and are worth exploring to optimize your pipeline. But all in all I believe these new options fit the bill... give it a spin and please let me know if you have other ideas for improvement!

Understood! That issue remains open and we will probably fix at some point, but it is low priority for us at the moment (only 2 users have asked for this feature, and the sense I get is that classify-consensus-blast is not commonly used). One day...

MaestSi · March 1, 2020, 7:24am

Now I finally got it. It took me a while but it was worth it!

I verified it works perfectly also for the consensus taxonomy assignment, so thank you again. If I have any other ideas/requests I'll let you know!
Simone