Duel of BLAST+ & Vesearch Classifiers!

Hi there,
I used the same command in both Vsearch and Blast+ classifiers to classify my data. I expected to get the same result but surprisingly Blast classification was so better than Vsearch classifier.

qiime feature-classifier classify-consensus-blast
–i-query rep-seqs.qza
–i-reference-reads refseqs.qza
–i-reference-taxonomy reftaxonomy.qza
–p-maxaccepts 10
–p-strand both
–p-perc-identity 0.8
–p-unassignable-label unassigned
–o-classification classification.qza
–verbose

and

qiime feature-classifier classify-consensus-vsearch
–i-query rep-seqs.qza
–i-reference-reads refseqs.qza
–i-reference-taxonomy reftaxonomy.qza
–p-maxaccepts 10
–p-strand both
–p-perc-identity 0.8
–p-unassignable-label unassigned
–o-classification classification.qza
–verbose

In the forum, I found a topic. He has said “I know Vsearch should do a global alignment, while Blast is a local alignment tool”. I did not understand what do global and local alignment mean! Could you please enlighten up it?

Report: Blast processed it less than 30 min, but Vesearch did it more than 12 hours. In contrast, Vesearch showed query matching percentage (e.g. over 70%) at the end, but Blast did not. The both classification .qza files’ size were so similar, more than 800 Mb.

Why they showed huge differences in these results, as you see? Which one do you offer to use? Which one is widely used and popular? Do they have special applications that make them unique? What is their usage?

Qiime2 version: 2020/08
Gene type: Functional

Thanks
Qiimer

Hi,
as far as I know, the main difference is that a local aligner just tries to align a portion of the query sequence and tries to extend the alignment; on the contrary, a global aligner tries to align the whole query sequence. Still, if there are too many differences at the 5’ or at the 3’ of the primary alignment, those regions may be soft-clipped, namely unaligned (decreasing the query coverage). In practice, I don’t know if this implies a huge difference is expected in the results, probably not.
Regarding the results, I think that the differences in the processing time could be explained by the default value of --p-maxrejects set to ‘all’. That value may be set to a lower value (e.g. 100), in order to limit the number of sequences in the database, presorted based on k-mers in common with the query that are aligned to the query, as suggested in this post .
For example, you may try out these parameters:

--p-maxaccepts 100 \
 --p-maxrejects 100 \
 --p-maxhits $MAX_ACCEPTS \

Also, setting such a low --perc-identity coupled with MAX_ACCEPTS=10 probably results in top hits belonging to very diverse taxa, causing a high-level consensus taxonomic assignment. Since you are setting such a low identity value, you are aligning Nanopore reads, right?

2 Likes

Thanks for the feedback. I am still thinking of and trying to digest your words! I am sure I have a problem in understating the concept of three parameter functions:

–p-maxaccepts 100
–p-maxrejects 100
–p-maxhits $MAX_ACCEPTS \

I think one of the main reasons of more unaligned reads and being a time-consuming process lie in the three parameters. I already increased the value of maxaccepts to 20., and percentage identity to 90. The result was higher unassigned reads. I executed the command with Blast classifier because it provides a fast result.
Can they improve the higher unclassified reads rate?

*I took advantage of Illumina sequencing technology, HiSeq 2500 System.

Qiimer

I run Vesearch with modified commands rather than the previous one.

qiime feature-classifier classify-consensus-vsearch
–i-query refseqdsrB3.qza
–i-reference-reads seqdatabase.qza \
–i-reference-taxonomy taxdatabase.qza \
–p-maxaccepts 100 \
–p-perc-identity 0.8 \
–p-maxrejects 100 \
–p-maxhits 1
–p-strand both
–p-unassignable-label unassigned
–output-dir classification.qza
–verbose

It is around 15 min in 1% search! I expected it would be improved but no development is made so far. I will share the classification result with you later.

Qiimer

After 12 hours with the modified command it is the result! Just 17%! It is beyond of my expectation!
I got stuck!
I have no any clue how to optimize and improve the result.

Qiimer

Have you checked that all your sequences have PCR primers? Which database are you using? Have you already performed merging of reads from the same fragment (if you sequenced in paired-end mode and reads overlap to each other) and denoising/OTU picking? That may speed up the analysis considerably, since you would be assigning taxonomy only to representative sequences.

I did not remove primer pair from my sequences.
I am using a functional gene, dsrB.
Yes, I merged the both seq and tables.
merging reads are done. The merged percentage starts from 55 to 80, after using DADA2. No Problem these steps! I passed them very carefully.
I got a good result with Blast classifier, but I have not achieved a good one from Vesrach so far.
Qiimer

The processing time of Vsearch would benefit of increasing the number of threads, by specifying a higher value for --p-threads parameter (default is 1). In contrast, Blast is not multi-threaded, since QIIME2 doesn’t support the indexing of the db for the Blast search, which you would perform with makeblastdb command. The high amount of unassigned reads or reads assigned to high taxonomy level would lead me to question whether the database you are using is complete enough (thus requiring such a low min identity value for having not so many unclassified reads). Moreover, the --p-strand both parameter may not be necessary, since Illumina fragments should all start with the fw primer and end with the rev. comp. of rv primer, right? To check that, before dada2, you may perform PCR primers trimming:

qiime cutadapt trim-paired \
--i-demultiplexed-sequences sequences_untrimmed.qza \
--p-cores 24 \
--p-front-f $FW_PRIMER \
--p-front-r $RV_PRIMER \
--p-discard-untrimmed \
--o-trimmed-sequences sequences.qza

A part from this, it may be that Vsearch actually works worse than Blast, in a case where you have low-identity matches.

Thanks for the comment!
I need to ask a few questions regarding the three parameters:

–p-maxaccepts
–p-perc-identity
–p-maxhits \

In the tutorial section, there are some explanations for each, but I could not get them completely. Could you please tell me what are their functions in a plain way? They are confusing for me! maxaccept and maxhit work together but I do not know what are their differences and similarities in practice when they working in a command, and how we can choose a value for identity percentage? what factors maybe affect on it? Does its value affect on maxhit and maxaccept parameters and vice versa?
If I am not mistaken, percentage identity defines similarity cut-off, right? If yes, the default value is 80% which is low for 16S! The threshold is 97%, mostly I have seen.

Thanks a lot.
Qiimer

Hi, this is my understanding of the search process performed by classify-consensus-vsearch.

The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. Then, for each query, target sequences are considered in the order produced by the k-mer-based sorting, and one after one they are pairwise aligned to the query sequence. The first alignment may not pass the --perc-identity and --p-query-cov criteria. In that case, the target sequence counts as a “reject”, and the search process stops after --p-max-rejects are found for that query. On the contrary, if the criteria are met, the target sequence is considered as an “accept”. Moreover, if you set --p-strand-both true and the reverse complement of the query sequence passes the --p-perc-identity and --p-query-cov criteria for another target, both the targets are considered as “accepts”. The search process continues for a query until up to 2* --p-max-accepts are found (the 2 accounts for the two strands). Then, by using the --p-maxhits parameter you can retain only a subset of the --p-max-accepts hits for a query, namely those with the highest similarity (e.g. alignment identity). At this point, the taxonomy of up to --p-maxhits hits is considered, and the taxonomy level at which at least --p-min-consensus of the hits agree is assigned to the query.

I agree the 80% similarity value may be too low for Illumina sequencing, you could consider increasing it up to 97% (or using a Naive-bayes classifier trained on your database). In case you want to try this strategy for comparing results, you may be interested in using these scripts.

1 Like

Thank you very much for the informative reply and your time!

In naive classifier there is one optional input --i-class-weight. What exactly is this? And If I ignore it, what will be exported in the end while no input for my query? And how can I use the output generated with this classifier? I think it does not make sense! Please clarify it! I mean in comparison with Veasreach and Blast+. it looks wierd!
By the way, there are many parameters for this classifier without any explanations!!! I think it gets worse and worse! :face_with_thermometer:

https://docs.qiime2.org/2020.11/plugins/available/feature-classifier/fit-classifier-naive-bayes/

I run only this as a practice. It warns!!!

qiime feature-classifier fit-classifier-naive-bayes \

–i-reference-reads seqdatabase.qza
–i-reference-taxonomy taxdatabase.qza
–o-classifier naive.qza
–verbose

qiime2-2020.8/lib/python3.6/site-packages/q2_feature_classifier/classifier.py:102: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.23.1. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
warnings.warn(warning, UserWarning)

I used fit-classifier-sklearn too. But it gave an error!

qiime feature-classifier fit-classifier-sklearn
–i-reference-reads seqdatabase.qza
–i-reference-taxonomy taxdatabase.qza
–p-classifier-specification text
–o-classifier naive.qza
–verbose

qiime2-2020.8/lib/python3.6/site-packages/q2_feature_classifier/classifier.py:102: UserWarning: The TaxonomicClassifier artifact that results from this method was trained using scikit-learn version 0.23.1. It cannot be used with other versions of scikit-learn. (While the classifier may complete successfully, the results will be unreliable.)
warnings.warn(warning, UserWarning)
Traceback (most recent call last):
File “/home/rabiei/miniconda3/envs/qiime2-2020.8/lib/python3.6/site-packages/q2cli/commands.py”, line 329, in call
results = action(**arguments)
File “”, line 2, in fit_classifier_sklearn
File “/home/rabiei/miniconda3/envs/qiime2-2020.8/lib/python3.6/site-packages/qiime2/sdk/action.py”, line 245, in bound_callable
output_types, provenance)
File “/home/rabiei/miniconda3/envs/qiime2-2020.8/lib/python3.6/site-packages/qiime2/sdk/action.py”, line 390, in callable_executor
output_views = self._callable(**view_args)
File “/home/rabiei/miniconda3/envs/qiime2-2020.8/lib/python3.6/site-packages/q2_feature_classifier/classifier.py”, line 131, in fit_classifier_sklearn
spec = json.loads(classifier_specification)
File “/home/rabiei/miniconda3/envs/qiime2-2020.8/lib/python3.6/json/init.py”, line 354, in loads
return _default_decoder.decode(s)
File “/home/rabiei/miniconda3/envs/qiime2-2020.8/lib/python3.6/json/decoder.py”, line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File “/home/rabiei/miniconda3/envs/qiime2-2020.8/lib/python3.6/json/decoder.py”, line 357, in raw_decode
raise JSONDecodeError(“Expecting value”, s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Plugin error from feature-classifier:

** Expecting value: line 1 column 1 (char 0)**

See above for debug info.

Although it warned, I got the taxonomy data from naive classifier, but I do not know what is it usage? It did not help me out!

Screenshot from 2020-12-19 18-22-29

Thanks @MaestSi
Qiimer

The warning may be probably due to the fact of using two different qiime2 versions, could this be the case?

From Bokulich et al., “In machine learning, class weights or prior probabilities are vectors of weights that specify the frequency at which each class is expected to be observed (and should be distinguished from the use of this term under Bayesian inference as a probability distribution of weights vectors). An alternative to setting class weights is to assume that each query sequence is equally likely to belong to any of the taxa that are present in the reference sequence database.” The latter is what you do if you don’t specify --i-class-weights. In very simple words, when using a uniform prior probability (namely no assumption is made on the taxonomic composition of the sample), a Naive-Bayes classifier is trained by evaluating the probability of the occurrence of all k-mers in each sequence of the reference database. Then, the taxonomic classification of a new query sequence is obtained by identifying the k-mers of the query sequence, multiplying the probability of those k-mers to occurr in each sequence of the database, and to classify the query as the sequence of the database with the highest probability value. But for doing this, I would suggest first extracting only the portions of the database that are informative for your sequences based on the PCR primers you used, otherwise you are going to confound the classifier. The classifier you obtained is meant to be used as an input to the qiime feature-classifier classify-sklearn command (for the --classifier parameter, in particular). You can do all these steps with the scripts I suggested you to use, if you think it would be interesting to check it out.
Simone