Are blastn top hits ordered by sequence identity?

If I am using one of the consensus-based classifiers, and I set maxaccepts to 1, will it give me the best match? I suspect that it is arbitrarily giving me one of the taxa that passes the sequence identity cutoff. I am trying to identify a certain species that is not in the Greengenes database, so I manually added it to the sequence and taxonomy files (using a 16S sequence from NCBI). I am still only getting a genus-level match.


This parameter is passed directly to blastn as the max_target_seqs parameter. My assumption has always been that the best alignment is reported, but this does not seem to be explicitly documented in the blastn manual. Googling suggests that this is indeed what max_target_seqs is meant to do (report the top N hits), but also that the parameter does not work precisely how most folks think, which may or may not be a bug in blastn.

This could be because there are other equally good hits in the reference database (i.e., your query sequence may be a perfect match to a reference sequence). If two sequences have the exact same score and maxaccepts=1, one is probably chosen at random, but I do not actually know (that is blastn behavior, which happens before the consensus assignment done in qiime2).
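One way to check for ties is to run blastn yourself with a larger max_target_seqs and tabular output (-outfmt 6), then count how many subjects share the top bit score for each query. Here is a minimal Python sketch, assuming the default 12-column tabular layout. The function name tied_top_hits and the IDs in the example are made up for illustration and are not part of qiime2 or blastn:

```python
import csv
from io import StringIO

# Default column order of blastn's tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def tied_top_hits(tabular_text):
    """Map each query ID to the list of subject IDs sharing its best bit score."""
    best = {}  # query ID -> (best bit score, [subject IDs with that score])
    for row in csv.reader(StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        score = float(hit["bitscore"])
        top_score, subjects = best.get(hit["qseqid"], (float("-inf"), []))
        if score > top_score:
            best[hit["qseqid"]] = (score, [hit["sseqid"]])
        elif score == top_score:
            subjects.append(hit["sseqid"])  # tie with the current best hit
    return {query: subjects for query, (_, subjects) in best.items()}


# Hypothetical blastn output: refA and refB tie for the top bit score.
example = (
    "q1\trefA\t100.000\t1499\t0\t0\t1\t1499\t1\t1499\t0.0\t2769\n"
    "q1\trefB\t100.000\t1499\t0\t0\t1\t1499\t1\t1499\t0.0\t2769\n"
    "q1\trefC\t99.000\t1499\t15\t0\t1\t1499\t1\t1499\t0.0\t2700\n"
)
print(tied_top_hits(example))
```

If a query maps to more than one subject here, maxaccepts=1 can only return one of them, and which one you get is up to blastn.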

The genus-level hit is probably because the sequence that is being chosen as the top hit does not have a species-level annotation (e.g., in Greengenes it would be annotated like this: “…g__Genus;s__”). That is an issue with the database annotations, not with anything the classifier is doing. There are some other posts on the forum that describe these annotations in more detail.
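To see how deep a given reference annotation actually goes, you can parse the taxonomy string and find the last rank that has a name after its prefix. This is a rough Python sketch, assuming Greengenes-style prefixes (k__ through s__) separated by semicolons. The function name and the example lineage are only for illustration:

```python
# Greengenes-style rank prefixes, from kingdom down to species.
RANK_NAMES = {
    "k": "kingdom", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def deepest_annotated_rank(taxonomy):
    """Return the name of the deepest rank that carries a non-empty label."""
    deepest = "unassigned"
    for field in taxonomy.split(";"):
        prefix, sep, name = field.strip().partition("__")
        if not sep or prefix not in RANK_NAMES:
            continue  # not a "x__Name" field; ignore it
        if not name:
            break  # empty label such as "s__": annotation stops here
        deepest = RANK_NAMES[prefix]
    return deepest


# Example lineage with an empty species field.
lineage = ("k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
           "f__Lactobacillaceae; g__Lactobacillus; s__")
print(deepest_annotated_rank(lineage))
```

Running something like this over the taxonomy file for the reference IDs that hit your query shows whether the top hits can give you a species name at all.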

You could also try classify-consensus-vsearch; vsearch's maxaccepts behavior is much more transparent:

Maximum number of hits to accept before stopping the search. The default value is 1. This option works in pair with --maxrejects. The search process sorts target sequences by decreasing number of k-mers they have in common with the query sequence, using that information as a proxy for sequence similarity. After pairwise alignments, if the first target sequence passes the acceptation criteria, it is accepted as best hit and the search process stops for that query. If --maxaccepts is set to a higher value, more hits are accepted. If --maxaccepts and --maxrejects are both set to 0, the complete database is searched.
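The search order described in that passage can be sketched in a few lines of Python. This is a toy illustration of the idea only, not vsearch's implementation: the identity function uses difflib as a stand-in for a real pairwise alignment, the names toy_search and min_identity are hypothetical, and treating a single 0 as "no limit" for that counter is my assumption (the manual text above only covers the case where both are 0).

```python
from difflib import SequenceMatcher


def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_identity(a, b):
    """Stand-in for pairwise alignment identity (NOT what vsearch computes)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def toy_search(query, targets, min_identity=0.97, maxaccepts=1,
               maxrejects=32, k=8):
    """Return accepted (name, identity) pairs in the order they were examined."""
    query_kmers = kmers(query, k)
    # Rank targets by decreasing number of k-mers shared with the query.
    ranked = sorted(
        targets.items(),
        key=lambda item: len(query_kmers & kmers(item[1], k)),
        reverse=True,
    )
    accepted, rejects = [], 0
    for name, seq in ranked:
        identity = toy_identity(query, seq)
        if identity >= min_identity:
            accepted.append((name, identity))
            if maxaccepts and len(accepted) >= maxaccepts:
                break  # enough hits accepted; stop searching
        else:
            rejects += 1
            if maxrejects and rejects >= maxrejects:
                break  # too many rejects; give up on this query
    return accepted


# Made-up sequences for illustration.
query = "ACGTACGGTCAGTTCAGGCTA"
targets = {
    "unrelated": "TTTTTTTTTTTTTTTTTTTTT",
    "one_mismatch": "ACGTACGGTCCGTTCAGGCTA",
    "exact": query,
}
print(toy_search(query, targets, min_identity=0.9, k=4))
print(toy_search(query, targets, min_identity=0.9, maxaccepts=0,
                 maxrejects=0, k=4))
```

The point to notice is that with maxaccepts=1 the search stops at the first target that passes the threshold, so the result depends on the k-mer ranking rather than on a full comparison of every candidate.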

Your description of classify-consensus-vsearch is different from what I found here:

That quote is from the vsearch manual, not the qiime2 documentation.
