Understanding options for assigning taxonomy

I have been using the classify-consensus-vsearch option recently with PR2 and SILVA, but I'm confused as to how it differs from some past work I have done in BLAST.

In prior work (and in some other literature I have read), we have assigned taxonomy using blastn and then reviewed the output to check "percent identity". Anything >99% was assigned to species, 97% was collapsed to genus level, and so on.

Classify consensus does not give this kind of output. It is my understanding that I could increase "--p-perc-identity", but then I would lose any ASVs that can't be assigned to a lower taxonomic level. With a --p-perc-identity of 0.7, can I be confident that assignments to the species level are accurate?

I'd appreciate any explanations or links to resources. Thanks!

Hi @areaume ,

Good questions :grin:

Note that these % id thresholds are totally arbitrary and do not really reflect biological circumstances. These thresholds were originally based on full-length 16S similarities observed early on in some species, but there is much wider variation in reality, e.g., some species groups are 97% similar while others are significantly more or less... and so on. So for any given hit, these thresholds might not necessarily reflect the given taxonomic lineage.

classify-consensus-vsearch and classify-consensus-blast instead attempt to find consensus between the lineages of all hits to find the most likely lineage. You can check the citation (see qiime feature-classifier --citations) for an explanation and benchmark of the method.
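To make the idea of "consensus between lineages" concrete, here is a minimal Python sketch. It is a simplified illustration and not the plugin's actual code: the function name, the 0.51 default, and the placeholder taxa are all assumptions made for the example.

```python
from collections import Counter


def consensus_lineage(lineages, min_consensus=0.51):
    """Return the deepest lineage prefix shared by at least
    `min_consensus` of the hits (hypothetical helper, not the plugin API)."""
    # Split "A; B; C" style lineage strings into tuples of ranks.
    split = [tuple(r.strip() for r in lin.split(";")) for lin in lineages]
    total = len(split)
    consensus = ()
    level = 1
    while total:
        # Count each lineage prefix down to the current rank.
        prefixes = Counter(s[:level] for s in split if len(s) >= level)
        if not prefixes:
            break  # no hit is annotated this deeply
        best, count = prefixes.most_common(1)[0]
        if count / total < min_consensus:
            break  # hits disagree too much at this rank, so stop here
        consensus = best
        level += 1
    return "; ".join(consensus) if consensus else "Unassigned"


# Made-up lineages for four hits to one query sequence.
hits = ["D; P; G1; S1", "D; P; G1; S2", "D; P; G1; S1", "D; P; G2; S3"]
print(consensus_lineage(hits))
```

In this toy case three of four hits agree on the genus but only two agree on a species, so the lineage is truncated at genus rather than forced to a species label.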

classify-consensus-vsearch does give this kind of output: there is a "search results" output (as of the last few releases of q2-feature-classifier). This can be viewed with the action qiime metadata tabulate if you want to use a "thresholding" approach to determine the lineage as you have done in the past, or if you just want to inspect the report.
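If you do want to reproduce the thresholding approach on the reported hits, a rough Python sketch could look like the following. It uses only the two cutoffs mentioned in the question (>99% for species, 97% for genus) and makes several assumptions: the function name is hypothetical, percent identity is on a 0-100 scale, and the last rank of the lineage string is the species.

```python
def truncate_by_identity(lineage, pident, species_min=99.0, genus_min=97.0):
    """Trim a lineage according to a hit's percent identity.

    Hypothetical helper. Assumes `pident` is on a 0-100 scale and that
    the final rank in `lineage` is the species.
    """
    ranks = [r.strip() for r in lineage.split(";")]
    if pident > species_min:
        return "; ".join(ranks)       # keep the species-level label
    if pident >= genus_min:
        return "; ".join(ranks[:-1])  # collapse to genus
    # Cutoffs for higher ranks are not specified here, so leave that
    # decision to the caller.
    return None


print(truncate_by_identity("D; P; G; S", 99.5))
print(truncate_by_identity("D; P; G; S", 97.8))
```

Keep in mind the caveat above: these cutoffs are arbitrary, so this is mainly useful for comparing against earlier BLAST-based results.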

Correct, raising --p-perc-identity would cost you those ASVs, so I would not set the %id too high. You can instead use the maxhits parameter to limit the number of search results used in the consensus (which will always be the N closest based on kmer frequencies), which will "crowd out" any poorer hits. Or use the "top-hits-only" parameter...

As for whether species-level assignments are accurate with a 0.7 threshold: it depends. If a hit actually only has 70% similarity to your query, then it is probably not the correct lineage. But unless there was some technical error or you are sequencing very unusual samples (martian space dust?), it is very unlikely that the top hit will be only 70% similar to the query. And if it is, then it is even less likely to be the only hit. Chances are that you will hit a few different reference sequences within that %id and the consensus lineage will be determined from among these. You can always use the "search results" output to double check any suspicious taxonomic classifications to see what the reported hits were.
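One way to do that double check is to list the queries whose best hit is still a poor match. The Python sketch below assumes you have already loaded the search results into (query, subject, percent identity) tuples; the function name, the example IDs, and the 90% cutoff are hypothetical and only there for illustration.

```python
def low_identity_queries(hits, min_top_pident=90.0):
    """Map each query whose best hit falls below `min_top_pident`
    to that best percent identity (hypothetical helper)."""
    best = {}
    for query, _subject, pident in hits:
        # Track the highest percent identity seen for each query.
        best[query] = max(pident, best.get(query, 0.0))
    return {q: p for q, p in best.items() if p < min_top_pident}


# Made-up search results: (query ID, reference ID, percent identity).
rows = [
    ("asv1", "refA", 99.2),
    ("asv1", "refB", 97.5),
    ("asv2", "refC", 74.0),
    ("asv2", "refD", 71.3),
]
print(low_identity_queries(rows))
```

Any ASV this flags is worth inspecting by hand before trusting its species-level label.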

I hope that helps clarify!


Thank you @Nicholas_Bokulich for your thorough response! I'd been trying to wrap my head around this for a while and this really helps clarify things.