I’m hoping to get a better understanding of the vsearch classifier. When I perform a search with default parameters, it’s my understanding that:
- for a given query sequence, it is considered a match if it aligns with >= 97% identity to some reference sequence;
- the search process for that same query sequence continues across all other reference sequences.
What I’m wondering about next is two other default parameters: --p-maxaccepts and --p-min-consensus. With the default --p-maxaccepts set to 10, I’m wondering how the program determines which 10 hits to keep. Are they ranked by highest percent identity? Are they the first 10 entries in the list (I wouldn’t think you’d bother searching the entire reference db if you’re just collecting the first 10)?
Can someone also please explain how --p-min-consensus is applied? Does it consider the consensus across all possible query matches (above the %id threshold), or just among the 10 selected?
As I’ve been playing around with these settings, it seems there is a tradeoff between certainty and consensus: as I increase any of the parameters listed here (--p-min-consensus), I think I’m more confident that the resulting match is accurate, but I often lose taxonomic information. For example, for any one particular sequence, if I search using defaults I might get taxonomic info to the Species level, but if I raise --p-maxaccepts to 1000 (instead of the default 10), the same sequence is assigned only to the Genus level.
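To make that tradeoff concrete, here is a toy sketch of how I imagine the consensus step behaving. It is only my assumption of the logic, not the actual q2-feature-classifier or vsearch code: the function name `toy_consensus`, the lineage strings, and the hit counts are all made up for illustration. It walks down the taxonomic ranks and keeps a rank only while the most common lineage at that depth is shared by at least the min-consensus fraction of the accepted hits.

```python
from collections import Counter


def toy_consensus(hit_taxonomies, min_consensus=0.51):
    """Hypothetical helper: deepest lineage shared by >= min_consensus of hits."""
    # Split each semicolon-delimited lineage into its ranks.
    lineages = [[rank.strip() for rank in t.split(";")] for t in hit_taxonomies]
    if not lineages:
        return "Unassigned"
    consensus = []
    max_depth = max(len(lineage) for lineage in lineages)
    for depth in range(1, max_depth + 1):
        # Count lineage prefixes of this depth among hits annotated that deep.
        prefixes = Counter(
            tuple(lineage[:depth]) for lineage in lineages if len(lineage) >= depth
        )
        top_prefix, count = prefixes.most_common(1)[0]
        # The fraction is taken over all accepted hits, not only the annotated ones.
        if count / len(lineages) >= min_consensus:
            consensus = list(top_prefix)
        else:
            break
    return ";".join(consensus) if consensus else "Unassigned"


# Made-up example: 10 accepted hits, 6 of which agree on one species.
top10 = ["k__Bacteria;g__A;s__x"] * 6 + ["k__Bacteria;g__A;s__y"] * 4

# Made-up example: accepting more hits pulls in more congeners.
wider = top10 + ["k__Bacteria;g__A;s__y"] * 7 + ["k__Bacteria;g__A;s__z"] * 7

print(toy_consensus(top10))
print(toy_consensus(wider))
```

In this toy case the 10-hit set resolves to species because 6 of 10 hits agree, while the wider set falls back to genus because no single species reaches 51% of the 24 hits. If the real classifier works anything like this, it would explain what I’m seeing.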
It would seem to me that if I’m after the strongest consensus parameters possible, I’d require a percent match of 100% and retain all possible matches before building consensus. Perhaps that’s a mistake.
I’d appreciate understanding how the default parameters influence the subsequent analyses microbiome folks generally consider. For instance, if I was going to collapse taxa at some level, I think it would be important to know how much you trust your taxa assignments! So you probably don’t trust species assignments across all cases, but perhaps you are quite confident in your Genus or Family level assignments?