Hi @mengpf0409,
These are great questions!
This issue can be partially explained as outlined here, here, here and here. That is, the stricter your --p-perc-identity setting (e.g. 0.9), the fewer reference sequences will match. This results in a smaller pool of hits from which to calculate the lowest common ancestor (LCA) consensus taxonomy (this is tied to the --p-min-consensus parameter), and therefore in a consensus taxonomy that is broader, i.e. at a higher taxonomic rank.
When using --p-perc-identity 0.8, you allow more hits to the reference sequences to be retained, which increases the pool of taxonomic information that can contribute to the LCA consensus taxonomy, again tied to the --p-min-consensus parameter. It is kind of a "majority-rule" approach.
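To make the "majority-rule" idea concrete, here is a rough Python sketch. It is a simplified toy, not the actual QIIME 2 implementation: the function name consensus_taxonomy, the example lineages, and the 0.51 threshold are all assumptions chosen for illustration. It walks down the ranks and keeps a label only while the most common lineage is shared by at least the min-consensus fraction of hits.

```python
from collections import Counter


def consensus_taxonomy(hits, min_consensus=0.51):
    """Toy majority-rule consensus over semicolon-delimited lineages.

    hits: taxonomy strings for the reference sequences that passed the
    percent-identity filter (hypothetical data, not real BLAST output).
    """
    if not hits:
        return "Unassigned"
    lineages = [[rank.strip() for rank in h.split(";")] for h in hits]
    consensus = []
    for level in range(max(len(lin) for lin in lineages)):
        # Count each lineage prefix down to the current rank.
        prefixes = Counter(
            tuple(lin[: level + 1]) for lin in lineages if len(lin) > level
        )
        prefix, count = prefixes.most_common(1)[0]
        # Stop as soon as the best-supported lineage falls below the threshold.
        if count / len(lineages) < min_consensus:
            break
        consensus = list(prefix)
    return "; ".join(consensus) if consensus else "Unassigned"


# Small pool (as with a strict identity cutoff): the two hits disagree at genus.
strict_hits = [
    "k__Bacteria; f__Lactobacillaceae; g__Lactobacillus",
    "k__Bacteria; f__Lactobacillaceae; g__Pediococcus",
]
# Larger pool (as with a looser cutoff): a clear genus-level majority emerges.
loose_hits = strict_hits + ["k__Bacteria; f__Lactobacillaceae; g__Lactobacillus"] * 3

print(consensus_taxonomy(strict_hits))  # k__Bacteria; f__Lactobacillaceae
print(consensus_taxonomy(loose_hits))   # k__Bacteria; f__Lactobacillaceae; g__Lactobacillus
```

In this made-up case, the small pool splits 50/50 at genus level, so the consensus stops at family, while the larger pool has 4 of 5 hits agreeing and reaches genus. Real results will depend on your reference database and reads.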
Keep in mind that, for some datasets and primer-pair choices, it is not unusual for short-read amplicon data to be unclassifiable at the genus level. In some cases a more specific classification is the result of over-classification, that is, the classifier returning a more specific identification than the data can actually support.
Also, feature-classifier classify-sklearn works a bit differently than feature-classifier classify-consensus-blast. I'd highly recommend reading the following papers: