Understanding naive bayes+sklearn classifier, top-hit identity distribution

marymcelroy · May 27, 2023, 4:21pm

Okay, this is great info, Nicholas! Thank you so much for these insights. Truly, I am learning a lot from this exchange. Yes, I was concerned with the low level of species representation in my regional ref dbs, which I created by filtering the global ref dbs based on a regional species list. Maybe it would be better to use the global dbs and add class weights to regional species there?

If ref db representation is better at higher taxonomic ranks, could I be more confident in those classifications with either the global or regional dbs, especially if I'm setting high --p-confidence values (like 0.90 or 0.95)? How does the nb classifier decide whether a genus classification is better than a species annotation, for example? Does it calculate posterior probabilities for all ranks in the ref db and not just for species? I read this post, and it seems to answer my question: if the species-level annotation doesn't meet the --p-confidence threshold, the classifier sums the probabilities for all the species from that genus in the training set, and so on through higher ranks, until the threshold is exceeded. Do I have that right?
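To make the procedure described above concrete, here is a minimal Python sketch of it. It only mirrors the description in this post (sum species-level posteriors by shared lineage prefix, moving up one rank at a time until the confidence threshold is met) and is not taken from the q2-feature-classifier source. The function name `assign_taxonomy`, the semicolon-delimited lineage strings, and the probabilities are all made up for illustration.

```python
from collections import defaultdict


def assign_taxonomy(species_probs, confidence=0.7):
    """Hypothetical helper: return the deepest lineage whose summed
    posterior probability meets the confidence threshold.

    species_probs maps a semicolon-delimited lineage (deepest rank last)
    to that lineage's posterior probability for one query sequence.
    """
    n_ranks = max(len(lineage.split(";")) for lineage in species_probs)

    # Start at the deepest rank and move up one rank at a time.
    for depth in range(n_ranks, 0, -1):
        totals = defaultdict(float)
        for lineage, prob in species_probs.items():
            # Truncate the lineage to the current depth and pool its probability.
            prefix = ";".join(lineage.split(";")[:depth])
            totals[prefix] += prob

        best_lineage, best_prob = max(totals.items(), key=lambda kv: kv[1])
        if best_prob >= confidence:
            return best_lineage, best_prob

    # No rank reached the threshold.
    return "Unassigned", None


# Made-up posteriors for a single query sequence.
probs = {
    "g__A;s__a1": 0.55,
    "g__A;s__a2": 0.40,
    "g__B;s__b1": 0.05,
}

strict = assign_taxonomy(probs, confidence=0.9)
lenient = assign_taxonomy(probs, confidence=0.5)
print(strict)
print(lenient)
```

With these invented numbers, no single species reaches 0.9, but the two species in genus A together do, so the strict threshold stops at the genus while the lenient one keeps the species label.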

Does the risk of over/misclassification of novel queries come from situations where my training sets give high posterior probabilities for certain species, exceeding my confidence threshold and wrongly assigning novel seqs to species rather than to a higher rank? I assume this is more of a risk with my regional ref dbs.
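A toy Python illustration of this concern, under stated assumptions: naive Bayes posteriors are normalized over the classes present in the training set, so removing classes redistributes probability among the ones that remain. The likelihood values, lineage names, and the `posteriors` helper below are invented for the example and say nothing about real reference databases or about how the plugin computes its scores.

```python
def posteriors(likelihoods, priors=None):
    """Hypothetical helper: normalize likelihood * prior over the classes given.

    priors stands in for class weights; uniform weights are assumed if omitted.
    """
    if priors is None:
        priors = {name: 1.0 for name in likelihoods}
    joint = {name: likelihoods[name] * priors[name] for name in likelihoods}
    total = sum(joint.values())
    return {name: value / total for name, value in joint.items()}


# Made-up likelihoods of one query sequence under each reference species.
global_likelihoods = {
    "g__A;s__a1": 0.004,
    "g__A;s__a2": 0.003,
    "g__A;s__a3": 0.003,
    "g__B;s__b1": 0.0004,
}

# A "regional" reference that keeps only species on a regional list.
regional_species = {"g__A;s__a1", "g__B;s__b1"}
regional_likelihoods = {
    name: value
    for name, value in global_likelihoods.items()
    if name in regional_species
}

global_post = posteriors(global_likelihoods)
regional_post = posteriors(regional_likelihoods)

print(round(global_post["g__A;s__a1"], 3))
print(round(regional_post["g__A;s__a1"], 3))
```

In this invented case the same query falls well short of a 0.9 threshold for species a1 against the global set, but exceeds it against the filtered set, simply because its close relatives are no longer there to share the probability.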