Naive Bayes vs BLAST comparison for COI region


I'm currently working on classification of my COI dietary dataset using the MIDORI database. I posted about this previously while working on a different COI dataset, where I ran into low resolution, assignments of species that do not occur in the study's geographic area, and low confidence scores when using the Naive Bayes method.

For this current dataset I tried both the Naive Bayes and BLAST (97% identity) classification methods to compare which taxa were identified. For Naive Bayes, I also followed the earlier suggestion and did not trim the database to the primer region before training the classifier, which did seem to improve the results. The two methods identified the same species for the most part, but there were some instances where Naive Bayes identified species that BLAST did not, and vice versa. I also ran into some of the same issues with Naive Bayes as before: species classified with lower confidence scores that do not occur in the correct geographic area. I did not see the geographically incorrect species with the BLAST method.

I'm wondering what others' experiences are with the Naive Bayes classification method for the COI region. Why might I be seeing differences in which species the two methods identify? Does the Bayes classifier behave differently depending on which region is used? I am also wondering what could cause species to be classified that are not even in the correct geographic range when using Bayes; is this something anyone else has experienced? Hopefully I explained this well. I am trying to decide whether I should combine both methods for my species identification.

Thanks so much! :slight_smile:


Hi @Newt ,

The methods do work in very different ways, so it is not totally surprising that they give different answers.

Parameters also impact performance, so you might want to adjust these and see how they affect your results. The default settings in QIIME 2 showed the best performance in a benchmark of these classifiers by @devonorourke, but that benchmark used different databases, so performance might vary for MIDORI, and you may want to try some other confidence and consensus threshold settings:

No, not really, though parameter settings may need to be adjusted for different regions and databases to tune performance for those data.

Yes, this can definitely happen, especially if the database has limited coverage of a certain clade: your query hits the nearest neighbor in that clade, which might not be present in that geographic range. We have seen this quite a bit with COI, so the results require some careful inspection. The solution we developed for this (with 16S data) is to weight the database by the probability of observing a given species in your samples, but we have not tested this with COI data, so it would be a more involved method to use here:
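The effect of that weighting can be sketched with a toy posterior calculation. This is not the actual weighting implementation, just an illustration of how priors shift a Naive Bayes decision; all species names and numbers below are invented:

```python
# Toy illustration: P(species | read) is proportional to
# P(read | species) * P(species), so changing the priors (weights)
# can flip which of two near-identical reference species wins.

def posterior(likelihoods, priors):
    """Normalize likelihood * prior into posterior probabilities."""
    joint = {sp: likelihoods[sp] * priors[sp] for sp in likelihoods}
    total = sum(joint.values())
    return {sp: p / total for sp, p in joint.items()}

# Two close congeners; the read matches both almost equally well.
likelihoods = {"Species_A_local": 0.48, "Species_B_foreign": 0.52}

# Uniform priors: the geographically implausible neighbor wins.
flat = posterior(likelihoods,
                 {"Species_A_local": 0.5, "Species_B_foreign": 0.5})

# Priors weighted by the probability of observing each species
# in the study region: the local species wins instead.
weighted = posterior(likelihoods,
                     {"Species_A_local": 0.9, "Species_B_foreign": 0.1})

print(max(flat, key=flat.get))      # Species_B_foreign
print(max(weighted, key=weighted.get))  # Species_A_local
```

The point is that the reference sequences themselves never change; only the prior expectation of seeing each species does, which is why geographically implausible nearest neighbors stop winning ties.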

This is also possible! The RESCRIPt plugin has a method to merge taxonomies; it allows you to create a consensus from two or more taxonomic classifications. It will reduce your taxonomic resolution, but this may be what you want.
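Conceptually, one way to merge two classifications down to their agreement is a longest-common-prefix over the rank labels. This is only a toy sketch of that idea, not the RESCRIPt implementation (which offers several merge modes); the example lineages are invented:

```python
def lca(tax_a, tax_b, sep=";"):
    """Keep ranks only as deep as the two classifications agree."""
    ranks_a = [r.strip() for r in tax_a.split(sep)]
    ranks_b = [r.strip() for r in tax_b.split(sep)]
    shared = []
    for a, b in zip(ranks_a, ranks_b):
        if a != b:
            break
        shared.append(a)
    return sep.join(shared)

# The two classifiers disagree at species level, so the consensus
# stops at genus.
nb    = "Arthropoda;Insecta;Diptera;Chironomidae;Chironomus;Chironomus_riparius"
blast = "Arthropoda;Insecta;Diptera;Chironomidae;Chironomus;Chironomus_plumosus"
print(lca(nb, blast))  # Arthropoda;Insecta;Diptera;Chironomidae;Chironomus
```

This shows the trade-off mentioned above: wherever the two methods disagree, the merged result retreats to a shallower rank, trading resolution for reliability.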


Thank you for providing this information! I will try out some different confidence and consensus values.

One thing I am confused about: why is species-level identification more accurate at a 0.3 confidence level than at the default of 0.7, as found here? I would have figured that reducing the confidence level would result in less accurate or misidentified species in my case, since I am already seeing that at 0.7, but maybe I am thinking about it wrong.

"Confidence" is not a great term. Technically speaking, these are the probability scores for a given taxonomic label versus all other labels, summing to 1.0. So confidence=0.7 is actually quite high when a large number of labels is considered (e.g., for species classification), as it implies that all other classes together have a very low probability. This is especially true when other close hits are present in the database, as is common in taxonomic classification of most markers (hence why 0.7 is the default and not, say, 0.99... we would always like to be super confident! but not when summing raw probability scores :grin: ).

If the confidence threshold is not met at a given rank, the classifier attempts classification at a shallower taxonomic rank (e.g., genus, family, etc.). So a lower confidence threshold can be a good thing in some classification settings, to avoid underclassification (falling back to a shallower rank than necessary).
Importantly, we settled on 0.7 as the default for this classifier based on 16S and ITS classification. COI is a different situation, and evidently lower confidence improves accuracy for COI, presumably by decreasing underclassification :man_shrugging:
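To make the fall-back behavior concrete, here is a toy version of the logic: sum the per-species probabilities up each rank and report the deepest rank whose summed probability meets the threshold. This is not the scikit-learn classifier's implementation, and all lineages and numbers are invented:

```python
def classify(species_probs, confidence):
    """species_probs: {lineage string: probability}, summing to ~1.0.
    Returns the deepest lineage prefix (of the best species hit) whose
    aggregated probability mass meets the confidence threshold."""
    best = max(species_probs, key=species_probs.get)
    ranks = best.split(";")
    # Walk from species up toward the root, aggregating probability mass.
    for depth in range(len(ranks), 0, -1):
        prefix = ranks[:depth]
        mass = sum(p for lin, p in species_probs.items()
                   if lin.split(";")[:depth] == prefix)
        if mass >= confidence:
            return ";".join(prefix)
    return "Unassigned"

# Three close hits split the probability mass, as is common for COI.
probs = {
    "Chironomidae;Chironomus;C_riparius": 0.45,
    "Chironomidae;Chironomus;C_plumosus": 0.35,
    "Chironomidae;Tanytarsus;T_gracilentus": 0.20,
}
print(classify(probs, 0.7))  # falls back to genus: Chironomidae;Chironomus
print(classify(probs, 0.3))  # species level: Chironomidae;Chironomus;C_riparius
```

With close reference hits, no single species reaches 0.7, so the 0.7 threshold retreats to genus even though the top species hit may well be correct; 0.3 keeps the species-level call. That is the mechanism by which a lower threshold can reduce underclassification.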


I appreciate your help with this! I ran the classification again at 0.3 confidence. It didn't seem to help at all, and it resulted in even more species being classified that don't occur in the geographic range. In this dataset I'm assessing diet, preferably to species level, so I want to be cautious about which species are identified. I'm wondering if in my case it makes sense to use a higher confidence value, or to set a confidence threshold when interpreting the identified taxa.

Yes, it is definitely worth a try. Since you know which species are not found in your study region, that is a good marker for evaluating what confidence level is reliable for your study.
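One simple way to use that information is to sweep a ladder of candidate cutoffs and count how many retained species-level calls fall in your known out-of-range list. This is just an illustrative sketch; the species names, confidence scores, and cutoff grid are all invented:

```python
# Toy calibration: for each candidate confidence cutoff, count how many
# retained species-level assignments are geographically implausible.

calls = [  # (assigned species, classifier confidence) -- invented examples
    ("Chironomus_riparius", 0.92),
    ("Baetis_rhodani", 0.81),
    ("Simulium_ornatum", 0.55),
    ("Anopheles_gambiae", 0.41),   # known not to occur in the study region
    ("Drosophila_suzukii", 0.36),  # known not to occur in the study region
]
out_of_range = {"Anopheles_gambiae", "Drosophila_suzukii"}

for cutoff in (0.3, 0.5, 0.7, 0.9):
    kept = [sp for sp, conf in calls if conf >= cutoff]
    implausible = sum(sp in out_of_range for sp in kept)
    print(f"cutoff {cutoff}: kept {len(kept)} species, {implausible} implausible")
```

The cutoff you want is the lowest one at which the implausible count drops to zero (or near it), since that keeps the most true diet species while screening out the geographically impossible calls.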

The confidence=0.3 threshold evaluated in that study did not take endemic ranges into account; it was based on classification of an artificial community. So your case is quite different.


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.