We used the GreenGenes classifier to get a taxonomic breakdown for some samples, but we would also like to see an ANI/identities breakdown to get a better idea of the criteria it is using.
Is there any way to export/print this information from QIIME 2? Also, is there a way to export the cutoff parameters used for a classifier (e.g., the ANI cutoff for genus vs. species)?
What plugin command did you use for classification? The scikit-learn command doesn't use ANI... but vsearch sort of does. Once we know the specific command, we can explore how it works and what cutoffs it uses.
I used the classify-sklearn command, so it looks like there’s no way of getting the ANI. I would love more background about how it works though! I was trying to find a paper or discussion about it earlier but didn’t have any luck.
So, the official reference is the Scikit-learn paper with a whopping 16,000 citations.
A better ref is probably this:
And just to really answer your question:
The plugin provides a default method which is to extract k-mer counts from reference sequences and train the scikit-learn multinomial naive Bayes classifier, and it is this method that we test extensively here. Specifically, the pipeline consists of a sklearn.feature_extraction.text.HashingVectorizer feature extraction step followed by a sklearn.naive_bayes.MultinomialNB classification step. The use of a hashing feature extractor allows the use of significantly longer k-mers than the 8-mers that are used by RDP Classifier, and we tested up to 32-mers.
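If it helps to see the idea in code, here is a rough sketch of the pipeline that quote describes: hashed k-mer counts fed into a multinomial naive Bayes classifier. The toy sequences, the taxon labels, the 7-mer length, `n_features`, and `alpha` are all assumptions chosen for illustration, not the plugin's actual settings.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy reference sequences and labels, made up for illustration only.
ref_seqs = [
    "ACGTACGTTAGCTAGCTAGGATCCGATCGATTACGCTAGCTAGGCTA",
    "TTGACCGGTTAACCGGTTTGGCCAATTGGCCAAGGTTCCAAGGTTCC",
]
ref_taxa = ["g__ToyGenusA", "g__ToyGenusB"]

K = 7  # assumed k-mer length for this sketch

classifier = make_pipeline(
    # Character n-grams of length K are the k-mers; hashing keeps the
    # feature space a fixed size no matter how long the k-mers are.
    # alternate_sign=False keeps features non-negative, which
    # MultinomialNB requires.
    HashingVectorizer(
        analyzer="char",
        ngram_range=(K, K),
        n_features=2**13,
        alternate_sign=False,
    ),
    MultinomialNB(alpha=0.01),
)
classifier.fit(ref_seqs, ref_taxa)

# A query taken from the first reference sequence.
query = ref_seqs[0][5:35]
predicted = classifier.predict([query])[0]
probabilities = dict(
    zip(classifier.classes_, classifier.predict_proba([query])[0])
)
print(predicted, probabilities)
```

Note that what comes out is a class probability per taxon, not a sequence identity. The classifier never aligns the query to a reference, which is why there is no ANI value to export from this method.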
By which @colinbrislawn means the classify-consensus-vsearch (and also classify-consensus-blast) methods in q2-feature-classifier. These perform vsearch or blast-based searching against a reference database, then least common ancestor taxonomic classification. Both have percent identity parameters, which allow you to specify how similar your query must be to the reference (e.g., 97% similarity) to be accepted as a hit. So it does not answer your question squarely — you cannot report the actual ANI — but you can set a threshold for how close your hits need to be.
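To make the "identity threshold, then common ancestor" idea concrete, here is a toy Python sketch. It is not how vsearch or BLAST work internally: it assumes pre-aligned, equal-length sequences, uses a strict common-ancestor rule, and every function name, sequence, and taxonomy string below is hypothetical.

```python
def percent_identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy version expects pre-aligned, equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def common_ancestor(taxonomies):
    """Keep the leading ranks that every taxonomy string agrees on."""
    shared = []
    for ranks in zip(*(t.split("; ") for t in taxonomies)):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return "; ".join(shared) if shared else "Unassigned"


def classify(query, references, perc_identity=0.97):
    """Accept references at or above the identity threshold, then take their common ancestor."""
    hits = [tax for seq, tax in references
            if percent_identity(query, seq) >= perc_identity]
    return common_ancestor(hits) if hits else "Unassigned"


# Made-up references: the second differs from the first at one of 20 positions (95%).
references = [
    ("ACGTACGTACGTACGTACGT", "k__Bacteria; g__Escherichia; s__coli"),
    ("ACGTACGTACGTACGTACGA", "k__Bacteria; g__Escherichia; s__albertii"),
    ("TTTTTTTTTTGGGGGGGGGG", "k__Bacteria; g__Bacillus; s__subtilis"),
]
query = "ACGTACGTACGTACGTACGT"

strict = classify(query, references, perc_identity=0.97)
relaxed = classify(query, references, perc_identity=0.90)
print(strict)
print(relaxed)
```

At the stricter threshold only the exact match is accepted, so the species label survives. At the looser threshold both Escherichia references are hits, they disagree at species level, and the assignment backs off to genus. So the threshold decides which hits count, but the depth of the final label depends on how well those hits agree.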