When using Kraken, I am able to find 19 of the 20 species within the community. However, when using Qiime2 and the four Naive Bayes classifiers trained on Silva and Greengenes, I am only able to identify a few of the species.
Any idea how to improve accuracy?
What primer set are you using? You will want to make sure that the classifier is trained on the same subdomain for optimal accuracy.
You can reduce the --p-confidence setting to increase recall at the expense of precision (in other words, you will get more species-level classifications)
These classifiers are designed to be quite conservative, i.e., reduce false-positives at the expense of species-level classification. So seeing mostly genus-level classifications on 16S fragments with default settings is quite normal, and is designed that way to avoid errors.
Note that mock communities are generally made up of well-characterized, easy-to-classify species, so adjusting confidence parameters to achieve perfect species-level classification on mock communities is usually quite achievable... in this paper we showed that we could get perfect classification of some mock communities with any classification method by adjusting the parameters one way... but those settings often proved fatal for classification of more difficult-to-classify species elsewhere on the tree of life. This is how we landed on the conservative default settings in q2-feature-classifier.
So feel free to adjust confidence — we even have recommendations in that paper linked above for improving recall — but be aware that better species-level classification on a mock community is sometimes an indicator of an "overly confident" classifier that can make mistakes!
Looks like you are targeting 341-805 (V3-V4) — so this could explain part of your issue (the V4 pre-trained classifiers will perform poorly since they only cover half of your amplicons; the full-length classifiers are sub-optimal; a V3-V4 classifier would be best)
Those are the adapters + primers, so contain some non-biological sequence that you want to exclude when extracting reads for training a primer. So your example command would not work — you would not extract any reads because the adapter portion would not match anything.
I have just the thing for you! See this topic:
You can follow those directions to train your own V3-V4 classifier... and please let me know what you find!
download the full gg_13_8 release (check out the data resources page at qiime2.org, the full database is linked below the pre-trained classifiers that you used)
unzip
the file will be located here: gg_13_8_otus/rep_set/99_otus.fasta
Those instructions could be used with any 16S reference database... so you could use SILVA or Greengenes or GTDB, or the best comparison to Kraken would be to use the same exact reference database that was used to train Kraken.
Hi,
I am still not able to classify very well the mock community. Several wrong genera are identified.
Here is my entire pipeline, I used this reference file:
gg_12_10 is ancient, 13_8 is also quite old but contains various updates to taxonomy, etc. I recommend using 13_8 at the very least, but trying out the latest release of SILVA or GTDB would be better. The best thing, if you are comparing this against kraken or other methods, is to use the exact same reference database for all methods, so that you know that differences are due to the method, not the database.
Zero confidence is akin to top-hit blast; the classifier will just grab the closest hit without considering other similar hits. This is good for testing but not the most accurate method (results may vary: it depends on the contents of the mock community and the reference database but in general confidence=0 is a poor choice)
This is beginning to sound quite suspicious and I wonder if some technical error is afoot. Can you tell me a little more about your sequencing protocol. Do you know if the reads are in mixed orientations? Could you share a QZV barplot of your results?
Excellent! That's more in line with expected performance. What species were missing and what false positives did you get? Note that GTDB has an altered taxonomy (to better reflect the molecular phylogeny), so any species you are missing just might have been renamed... you can search for the missing species on the GTDB site to see if they were renamed.
Yep, I seem to recall those two being quite similar
Looking at GTDB, it looks like a number of S. aureus sequences were renamed S. argenteus in GTDB, so this could be a nomenclature issue. The others are probably misclassification (e.g., due to sequencing error on the other Staph isolates)
No I believe the source data are all slightly different. GTDB's sequences are from Genbank refseq. SILVA's are from EMBL. Greengenes I am not sure, probably Genbank as of 7 years ago.
BUT GTDB should have all the same taxonomic groups (even if some of them are re-named).
No clue — I suggest reading the information on the GTDB site, there is probably a README with that information.