Question about threshold values for determining bacterial genera/species

ivandeetan · December 11, 2019, 8:49am

Hello! Apologies if this has been asked or answered before but I'm still confused after reading through potentially related threads... I've run QIIME2-2019.4 on Ubuntu and used the Greengenes database for classification (specifically this one: gg-13-8-99-515-806-nb-classifier.qza).

I'd like to know what threshold the classifier used to identify or distinguish between bacterial genera/species. Does the classifier I used mean that taxonomic identification had a % similarity/identity threshold of 99%?

As a follow-up, I understand that the Greengenes database has not been updated since 2013; is this why using the classifier failed to identify a lot of bacteria up to genus- and species-level?

Thanks in advance for taking the time to read this and answering my questions!

Nicholas_Bokulich · December 11, 2019, 3:29pm

If you use classify-consensus-vsearch you can choose a % similarity threshold because that method is based on global alignment followed by LCA consensus classification.
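
To make the "threshold, then consensus" idea concrete, here is a minimal Python sketch. It is not the vsearch-based implementation in QIIME 2: the function names, the hit list, and the threshold values are assumptions made up for illustration, and the identities are taken as already computed by an aligner.

```python
from collections import Counter


def consensus_taxonomy(hit_taxonomies, min_consensus=0.51):
    """Return the deepest lineage supported by at least `min_consensus` of the hits."""
    lineages = [[rank.strip() for rank in tax.split(";")] for tax in hit_taxonomies]
    total = len(lineages)
    if total == 0:
        return "Unassigned"
    consensus = []
    for level in range(max(len(lineage) for lineage in lineages)):
        # Count labels at this rank, only among hits that agree with the consensus so far.
        labels = Counter(
            lineage[level]
            for lineage in lineages
            if len(lineage) > level and lineage[:level] == consensus
        )
        if not labels:
            break
        label, count = labels.most_common(1)[0]
        if count / total < min_consensus:
            break  # not enough agreement: stop at the previous rank
        consensus.append(label)
    return "; ".join(consensus) if consensus else "Unassigned"


def classify(hits, perc_identity=0.97, min_consensus=0.51):
    """hits: list of (identity, taxonomy) pairs, identity as a fraction in [0, 1]."""
    kept = [taxonomy for identity, taxonomy in hits if identity >= perc_identity]
    return consensus_taxonomy(kept, min_consensus)


# Hypothetical alignment hits for one query sequence.
hits = [
    (0.99, "k__Bacteria; p__Firmicutes; g__Lactobacillus"),
    (0.98, "k__Bacteria; p__Firmicutes; g__Lactobacillus"),
    (0.97, "k__Bacteria; p__Firmicutes; g__Pediococcus"),
    (0.90, "k__Bacteria; p__Proteobacteria; g__Escherichia"),  # below threshold, ignored
]
print(classify(hits))                     # majority consensus
print(classify(hits, min_consensus=1.0))  # strict agreement stops at the phylum
```

Raising the identity threshold or the consensus fraction makes the assignment more conservative: fewer hits are considered, or the label stops at a shallower rank.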

However, classify-sklearn is not based on alignment, it uses Naive Bayes classification based on kmer frequency profiles, so there is no % identity threshold.
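
If it helps to see why no % identity appears anywhere in this approach, here is a toy, pure-Python sketch of Naive Bayes over k-mer counts. It only illustrates the idea and is not the classify-sklearn code: the sequences, taxon names, the small k, the smoothing value, and the uniform priors are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def kmer_counts(seq, k=4):
    """Count every overlapping k-mer in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def train(references, k=4):
    """references: dict mapping taxon -> list of sequences. Returns per-taxon k-mer counts."""
    model = {}
    for taxon, seqs in references.items():
        counts = Counter()
        for seq in seqs:
            counts.update(kmer_counts(seq, k))
        model[taxon] = counts
    return model


def classify(model, query, k=4, alpha=1.0):
    """Pick the taxon with the highest multinomial log-likelihood (uniform priors assumed)."""
    query_kmers = kmer_counts(query, k)
    vocabulary = set(query_kmers).union(*model.values())
    scores = {}
    for taxon, counts in model.items():
        # Laplace smoothing so unseen k-mers do not give log(0).
        denominator = sum(counts.values()) + alpha * len(vocabulary)
        scores[taxon] = sum(
            n * math.log((counts[kmer] + alpha) / denominator)
            for kmer, n in query_kmers.items()
        )
    return max(scores, key=scores.get), scores


# Made-up reference sequences for two hypothetical taxa.
references = {
    "taxon_A": ["ACGTACGTACGTACGT"],
    "taxon_B": ["TTGGCCAATTGGCCAA"],
}
model = train(references)
best, scores = classify(model, "ACGTACGTAC")
print(best)
```

The query is never aligned to a reference. The score depends only on how well its k-mer profile matches each taxon's profile, so there is no identity percentage to set a threshold on.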

The "99" in the filename you listed refers to how the reference sequences in that database were clustered (at 99% similarity). This clustering only reduces redundancy in the database (a kind of dereplication); it is not a classification threshold.

As for your follow-up question: no. The most likely reason is that you are attempting to classify short DNA fragments; it looks like you are classifying V4 sequences, which on their own often cannot be resolved to genus and species level. This has nothing to do with the database release being 6 years old!

I hope that helps!

ivandeetan · December 12, 2019, 2:46am

Thank you very much for your answers, Sir!

I'd just like to clarify the following:

I did use "classify-sklearn". What would this mean for my analysis? Would my results be more accurate if I used "classify-consensus-vsearch" instead?

The primers I used actually targeted the V3 and V4 regions of the 16S rRNA gene. This produced an amplicon with a size of >400 bp. Is this still insufficient to classify up to genus/species? If so, what would you suggest as a better target?

Once again, thank you for your time and assistance.

Nicholas_Bokulich · December 12, 2019, 2:51am

I recommend sticking with classify-sklearn. You can read about all of these methods and see some performance comparisons in this article:

Aha, so two things.

First, to answer your question: Yes, V3-V4 is a longer amplicon with a little more information. You still may not be able to get species level (essentially because different species have V3-V4 domains that are too similar to differentiate from each other), but this will be a bit better than V4.

But now the answer you really need: you are getting poor classification because you are using a V4 classifier on V3-V4 sequences, so the classifier is getting confused and not functioning properly! You should train your own classifier using the primer set that you used; see the training a feature classifier tutorial on qiime2.org for a detailed example.
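
One way to picture the mismatch, as a rough Python sketch rather than what QIIME 2 actually runs: trim a reference to the region between a primer pair, then check how many of a read's k-mers the trimmed reference contains. Everything here is made up for illustration (the primers, the sequences, and the helper names), and the matching is exact, whereas real primers contain degenerate bases and real tools tolerate mismatches.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string containing only A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def extract_amplicon(reference, f_primer, r_primer):
    """Return the region between the two primers (primers excluded), or None if not found."""
    start = reference.find(f_primer)
    if start == -1:
        return None
    start += len(f_primer)
    end = reference.find(reverse_complement(r_primer), start)
    if end == -1:
        return None
    return reference[start:end]


def kmer_set(seq, k=4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fraction_seen(read, training_region, k=4):
    """Fraction of the read's distinct k-mers that also occur in the training region."""
    read_kmers = kmer_set(read, k)  # assumes the read is at least k bases long
    return len(read_kmers & kmer_set(training_region, k)) / len(read_kmers)


# Hypothetical primers and a toy "full-length" reference built around them.
F_V3, F_V4, R_V4 = "CCTACGG", "GTGCCAG", "GGACTAC"
v3, v4 = "ATTGCACAATGGG", "CGGTAATACGTAGG"
reference = "TTGA" + F_V3 + v3 + F_V4 + v4 + reverse_complement(R_V4) + "ACCT"

v4_training = extract_amplicon(reference, F_V4, R_V4)    # what a V4 classifier was trained on
v34_training = extract_amplicon(reference, F_V3, R_V4)   # what your primers amplify
v34_read = v34_training                                  # an idealised V3-V4 read

print(fraction_seen(v34_read, v4_training))   # below 1: the V3 part was never seen in training
print(fraction_seen(v34_read, v34_training))  # 1.0: every k-mer is covered
```

With the V4-only reference, a sizeable share of the read's k-mers is simply absent from the training data, which is one plausible reason the classifier performs poorly. Trimming the reference with your own primer pair before training removes that gap.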

If you are unable to train your own classifier, use one of the full-length 16S pre-trained classifiers that are available in the data resources section of qiime2.org.

ivandeetan · December 12, 2019, 6:26am

Noted; will do as you suggested. Thank you very much!
