Classification using GreenGenes v. SILVA

aphanotus · March 6, 2017, 3:23pm

I just started using qiime2 for a project where I have data from two 16S regions as well as 18S. I’ve been following the advice to build custom classifiers for each region, based on the amplification primers. However, so far the “best” results have come from the full GreenGenes dataset without refinement to a specific region. And by “best”, I mean most OTUs are named at low taxonomic levels. Is this simply false certainty? Is there any way to obtain support for the taxonomic assignments made by qiime feature-classifier classify? What are the relative merits of classification with GreenGenes vs. SILVA? (FWIW, I’m using SILVA release 128 99% majority datasets.) I’d welcome any opinions or suggested references.
Thanks!

gregcaporaso · March 6, 2017, 3:24pm

@BenKaehler, could you answer this question?

BenKaehler · March 6, 2017, 7:10pm

Thanks @aphanotus for your questions.

Are you setting the --p-confidence parameter when you run qiime feature-classifier classify?

If not, and some OTUs are not being named at low taxonomic levels, I suspect it is because they are being assigned labels from the reference taxonomy that don't specify the taxonomy at that level. (The reference might only label a sequence to the family level, for instance.)

Is this simply false certainty?

Again, if you are not setting --p-confidence, then yes: the presence or absence of an assignment at a specific level just reflects whether the most similar OTUs in the reference are labelled at that level, and is not a reflection of certainty.

Is there any way to obtain support for the taxonomic assignments made by qiime feature-classifier classify?

We have a --p-confidence parameter that tries to trim assignments to the point where they achieve a certain level of "goodness". Words like "confidence" and "support" are tricky because they are often misused and misunderstood. Our --p-confidence parameter is so named because its behaviour is similar to that of the corresponding parameter in the RDP classifier. However, it should be noted that it has key differences that we are just in the process of benchmarking. I would suggest experimenting with --p-confidence but not trusting it until we have some more solid recommendations.
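
To make the idea of trimming concrete, here is a minimal Python sketch of one RDP-style rule: walk down the lineage and stop at the first level whose confidence falls below the threshold. It is an illustration only and should not be read as the actual implementation behind --p-confidence. The function name `trim_lineage`, the per-level scores, and the 0.7 threshold are all hypothetical.

```python
def trim_lineage(levels, confidences, threshold=0.7):
    """Keep the leading taxonomic levels whose confidence meets the threshold.

    Hypothetical helper for illustration; not part of QIIME 2.
    """
    if len(levels) != len(confidences):
        raise ValueError("levels and confidences must have the same length")
    kept = []
    for name, conf in zip(levels, confidences):
        if conf < threshold:
            break  # this level and everything deeper is dropped
        kept.append(name)
    return kept


# Made-up example values, not real classifier output.
lineage = ["k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Lactobacillales",
           "f__Lactobacillaceae", "g__Lactobacillus", "s__reuteri"]
scores = [1.00, 0.99, 0.98, 0.95, 0.90, 0.72, 0.31]

trimmed = trim_lineage(lineage, scores, threshold=0.7)
print("; ".join(trimmed))
```

With these made-up scores the species label is dropped and the assignment stops at genus. Raising the threshold trims the assignment further up the lineage, which is the trade-off to keep in mind when experimenting.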

What are the relative merits of classification with GreenGenes vs. SILVA?

@gregcaporaso or @Nicholas_Bokulich, could you please comment on this one?

jairideout · March 10, 2017, 6:14pm

An off-topic reply has been split into a new topic: How to interpret taxonomy assignment confidence scores?

Please keep replies on-topic in the future.

gregcaporaso · March 15, 2017, 5:31pm

This is a good question. The Greengenes taxonomy is a bit easier to work with computationally, as the taxonomy strings are more concise and the taxonomic levels are more consistent. In Greengenes there are always exactly seven levels corresponding to kingdom, phylum, class, order, family, genus, and species.
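
Because of that fixed structure, a Greengenes label can be split into its seven ranks with a few lines of code. The sketch below assumes the usual Greengenes string format: rank prefixes such as `k__` and `p__`, separated by semicolons, with an empty name (for example `s__`) where the reference does not label that level. The function names are hypothetical, and the example label is only for illustration.

```python
GG_RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]
GG_PREFIXES = ["k__", "p__", "c__", "o__", "f__", "g__", "s__"]


def parse_greengenes(label):
    """Split a seven-level Greengenes taxonomy string into a rank -> name dict.

    Hypothetical helper for illustration; an unlabelled rank maps to "".
    """
    parts = [part.strip() for part in label.split(";")]
    if len(parts) != len(GG_RANKS):
        raise ValueError(f"expected {len(GG_RANKS)} levels, got {len(parts)}")
    parsed = {}
    for rank, prefix, part in zip(GG_RANKS, GG_PREFIXES, parts):
        if not part.startswith(prefix):
            raise ValueError(f"expected prefix {prefix!r} in {part!r}")
        parsed[rank] = part[len(prefix):]
    return parsed


def deepest_named_rank(parsed):
    """Return the last rank (in kingdom-to-species order) with a non-empty name."""
    deepest = None
    for rank in GG_RANKS:
        if parsed[rank]:
            deepest = rank
    return deepest


example = ("k__Bacteria; p__Bacteroidetes; c__Bacteroidia; "
           "o__Bacteroidales; f__S24-7; g__; s__")
parsed = parse_greengenes(example)
print(deepest_named_rank(parsed))
```

A label like this one stops at family because the reference itself has no genus or species name for that sequence, which is a property of the reference and not a statement about classification certainty.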

Silva, on the other hand, has been updated more recently and contains Bacteria, Archaea, and Eukaryotes (while Greengenes contains only Bacteria and Archaea).

Since your study includes (eukaryotic) 18S sequences, you'll need to use Silva to classify those since Greengenes does not include eukaryotes.