My questions concern using the Silva and Greengenes databases to assign taxonomy in 16S rRNA gene sequencing studies. I recognize that each database has its merits and drawbacks, so I am evaluating which of the two is best suited for my data. To do this, I used both to assign taxonomy and calculated, for each taxonomic rank (from domain to species), the percentage of features that each database was unable to assign. In other words, for Greengenes, I counted the blank cells at the domain, phylum, class, etc. rank. For Silva, I counted not only the blank cells, but also cells containing any of the following: ambiguous taxa, uncultured bacterium, and uncultured. The output yielded the following, which includes the absolute number of unassigned cells as well as their respective relative proportions:
Here, Silva and Greengenes assigned similar proportions of features at the phylum level (note that domain is 0 because I filtered the reads to obtain bacteria only). Greengenes then assigned more at the class and order ranks, after which Silva assigned a greater proportion of features at the family and genus ranks. Since Silva does not curate its database at the species level, it makes sense that Greengenes assigned more features there. Has anyone else observed this pattern of Greengenes assigning more features than Silva at the class and order ranks? This seems somewhat counterintuitive considering that Silva is the larger of the two databases.
I have looked into the publication SILVA, RDP, Greengenes, NCBI and OTT — how do these taxonomies compare? by Monika Balvočiūtė and Daniel Huson (2017) for answers and noticed that in the Venn diagrams depicted in Figure 3, the amount of unique taxa in the Greengenes database increases until the order rank and begins to decrease from family onwards, mirroring my observations. However, I am not sure if there is a concrete reason for this similarity or if I'm seeing this pattern simply because I am searching for an answer.
So, in short, my questions are: Why does Greengenes assign more features at the class and order ranks than Silva? Based on the table provided, which database is better suited for my analysis?
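The per-rank counting described in this question could be sketched roughly as follows. This is only an illustration, not the poster's actual script: it assumes taxonomy strings are semicolon-separated, that rank prefixes such as `D_1__` or `g__` are stripped before comparison (so a bare prefix with no name also counts as blank), and the names `count_unassigned` and `SILVA_UNINFORMATIVE` are made up for the example.

```python
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]

# Labels treated as "unassigned" for Silva, following the description above.
SILVA_UNINFORMATIVE = frozenset({"ambiguous taxa", "uncultured bacterium", "uncultured"})


def split_ranks(taxon, sep=";"):
    """Split a taxonomy string into exactly len(RANKS) cells, padding with blanks."""
    cells = [c.strip() for c in taxon.split(sep)]
    cells += [""] * (len(RANKS) - len(cells))
    return cells[:len(RANKS)]


def strip_prefix(cell):
    """Drop a rank prefix such as 'D_1__' or 'g__' if one is present (assumed format)."""
    return cell.split("__", 1)[1] if "__" in cell else cell


def count_unassigned(taxa, uninformative=frozenset()):
    """Return {rank: (count, percent)} of features with no usable label at each rank."""
    counts = dict.fromkeys(RANKS, 0)
    for taxon in taxa:
        for rank, cell in zip(RANKS, split_ranks(taxon)):
            label = strip_prefix(cell).strip().lower()
            if label == "" or label in uninformative:
                counts[rank] += 1
    total = len(taxa)
    return {r: (n, 100.0 * n / total if total else 0.0) for r, n in counts.items()}
```

Calling `count_unassigned(silva_taxa, SILVA_UNINFORMATIVE)` and `count_unassigned(gg_taxa)` on the two lists of taxonomy strings would give one row per rank for each database, which can then be compared side by side.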
I am in the process of making the same decision and have done the same calculations as you. The only difference is that I did not remove ambiguous taxa and uncultured from SILVA, which I now think maybe I should. One thing about Greengenes, though: I had some features that were actually unassigned but were not empty or NA in the taxonomy data, because they were shown as g__, p__, etc. at different levels. Make sure you remove those as well, because they are not assigned. I'm not sure why Greengenes shows them like this instead of leaving them empty. What I had for my data (without removing ambiguous and uncultured) are these:
I am no expert on either of these databases, nor am I a classifier guru like some of our other moderators, so take my answer below with a big grain of salt.
Here is one reason why you may see Greengenes score better at the species level.
Let’s say GG has 1 member of the Akkermansia genus, the muciniphila species.
Now, in recent years, newer members of this genus, such as glycaniphila, have been discovered and named, and these are only included in Silva’s newest releases.
Now you have a scenario where a classifier is unable to differentiate between these 2 species within the SILVA database but since there is only 1 member in GG, it will happily call it that sole member. So this to me makes total sense that in some cases the extra data in larger databases leads to lower resolution.
It is important to note that (unless you are classifying samples with known composition, e.g., mock communities) more features classified to species does not necessarily mean better, because you don’t know if those classifications are correct.
I agree with @Mehrbod_Estaki regarding the relative strengths and weaknesses of working with larger/more diverse/more recently updated databases. This ties into my point about species-level classifications being correct. To work off of @Mehrbod_Estaki’s example, imagine the true species is A. glycaniphila — GG would classify this to species (A. muciniphila) because there is no ambiguity in the genus, but SILVA would probably classify to genus level if it cannot distinguish A. muciniphila from A. glycaniphila (e.g., because they have identical seqs for the marker gene fragment that you sequenced). Which would you prefer, a correct genus-level classification or an incorrect species-level classification?
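To make the Akkermansia example concrete, here is a toy sketch. It is not how a real classifier works: it uses exact sequence matching plus a lowest-common-ancestor rule purely for illustration, the function `classify` is hypothetical, and the sequences are invented and identical by construction (mirroring the "identical seqs" scenario above).

```python
def classify(query, reference):
    """Return the deepest lineage shared by every reference entry matching `query`.

    `reference` maps a (genus, species) tuple to a marker sequence.
    An empty tuple means no reference sequence matched.
    """
    hits = [lineage for lineage, seq in reference.items() if seq == query]
    if not hits:
        return ()
    shared = []
    # Walk the ranks from genus to species and stop at the first disagreement.
    for names in zip(*hits):
        if len(set(names)) != 1:
            break
        shared.append(names[0])
    return tuple(shared)


# Invented marker fragment; NOT a real Akkermansia sequence.
FRAGMENT = "ACGTACGTTAGC"

small_db = {("Akkermansia", "muciniphila"): FRAGMENT}
large_db = {
    ("Akkermansia", "muciniphila"): FRAGMENT,
    ("Akkermansia", "glycaniphila"): FRAGMENT,  # same fragment by assumption
}

print(classify(FRAGMENT, small_db))  # ('Akkermansia', 'muciniphila')
print(classify(FRAGMENT, large_db))  # ('Akkermansia',)
```

With the smaller reference the query is called to species regardless of which species it really came from, while the larger reference only supports a genus-level call. That is the trade-off between depth and correctness described above.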
You should not remove those. Those are GG’s method of handling taxonomically ambiguous clades, see the original publication for a full description and see this topic for more details:
No — at the very least, removing these could lead to overclassification. These sequences represent real clades that cannot be differentiated to level X based on 16S. Removing them would remove a large number of sequences from the database and impact classifier training — kmers that are not particularly diagnostic could suddenly (and incorrectly) become species-specific signatures because you’ve removed a large amount of true signal. You will get incorrect classifications if you do this (trust me — I’ve tested this very question!)
Hmmmm, I think you misunderstood my question. I did not mean removing them from the taxonomy file. What I meant was for the purpose of figuring out which database is classifying more taxa. This is just for my own calculations, not to remove anything from the data.
Got it. Yes, for evaluating the results it probably makes sense not to count these as “classified” at species level. Classification to an unknown species does mean that the classifier could match your query sequence to that reference OTU unambiguously, so it is better than unclassified, but that OTU matches 2 or more distinct species.
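For the evaluation step only (not for editing the reference taxonomy), one option is to keep three categories apart rather than two, so that a bare `g__` or `s__` is reported separately from a truly blank cell. A rough sketch, assuming Greengenes-style cells such as `g__Akkermansia` and hypothetical helper names `cell_status` and `tally_rank`:

```python
import re
from collections import Counter

# Assumption: a bare single-letter prefix such as "g__" or "s__" marks a
# clade that the reference could not name at that rank.
PREFIX_ONLY = re.compile(r"^[a-z]__$")


def cell_status(cell):
    """Classify one taxonomy cell as 'named', 'unnamed' or 'missing'."""
    cell = cell.strip()
    if not cell:
        return "missing"
    if PREFIX_ONLY.match(cell):
        return "unnamed"
    return "named"


def tally_rank(taxa, rank_index, sep=";"):
    """Count cell statuses at one rank (0 = kingdom, ..., 6 = species)."""
    tally = Counter()
    for taxon in taxa:
        cells = taxon.split(sep)
        # Strings truncated above this rank count as missing.
        cell = cells[rank_index] if rank_index < len(cells) else ""
        tally[cell_status(cell)] += 1
    return tally
```

Reporting "unnamed" as its own column keeps the information that the query matched a reference clade, while still not crediting it as a species-level name.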