My questions concern the use of the Silva and Greengenes databases to assign taxonomy in 16S rRNA gene sequencing studies. I recognize that each database has its merits and drawbacks, so I am evaluating which of the two is better suited to my data. To do this, I assigned taxonomy with both databases and calculated, for each taxonomic rank (from domain to species), the percentage of reads that each database was unable to assign. In other words, for Greengenes, I counted the blank cells at the domain, phylum, class, etc. rank. For Silva, I counted not only the blank cells but also cells containing any of the following: ambiguous taxa, uncultured bacterium, and uncultured. The results are shown below and include the absolute number of unassigned cells as well as their respective relative proportions:
Here, Silva and Greengenes assigned similar proportions of features at the phylum level (note that the domain value is 0 because I filtered the reads to retain bacteria only). Greengenes then assigned more features at the class and order ranks, whereas Silva assigned a greater proportion of features at the family and genus ranks. Since Silva does not curate its database to include the species level, it makes sense that Greengenes assigned more features there. Has anyone else observed this pattern, in which Greengenes assigns more features than Silva at the class and order ranks? This seems somewhat counterintuitive, considering that Silva is the larger of the two databases.
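For reference, the counting rule described in the first paragraph could be sketched roughly as follows. This is only a minimal illustration, not the exact script I used: it assumes the taxonomy has already been split into one column per rank and that each row is one feature. The names `RANKS`, `SILVA_PLACEHOLDERS`, `count_unassigned`, and the toy rows are hypothetical, and matching the placeholder labels case-insensitively is also an assumption.

```python
# Rank columns, assuming the taxonomy string was already split per rank.
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]

# Labels counted as "unassigned" for Silva, as listed in the post.
# Matching is case-insensitive here (an assumption).
SILVA_PLACEHOLDERS = frozenset(
    {"ambiguous taxa", "uncultured bacterium", "uncultured"}
)


def count_unassigned(rows, placeholders=frozenset()):
    """Return {rank: (count, percent)} of unassigned cells per rank.

    A cell is unassigned if it is missing, blank, or (optionally)
    equal to one of the placeholder labels.
    """
    total = len(rows)
    summary = {}
    for rank in RANKS:
        n = 0
        for row in rows:
            label = (row.get(rank) or "").strip().lower()
            if label == "" or label in placeholders:
                n += 1
        pct = 100.0 * n / total if total else 0.0
        summary[rank] = (n, pct)
    return summary


# Hypothetical toy data, not taken from my actual results.
toy_silva = [
    {"domain": "Bacteria", "phylum": "Firmicutes", "class": "Clostridia",
     "order": "Clostridiales", "family": "Lachnospiraceae",
     "genus": "uncultured", "species": "uncultured bacterium"},
    {"domain": "Bacteria", "phylum": "Bacteroidetes", "class": "Bacteroidia",
     "order": "Bacteroidales", "family": "Ambiguous taxa",
     "genus": "", "species": ""},
]

summary = count_unassigned(toy_silva, SILVA_PLACEHOLDERS)
for rank in RANKS:
    n, pct = summary[rank]
    print(f"{rank:8s} {n:3d} {pct:6.1f}%")
```

For Greengenes, the same function would be called without the placeholder set, so that only blank cells are counted. Note that this counts rows (features) rather than weighting by read abundance, which may give different proportions.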
I looked for answers in the publication "SILVA, RDP, Greengenes, NCBI and OTT — how do these taxonomies compare?" by Monika Balvočiūtė and Daniel Huson (2017) and noticed that, in the Venn diagrams in Figure 3, the number of unique taxa in the Greengenes database increases up to the order rank and begins to decrease from family onwards, mirroring my observations. However, I am not sure whether there is a concrete reason for this similarity or whether I am seeing this pattern simply because I am searching for an answer.
So, in short, my questions are: Why does Greengenes assign more features at the class and order ranks than Silva? Based on the table provided, which database is better suited for my analysis?
Many thanks in advance!