Greengenes vs. SILVA

Hello! :wave: :qiime2:

Can somebody explain the difference between these two?

Thanks! :pray:


I think this article is a nice review of the reference taxonomies.


Hi @Tina_Khone,

Just to extend the great link provided by @the_dummy: in brief, Greengenes contains only reference sequences from Bacteria and Archaea, and it has not been updated in quite a long while.

SILVA provides two reference databases, one for the small subunit (SSU) and one for the large subunit (LSU) rRNA genes. The SSU database contains the 16S and 18S rRNA genes, and the LSU database contains the 23S and 28S rRNA genes. Both cover Archaea, Bacteria, and Eukarya. SILVA is also updated periodically.

FYI, our RESCRIPt tool allows you to easily download SSU and LSU data to make your own reference database. We have plans to allow retrieval from other databases too. Anyway, you can read more about some of our database comparisons in our preprint.



Greengenes vs. SILVA in one figure:

This is Extended Data Figure 2 from the Earth Microbiome Project main paper.

Figure caption: a, Median sequence length per study after quality trimming. Original EMP studies used 90-bp reads, which were replaced by 100-bp reads for the majority of studies, and have since been replaced by 150–151-bp reads. For most analyses presented in this manuscript, we used the Deblur algorithm and trimmed tag sequences to 90 bp. This allowed inclusion of older studies with shorter read lengths.

b, Comparison of Greengenes and SILVA rRNA databases for reference-based OTU picking. Fraction of reads in n = 23,828 biologically independent samples—separated by environment (per-environment n shown in Fig. 1a)—mapping to Greengenes 13.8 and SILVA 123 (97% identity OTUs) with closed-reference OTU picking. Boxplots show median, IQR, and 1.5 × IQR (with outliers). The fraction of reads mapping was similar between Greengenes and SILVA in each environment but slightly higher with SILVA for every environment.

c, Alpha-diversity in closed-reference OTUs picked against Greengenes 13.8 and SILVA 123, with sequences rarefied to 100,000, 30,000, 10,000, and 1,000 sequences per sample, displayed as boxplots showing median, IQR, and 1.5 × IQR (with outliers). The sample set for all calculations contained n = 4,667 biologically independent samples having at least 100,000 observations in both Greengenes and SILVA OTU tables. Alpha-diversity metrics were higher with SILVA closed-reference OTU picking than with Greengenes.

d, Beta-diversity among all EMP samples using principal coordinates analysis (PCoA) of weighted UniFrac distance. Principal coordinates PC1 versus PC2 and PC1 versus PC3 are shown coloured by EMPO levels 2 and 3. As with unweighted UniFrac distance (Fig. 2c), clustering of samples using weighted UniFrac distance could be explained largely by environment.

Note how more reads map to SILVA, and yet Greengenes still captures greater phylogenetic diversity within samples (see the higher Faith's PD values in C4).
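
If it helps to make panels b and c more concrete, here is a minimal Python sketch of the two quantities being compared: the fraction of reads that hit the reference in closed-reference OTU picking, and rarefying a sample to an even depth before computing alpha diversity. This is only a toy illustration, not the EMP or QIIME implementation; the OTU IDs, the function names `fraction_mapped` and `rarefy`, and the tiny read list are all made up for the example.

```python
import random
from collections import Counter


def fraction_mapped(read_hits):
    """Fraction of reads that hit a reference OTU (None means no hit)."""
    hits = sum(1 for otu in read_hits if otu is not None)
    return hits / len(read_hits)


def rarefy(otu_counts, depth, seed=0):
    """Subsample `depth` reads without replacement from a dict of OTU counts."""
    # Expand the counts into one entry per read, then draw without replacement.
    pool = [otu for otu, n in otu_counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("sample has fewer reads than the rarefaction depth")
    rng = random.Random(seed)
    return Counter(rng.sample(pool, depth))


# Hypothetical sample: each entry is the reference OTU a read mapped to,
# or None if the read failed to hit the reference at the identity threshold.
reads = ["otu1", "otu1", "otu2", None, "otu3", "otu1", None, "otu2"]

# Panel b: 6 of the 8 reads hit the reference.
print(fraction_mapped(reads))  # 0.75

# Panel c: keep only mapped reads, rarefy to an even depth, then count OTUs.
counts = Counter(r for r in reads if r is not None)
rarefied = rarefy(counts, depth=4)
print(sum(rarefied.values()))  # 4 reads kept
print(len(rarefied))           # observed OTUs at this depth
```

Reads that do not hit the reference are simply dropped in closed-reference picking, so a database that recruits more reads leaves more samples above a given rarefaction depth. Faith's PD additionally needs the reference tree, so it is not sketched here.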


Thank you all for your insightful answers!
