How to Sort Out Differences Between 16S rRNA Gene Sequence Databases

If I run a FASTQ through different 16S rRNA gene sequence databases (GreenGenes 1, GreenGenes 2, and RDP), I can get extremely different information at the genus level. With GreenGenes 1 I am likely getting errors because the mapping to taxonomy isn't quite right. In the case of GreenGenes 2, the taxonomy is better, but many of the genera are mysterious and have very little information in PubMed. Is there some tool that would help me figure out where taxa in GG1 are being mapped to in GG2 or other databases? I mean, if you followed the taxonomy tree higher and higher, you would likely see where the databases converge on a common answer and then be able to sort through where assignments are changing.

I can give some examples. In a GG1 screen, I have the genus Caloramator. That disappears from GG2, and I can't easily see where it was mapped instead.

In a GG1 screen, I have the Order Rhodospirillales with no further breakdown, which RDP is showing as the Genus Fodinicurvata (which falls under the same Order). If I look at the sample in GG2, there is nothing in the Order at all, so it has been completely remapped.

Tracing through such differences is daunting and very time-consuming. I have to believe there are tools that help reconcile differences in taxon assignments between databases?

Hi @pone,

This is a good question. One that I think we have all struggled with :grin:

Taxonomy is a messy business, and the nomenclature evolves over time as species and clades are reassigned in light of updated knowledge about systematics, and as the nomenclature itself is revised (e.g., entire phyla have been renamed recently!). So old databases like GG1 (~11 years old now) will contain obsolete names that do not reflect the most recent nomenclature.

Additionally, these taxonomies all effectively follow different nomenclature systems. RDP is based on the International Journal of Systematic and Evolutionary Microbiology and the List of Prokaryotic names with Standing in Nomenclature. GG2, I think, is based on the GTDB taxonomy (check out GTDB and their papers to learn more). GG1 is neither of these. So mapping between them will be quite challenging, as the nomenclature is not standardized across databases.
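That said, the "walk up the tree until the databases agree" idea from the question can at least be sketched by hand. This is only a rough illustration, not an existing tool: the function names are made up, it assumes Greengenes-style semicolon-separated lineage strings (with or without rank prefixes like `k__` or `g__`), and the example lineages are illustrative rather than copied from any database release. It compares names by plain string matching, so a renamed clade will look like a disagreement even when it is the same organism.

```python
from typing import List, Optional, Tuple

# Rank order assumed for all inputs (an assumption; real exports may differ).
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def parse_lineage(lineage: str) -> List[str]:
    """Split a semicolon-separated lineage into bare names.

    Stops at the first empty rank (e.g. 'f__'), treating everything
    below it as unassigned.
    """
    names = []
    for part in lineage.split(";"):
        part = part.strip()
        # Drop a rank prefix such as 'k__', 'd__' or 'g__' if present.
        if len(part) >= 3 and part[1:3] == "__":
            part = part[3:]
        if not part:
            break
        names.append(part)
    return names[: len(RANKS)]


def deepest_shared_rank(a: str, b: str) -> Optional[Tuple[str, str]]:
    """Return (rank, name) for the deepest rank where both lineages agree."""
    shared = None
    for rank, x, y in zip(RANKS, parse_lineage(a), parse_lineage(b)):
        if x.lower() != y.lower():
            break
        shared = (rank, x)
    return shared


# Illustrative strings only; check the real lineages in your own outputs.
gg1_like = ("k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; "
            "o__Rhodospirillales; f__; g__; s__")
rdp_like = ("Bacteria; Proteobacteria; Alphaproteobacteria; "
            "Rhodospirillales; Rhodospirillaceae; Fodinicurvata")

print(deepest_shared_rank(gg1_like, rdp_like))
```

Running this per feature on two classifier outputs would show the rank at which each pair of assignments diverges, which is a starting point for manual checking. It will not resolve renamed or split clades; that needs a synonym table or sequence-level comparison.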

This paper will not answer your question or give you a way of mapping between these, but will show you just how different results can be between databases due to the nomenclature differences (see figure 5):

This paper might be of interest to you — though it is old now and might not reflect the taxonomies that you are working with:
