Taxonomic discrepancy with Greengenes, Silva and NCBI

Hello!
I am doing a bacterial taxonomic analysis of fungal tissue samples (16S V3-V4) using QIIME 2 2020.8 (I am having some problems updating it and have posted another topic about that) with the Greengenes (gg-13-8-99-nb-classifier.qza) and SILVA (silva-138-99-nb-classifier.qza) databases.
In the results obtained with Greengenes, I get this taxon:

k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Ewingella | 47.868%
In the results obtained with SILVA, I get this taxon for the same sample:

d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Yersiniaceae;g__Serratia | 63.296%

Since I wasn't sure which of the two genera is the right one, and its presence is clearly important in the analysis, I filtered my table.qza and ref_seqs.qza using qiime feature-table and got the sequence, which has a Confidence of 7,039575418E+15. Then I went to the NCBI website and ran BLAST on the sequence, obtaining:
Hafnia alvei with a Query cover of 100% and a Per. ident of 99.35%

How can I be sure whether the genus is Ewingella, Serratia or Hafnia? Any idea how to identify this genus/species accurately?

Thank you in advance!!!

Marisa

Hi @Marisa_Tello_Martin ,

Thank you for raising this important point. These databases all use distinct taxonomic systems, and different types/levels of curation. So the discrepancy is known and expected, and if you start looking deeper you will find that many other discrepancies exist.

Microbial systematics are also always in a state of flux as lineages are regularly added and revised. So accepted taxonomic nomenclature is also gradually changing over time.

So with this in mind, there are a couple of important points to consider:

  1. The Greengenes gg-13-8-99-nb-classifier.qza is very old: the "13" in the name stands for "2013", the last time that database was updated, so it uses a very old taxonomy. There is a newer version, Greengenes2, that is distinct from the one you are using. I recommend using the newer version (but note: you will still find discrepancies with SILVA and NCBI, as these still use slightly different taxonomies).
  2. NCBI BLAST with default settings searches the core_nt database, which contains many uncurated sequences that are misannotated and may use old or incorrect taxonomic names. You could BLAST against the 16S RefSeq sequences instead to use a curated resource, but I would not trust anything that I see with core_nt without looking deeper.
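
To see exactly where the two classifications in the original post part ways, a rank-by-rank comparison can help. The following is only a minimal sketch in plain Python, not part of QIIME 2: the helper names `parse_lineage` and `compare_lineages` are hypothetical, and it assumes both lineage strings list the same ranks in the same order, separated by `;` with `x__` prefixes.

```python
# Minimal sketch (not a QIIME 2 API): compare two lineage strings rank by rank.
# Assumption: both strings list the same ranks in the same order.

RANK_NAMES = {
    "k": "kingdom", "d": "domain", "p": "phylum", "c": "class",
    "o": "order", "f": "family", "g": "genus", "s": "species",
}


def parse_lineage(lineage):
    """Split 'p__Proteobacteria;c__...' into ordered (prefix, name) pairs."""
    pairs = []
    for field in lineage.split(";"):
        field = field.strip()
        if not field:
            continue
        prefix, _, name = field.partition("__")
        pairs.append((prefix, name))
    return pairs


def compare_lineages(lineage_a, lineage_b):
    """Return (rank, name_a, name_b) for every position where the names differ."""
    diffs = []
    # zip pairs entries by position, so the top rank ('k__' vs 'd__') is
    # compared by name only; the rank label is taken from the first lineage.
    for (prefix_a, name_a), (_, name_b) in zip(
        parse_lineage(lineage_a), parse_lineage(lineage_b)
    ):
        if name_a != name_b:
            diffs.append((RANK_NAMES.get(prefix_a, prefix_a), name_a, name_b))
    return diffs


# The two assignments quoted in the original post.
greengenes = (
    "k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;"
    "o__Enterobacteriales;f__Enterobacteriaceae;g__Ewingella"
)
silva = (
    "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;"
    "o__Enterobacterales;f__Yersiniaceae;g__Serratia"
)

for rank, gg_name, silva_name in compare_lineages(greengenes, silva):
    print(f"{rank}: Greengenes={gg_name!r}  SILVA={silva_name!r}")
```

For these two strings the names already differ at the order level (Enterobacteriales vs. Enterobacterales), which is a difference in naming convention between the two taxonomies rather than evidence of a different organism.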

I hope that helps!

Thank you for your quick answer @Nicholas_Bokulich

So, in your opinion I should try to use a more "modern" database. Is it possible to use Greengenes2 with version 2020.8 of QIIME 2? Any advice on updating QIIME 2 on my Ubuntu 22.04.5 LTS?

Thanks again!

Hi @Marisa_Tello_Martin,

You are using a very old version of QIIME 2 (2020.8). I'd strongly recommend installing the latest version (2024.10) as a separate environment rather than trying to "update" the existing one. There are many improvements and additional functionality in the latest version. This is one of the benefits of conda environments: you can keep multiple versions of QIIME 2 on your system and just activate the version you need. :)

You can install Greengenes2 from here. More details are here.

You can use the RESCRIPt plugin to download and curate SILVA, GTDB, & RDP databases. Note this plugin comes with QIIME 2. See the GitHub page for more tutorials.

Sometimes the problem is that with only 1-2 variable regions you can get 100% identity even between different families.
Anyway, you can try a well-curated database like MIMt, whose curated version is composed only of sequences from Targeted Loci, type material and manually curated RefSeq genomes. You can find it at https://mimt.bu.biopolis.pt. The only curated version is M2c, and it was updated recently, in November.
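
To illustrate the point about short variable regions, here is a small sketch with made-up toy sequences (hypothetical, NOT real 16S data) that are already aligned to the same length. The helper `percent_identity` is an illustrative name, not a function from BLAST or QIIME 2. It shows how two sequences can be 100% identical over an amplified sub-region while still differing over the full length.

```python
def percent_identity(seq_a, seq_b, start=0, end=None):
    """Percent of identical positions between two pre-aligned,
    equal-length sequences, restricted to the slice [start, end)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    region_a = seq_a[start:end]
    region_b = seq_b[start:end]
    if not region_a:
        raise ValueError("empty region")
    matches = sum(1 for x, y in zip(region_a, region_b) if x == y)
    return 100.0 * matches / len(region_a)


# Hypothetical toy sequences: identical in the first 20 positions
# (standing in for the amplified region), different afterwards.
taxon_a = "ACGTACGTACGTACGTACGT" + "AAAAACCCCC"
taxon_b = "ACGTACGTACGTACGTACGT" + "AAAATCCCGC"

print("amplified region:", percent_identity(taxon_a, taxon_b, 0, 20))
print("full length:     ", round(percent_identity(taxon_a, taxon_b), 2))
```

In this toy case the sub-region alone cannot separate the two taxa, and only the positions outside it carry the distinguishing signal.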