Removing suffixes from GTDB taxonomy to improve qiime results

biojack · June 15, 2022, 5:03am

Hi all

In GTDB there are many taxa that differ only by a one-letter suffix. Would it be better to remove those suffixes to improve classifier results?

For example, there are

Bacteria;Firmicutes_A;Clostridia;Lachnospirales;Lachnospiraceae;Hungatella

and

Bacteria;Firmicutes_A;Clostridia;Lachnospirales;Lachnospiraceae;Hungatella_A

genera. So, my concern is that if species under those genera have similar sequences, the classifier will underclassify at the family level, like

Bacteria;Firmicutes_A;Clostridia;Lachnospirales;Lachnospiraceae
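A rough illustration of the concern above: if two reference lineages cannot be told apart, the most that can safely be reported is the part they share. This is a simplified sketch, not how the QIIME 2 classifier actually computes its assignments, and the helper name shared_lineage is made up for this example.

```python
def shared_lineage(lineage_a, lineage_b, sep=";"):
    """Return the ranks two lineages have in common, from the root down."""
    shared = []
    for rank_a, rank_b in zip(lineage_a.split(sep), lineage_b.split(sep)):
        if rank_a != rank_b:
            break  # first rank where the two lineages diverge
        shared.append(rank_a)
    return sep.join(shared)


a = "Bacteria;Firmicutes_A;Clostridia;Lachnospirales;Lachnospiraceae;Hungatella"
b = "Bacteria;Firmicutes_A;Clostridia;Lachnospirales;Lachnospiraceae;Hungatella_A"

# The two genera differ only at the last rank, so the shared part stops at family.
print(shared_lineage(a, b))
# Bacteria;Firmicutes_A;Clostridia;Lachnospirales;Lachnospiraceae
```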

Would it then be better just to rename Hungatella_A to Hungatella?

Or, there are several different Firmicutes:

Bacteria;Firmicutes_A
Bacteria;Firmicutes_B
Bacteria;Firmicutes_C
Bacteria;Firmicutes_D
Bacteria;Firmicutes_E
Bacteria;Firmicutes_F

Would it be better to rename those to just Bacteria;Firmicutes, at least to avoid confusing users who look at the final results?

So would it be better to rename ALL such cases by removing the suffixes, to improve qiime results? Or could such renaming lead to serious mistakes in the analysis?

Thank you very much for your attention.
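For concreteness, the renaming asked about here could be sketched in Python as below. The assumptions are mine, not from GTDB documentation: that a suffix is an underscore followed by one or more capital letters at the end of a rank name, and that lineages are semicolon-separated as in the examples above. The function name strip_suffixes is hypothetical. Note that the result merges clades that GTDB treats as distinct.

```python
import re

# Assumption: a suffix looks like "_A", "_B", ... at the very end of a rank name.
SUFFIX = re.compile(r"_[A-Z]+$")


def strip_suffixes(lineage, sep=";"):
    """Remove a trailing letter suffix from every rank in a lineage string."""
    return sep.join(SUFFIX.sub("", rank) for rank in lineage.split(sep))


lineage = "Bacteria;Firmicutes_A;Clostridia;Lachnospirales;Lachnospiraceae;Hungatella_A"
print(strip_suffixes(lineage))
# Bacteria;Firmicutes;Clostridia;Lachnospirales;Lachnospiraceae;Hungatella
```

Names without a suffix pass through unchanged, so the function can be applied to every row of a taxonomy table.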

jwdebelius · June 15, 2022, 12:41pm

Hi @biojack,

You don't want to get rid of those suffixes at all! The suffix indicates that it's a phylogenetically distinct clade. This is one of the major advantages of GTDB to me: that the clades are phylogenetically distinct. (It took me a while to figure this out and figure out what to do with it.) Based on past experiences with NB classifiers and this taxonomy, you're not going to have any issues at the phylum level. I can't speak to lower taxonomic levels (I haven't experimented with them), but it will definitely be distinct at the phylum level.

As far as renaming goes: if you're going to rename, my recommendation would be to consider a different database with a slightly more standard taxonomy. Silva is probably the most popular for amplicon sequencing. Personally, I think the difference in names in phylogenetically distinct clades is a feature, not a bug, and I hope the people who name bacteria will rename the phyla in better ways, since they're already messing with everything else.

Best,
Justine

SoilRotifer · June 15, 2022, 1:12pm

Hi @biojack, I agree with @jwdebelius: those are features, not bugs.

If you are truly interested in editing the taxonomy, then you can use qiime rescript edit-taxonomy ... to edit the taxonomy strings. This can be quite useful for making figures, etc.

biojack · June 15, 2022, 1:25pm

@jwdebelius @SoilRotifer thank you for the clear answers! In general I agree that it's a feature, not a bug : ) But that's for bioinformaticians. If someone far from science looks at the final report, they could be confused: why are there so many Firmicutes there? As for using Silva, there is a big problem, since for a lot of taxa the genus and species do not match. It would take a lot of effort to clean the Silva taxonomy, and GTDB has no such problem. That's why I was thinking about renaming the GTDB taxonomy, as a kind of trade-off between using Silva and the original GTDB.

biojack · June 15, 2022, 2:06pm

@jwdebelius one important thing I forgot to mention: I am analysing the 16S V4 region. That is also why I am concerned about underclassification, since within V4 species from different clusters are more likely to look similar than across the whole 16S gene.

SoilRotifer · June 15, 2022, 2:08pm

This is not true. There are currently many changes occurring in microbial taxonomy, and these are being reflected in different ways across the many reference databases. In recent years there has been a growing literature on this very subject, particularly in reference to the taxonomic assignment of unknown / undescribed metagenome-assembled genomes (MAGs), which GTDB and other reference databases are trying to help resolve.

I'd not worry too much about this, as this can be explained in your analysis descriptions and figure legends, with appropriate references to GTDB that explain these. You can guide the reader with the appropriate references to inform them of the latest information. Which is what science is all about.

The genus / species mismatches in SILVA arise because SILVA does not curate down to the species level. Although we provide an option to append the organism names in RESCRIPt as the species labels, we warn users that this can be erroneous. Also, SILVA contains eukaryal sequences, which GTDB currently does not, and that is useful to many.

All databases have issues, including GTDB, SILVA, GreenGenes, etc. Curating databases is no easy task, but they will all improve over time. It comes down to which database best serves your current questions.

But it looks like you are doing a great job vetting your needs for your research! Good luck!

biojack · June 23, 2022, 3:40pm

Hi, all.

I did some investigation of my own to clear up a few open points.

On some gut samples (~500 of them) I ran classification with both the Silva and GTDB databases and looked at some bacteria. I was surprised to find Enterococcus in some samples according to Silva but not according to GTDB.

Then I removed the suffixes from GTDB and repeated the analysis. This time, Enterococcus was found with GTDB just as with Silva. The problem was the suffixes: "Enterococcus", "Enterococcus_A", "Enterococcus_B", "Enterococcus_C", "Enterococcus_D", "Enterococcus_E", "Enterococcus_F", "Enterococcus_G", "Enterococcus_I", "Enterococcus_H". The classifier failed to distinguish them, so the Enterococcus genus was underclassified.

So if you are doing (as I do) analysis of a small region (like 16S V4), removing the GTDB suffixes could be the way to go.
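To see how many labels would be merged by this kind of renaming, one could group genus names by their suffix-free base name. This is only a sketch: the suffix pattern (an underscore plus capital letters at the end of the name) is an assumption, group_by_base is a made-up helper, and the input list just reuses names mentioned in this thread rather than a real GTDB export.

```python
import re
from collections import defaultdict

# Assumption: a suffix looks like "_A", "_B", ... at the end of a genus name.
SUFFIX = re.compile(r"_[A-Z]+$")


def group_by_base(genera):
    """Map each suffix-free genus name to the original labels it covers."""
    groups = defaultdict(list)
    for genus in genera:
        groups[SUFFIX.sub("", genus)].append(genus)
    return dict(groups)


genera = [
    "Enterococcus", "Enterococcus_A", "Enterococcus_B", "Enterococcus_C",
    "Enterococcus_D", "Enterococcus_E", "Enterococcus_F", "Enterococcus_G",
    "Enterococcus_I", "Enterococcus_H", "Hungatella", "Hungatella_A",
]

for base, members in group_by_base(genera).items():
    print(base, len(members))
# Enterococcus 10
# Hungatella 2
```

A base name with many members is one where stripping suffixes merges many distinct GTDB clades into a single label, which is worth stating in any report that uses the renamed taxonomy.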

Good luck!

SoilRotifer · June 23, 2022, 4:31pm

Thanks for sharing this!

We discuss a similar issue in the SILVA tutorial, under the "Species-labels: caveat emptor!" section.

-Mike
