Would it be better to rename those to just Bacteria;Firmicutes, at least to avoid confusing users who look at the final results?
So would it be better to rename ALL such cases by removing the suffixes, to improve the QIIME results? Or could such renaming lead to serious mistakes in the analysis?
You don't want to get rid of those suffixes at all! The suffix indicates that it's a phylogenetically distinct clade. This is one of the major advantages of GTDB to me: that the clades are phylogenetically distinct. (It took me a while to figure this out and figure out what to do with it.) Based on past experience with naive Bayes (NB) classifiers and this taxonomy, you're not going to have any issues at the phylum level. I can't speak to lower taxonomic levels (I haven't experimented with them), but it will definitely be distinct at the phylum level.
As far as renaming goes: if you're going to rename, my recommendation would be to consider a different database with a slightly more standard taxonomy. Silva is probably the most popular for amplicon sequencing. Personally, I think the difference in names for phylogenetically distinct clades is a feature, not a bug, and I hope the people who name bacteria will rename the phyla in better ways, since they're already messing with everything else.
If you are truly interested in editing the taxonomy, then you can use qiime rescript edit-taxonomy ... to edit the taxonomy strings. This can be quite useful for making figures, etc.
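To make the idea of editing taxonomy strings concrete, here is a minimal plain-Python sketch of an explicit search/replace on GTDB-style labels. It does not call RESCRIPt and is not how that plugin is implemented. The `RENAMES` map, the `relabel` helper, and the `"; "` rank separator are all assumptions for illustration.

```python
# Hypothetical rename map: suffixed GTDB phylum label -> label to show in figures.
RENAMES = {
    "p__Firmicutes_A": "p__Firmicutes",
    "p__Firmicutes_B": "p__Firmicutes",
}


def relabel(taxon, renames):
    """Replace whole rank labels in a semicolon-separated taxonomy string."""
    # Split into ranks so that only exact rank labels are replaced,
    # never a fragment of a longer name.
    ranks = [rank.strip() for rank in taxon.split(";")]
    return "; ".join(renames.get(rank, rank) for rank in ranks)


print(relabel("d__Bacteria; p__Firmicutes_A; c__Clostridia", RENAMES))
```

Matching whole rank labels rather than substrings keeps the edit narrow: only the names listed in the map change, and everything else in the string is passed through untouched.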
@jwdebelius @SoilRotifer thank you for the clear answer! In general I agree that it's a feature, not a bug : ) But that holds for bioinformaticians. If someone far from science looks at the final report, they could be confused about why there are so many Firmicutes there. Using Silva is a big problem, since a lot of taxa have no match between genus and species. It would take a lot of effort to clean the Silva taxonomy, and GTDB has no such problem. That's why I was thinking about renaming the GTDB taxonomy: a kind of trade-off between using Silva and the original GTDB.
@jwdebelius one important thing I forgot to mention: I am analyzing the 16S V4 region. That's also why I'm concerned about underclassification, since within V4 species from different clusters are more likely to look similar than across the whole 16S gene.
This is not true. There are currently many changes occurring in microbial taxonomy, and these are being reflected in different ways across the many reference databases. In recent years there has been increasing literature on this very subject, particularly in reference to the taxonomic assignment of unknown / undescribed metagenome-assembled genomes (MAGs), which GTDB and other reference databases are trying to help resolve.
I'd not worry too much about this, as it can be explained in your analysis descriptions and figure legends, with appropriate references to GTDB. You can guide the reader to the latest information that way, which is what science is all about.
This is because SILVA does not curate down to the species level. Although we provide an option in RESCRIPt to append the organism names as the species labels, we warn users that this can be erroneous. Also, SILVA contains eukaryal sequences, which GTDB currently does not, and that is useful to many.
All databases have issues, even GTDB, SILVA, GreenGenes, etc. Curating databases is no easy task, but they will all improve over time. It comes down to which database best serves your current questions.
But it looks like you are doing a great job vetting your needs for your research! Good luck!
I did some investigation of my own to clear up some open points.
On some gut samples (~500) I ran both the Silva and GTDB databases and looked at some bacteria. I was surprised to find Enterococcus in some samples according to Silva but not according to GTDB.
Then I removed the suffixes from GTDB and repeated the analysis. This time, Enterococcus was found with GTDB just as with Silva. The problem was the suffixes: "Enterococcus", "Enterococcus_A", "Enterococcus_B", "Enterococcus_C", "Enterococcus_D", "Enterococcus_E", "Enterococcus_F", "Enterococcus_G", "Enterococcus_I", "Enterococcus_H". The classifier failed to distinguish them, which led to underclassification of the Enterococcus genus.
So if you are running (as I do) an analysis on a short region (like 16S V4), removing GTDB suffixes could be the way to go.
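For anyone who wants to try the same thing, here is a rough Python sketch of stripping the suffixes and listing which labels end up merged. It is not the exact procedure used above. The regular expression encodes an assumption about what a GTDB suffix looks like (an underscore plus capital letters at the end of a name), and the helper names and the example labels are hypothetical.

```python
import re
from collections import defaultdict

# Assumed suffix shape: "_A", "_B", ... at the end of a name, e.g. "Enterococcus_B".
# The lookbehind skips the "__" that follows rank prefixes such as "g__".
SUFFIX = re.compile(r"(?<=[A-Za-z0-9])_[A-Z]+(?=\s|;|$)")


def strip_gtdb_suffixes(taxon):
    """Return the taxonomy string with every trailing clade suffix removed."""
    return SUFFIX.sub("", taxon)


def collapsed_labels(taxa):
    """Map each stripped label to the distinct original labels it merges."""
    groups = defaultdict(set)
    for taxon in taxa:
        groups[strip_gtdb_suffixes(taxon)].add(taxon)
    # Keep only the labels where stripping actually merged something.
    return {base: sorted(members) for base, members in groups.items() if len(members) > 1}


# Made-up labels for illustration only.
example = ["g__Enterococcus", "g__Enterococcus_A", "g__Enterococcus_B", "g__Lactobacillus"]
print(collapsed_labels(example))
```

Listing the merged labels before retraining a classifier shows how many phylogenetically distinct clades are being folded into one name, which is the trade-off discussed earlier in the thread.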