Why do we get higher species level assignment with an older GTDB classifier than with a more recent one?

Dear QIIME2 community,

I have a dataset of full-length 16S reads (synthetic long-read technology). I have tried to assign taxonomy (using dada2 in R) with three different classifiers: GTDB version 4.2, GTDB version 4.5, and Greengenes 2 (from September 2024). I then checked the fraction of reads that were assigned at each taxonomic level with each of these three classifiers, and I get the results shown below:

As expected, regardless of the classifier we get very high taxonomic assignment all the way down to the genus level. However, two interesting patterns emerge. First, the older version of GTDB (version 4.2) assigned substantially more sequences to the species level than the more recent one (version 4.5). Second, with the Greengenes 2 classifier more ASVs were assigned from the kingdom to the genus level (but marginally fewer at the species level relative to the most recent GTDB classifier).

I then repeated the above, but filtered out any ASVs that were not assigned at the phylum level (within each classifier method) and also removed any sequences classified as chloroplast or mitochondria (there were none of the latter). I get the following results (the number of ASVs in each plot title refers to the number of ASVs kept after filtering from an initial dataset of 951 ASVs):
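For reference, the filtering step itself looks roughly like this. This is only a minimal phyloseq sketch: the object name ps and the rank labels used for chloroplasts/mitochondria are assumptions, since those labels differ between databases.

library(phyloseq)

# Keep only ASVs with a phylum-level assignment, then drop anything labelled
# as chloroplast or mitochondria (rank labels assumed here; check where your
# database places them)
ps_filtered <- subset_taxa(ps, !is.na(Phylum))
ps_filtered <- subset_taxa(ps_filtered, is.na(Order) | Order != "Chloroplast")
ps_filtered <- subset_taxa(ps_filtered, is.na(Family) | Family != "Mitochondria")
ntaxa(ps_filtered)  # number of ASVs kept after filtering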

My questions are simple enough :wink: :

  1. Why do I get higher resolution of taxonomic assignment (overall, but particularly at the species level) with an older GTDB classifier than with a more recent version? Is it because, as databases grow, we have less confidence in species assignment? (In which case the newer classifier is the more reliable one despite fewer ASVs being assigned at the species level, but in any case species assignments should be taken with a pinch of salt and we should focus on genus-level assignments?)

  2. Ultimately the goal is to assess which classifier performs better for this study, but I have tried similar analyses on other datasets and I seem to get similar results (so far). Based on these results, which classifier would you prioritise? Greengenes 2 seems to perform better at the higher taxonomic levels. Would you choose Greengenes 2 despite its lower taxonomic assignment at the genus level (after filtering out ASVs not assigned at the phylum level)?

I am very curious to hear your insights and many thanks in advance!

All the best,

Mark


Hello!

You are right! There is a paper that I think is relevant: Database size positively correlates with the loss of species-level taxonomic resolution for the 16S rRNA and other prokaryotic marker genes

Did you use the 16S GTDB species-representatives database or the full implementation ('all')? You can use the same version of the database but a different implementation (representatives or all) and compare the output.
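A minimal sketch of that comparison with dada2, assuming your ASV sequences are in a sequence table called seqtab (the two reference file names are placeholders for whichever 'all' and 'SpeciesReps' builds you download):

library(dada2)

# Classify the same sequences against both builds of the same GTDB release
seqs <- getSequences(seqtab)
tax_all  <- assignTaxonomy(seqs, "GTDB_r220_all.fa.gz", multithread = TRUE)
tax_reps <- assignTaxonomy(seqs, "GTDB_r220_SpeciesReps.fa.gz", multithread = TRUE)

# Fraction of ASVs with a name at each rank, for each build
colMeans(!is.na(tax_all))
colMeans(!is.na(tax_reps))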

I would go for either the latest GTDB or GG2. I would not use older versions of GTDB because of their outdated taxonomies.

Hope that helps,


Hi Mark!

Great question.

This reminds me of when Robert Edgar worked on taxonomy in 2018 and wrote up a full post about why his brand new SINTAX method couldn't predict species:
https://www.drive5.com/usearch/manual/sintax_species.html

If you're using short tags like V4, then it often happens that two or more species have identical tag sequences, making it impossible to identify which species you're looking at. This scenario might not be detectable from the database because the vast majority of species have not been named by taxonomists and do not appear in the RDP training set, so there could be a novel species with an identical sequence. In other words, the reference database is sparse: it has missing data --- lots of missing data.
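You can see this ambiguity directly with dada2's exact species matching by allowing ties. This is only a sketch: the reference file name is a placeholder, and I'm assuming your ASVs sit in a dada2 sequence table called seqtab.

library(dada2)

# Exact species matching with ties allowed. ASVs whose sequence exactly matches
# reference sequences from more than one species come back with several names
# separated by "/", which is exactly the ambiguity described above.
spec <- assignSpecies(getSequences(seqtab), "species_assignment_reference.fa.gz",
                      allowMultiple = TRUE)

# How many ASVs hit more than one species exactly?
sum(grepl("/", spec[, "Species"]), na.rm = TRUE)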


On that note, it looks like the graphs you shared show total annotations, i.e. the sum of true positives and false positives.

I would love to see these same graphs split by correct and confidently-incorrect annotations (TP and FP), which would put the unannotated features (TN and FN) into perspective.

Getting to 100% annotation is easy; just assign each ASV to its top hit in the database!
Or better yet, just call it L. colini in my honor!

(This is a joke. I don't recommend maximizing false positives. But it's better to do this intentionally than by mistake.)


Hi Timur and Colin,

Many thanks for your feedback! This is really useful already. I will try out your suggestions next week and post the results here.

Till then, have a nice weekend!

Mark


Hi all,

So I went back to this analysis. To provide more information in answer to Timur and Colin, I used the following code to assign taxonomy:

taxtab <- assignTaxonomy(seqtab, paste0(dataDir,"/GTDB_bac120_arc122_ssu_r202_fullTaxo.fa.gz"), multithread = 1, minBoot = 80, tryRC = TRUE)

Where seqtab is, of course, the vector of sequences to be assigned, and the reference database was downloaded from here:

I realised whilst writing this that I had forgotten to add the "tryRC = TRUE" argument. So here is an update of the figure initially posted:

@timanix that's right, I used the full implementation. Am I right in thinking that to use the species representatives I need to do the following?


# Genus-level classification against the r220 genus-level reference
taxtab <- assignTaxonomy(seqtab, paste0(dataDir, "/GTDB_bac120_arc53_ssu_r220_genus.fa.gz"), multithread = 1, minBoot = 80, tryRC = TRUE)
# Exact-match species assignment against the r220 species-level reference
taxa.spec <- addSpecies(taxtab, refFasta = paste0(dataDir, "/GTDB_bac120_arc53_ssu_r220_species.fa.gz"), verbose = TRUE)

If so, here are the results (without filtering):

and after filtering for any ASVs that were not assigned at the phylum level:

At least up to the genus taxonomic level this seems to marginally improve assignment using the GTDB database.

@colinbrislawn forgive my naivety. I would love to split by correct and confidently-incorrect annotations, but I will be honest, I don't know how to go about doing that :thinking:

For openness, here is how I built the last figure presented above for the Greengenes 2 classifier (inspired by this blog: UCD Bioinformatics Core Workshop), using a phyloseq object called ps_ZEFI_GG2_clean:

# Packages used below: phyloseq, tidyr, ggplot2, gridExtra (for tableGrob) and patchwork
library(phyloseq)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(patchwork)

# Total read count per sample
readsPerSample = rowSums(otu_table(ps_ZEFI_GG2_clean))

# For each taxonomic rank, the per-sample fraction of reads carried by ASVs
# that received an assignment at that rank
fractionReadsAssigned = sapply(colnames(tax_table(ps_ZEFI_GG2_clean)), function(x){
  rowSums(otu_table(ps_ZEFI_GG2_clean)[, !is.na(tax_table(ps_ZEFI_GG2_clean))[, x]]) / readsPerSample
})

# Reshape to long format and attach the sample metadata
fractionReadsAssigned = data.frame(SampleID = rownames(fractionReadsAssigned), fractionReadsAssigned)
fractionReadsAssigned.L = pivot_longer(fractionReadsAssigned,
    cols = colnames(tax_table(ps_ZEFI_GG2_clean)), names_to = "taxlevel", values_to = "fractionReadsAssigned")
fractionReadsAssigned.L = data.frame(fractionReadsAssigned.L, sample_data(ps_ZEFI_GG2_clean)[fractionReadsAssigned.L$SampleID, ])

fractionReadsAssigned.L$taxlevelf = factor(fractionReadsAssigned.L$taxlevel,
    levels = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"))

fractionReadsAssigned.L <- fractionReadsAssigned.L[which(!is.na(fractionReadsAssigned.L$fractionReadsAssigned)), ]

# Boxplot of the fraction of reads assigned at each rank
(Assigned_GG2 <- ggplot(fractionReadsAssigned.L, aes(y = fractionReadsAssigned, x = taxlevelf)) +
  theme_bw(base_size = 12) +
  geom_boxplot(color = 'black', outlier.shape = NA) +
  geom_point() +
  ggtitle("Greengenes 2 (Sept 2024): 847 ASVs") +
  ylab("Fraction of reads assigned") +
  xlab("Taxonomic rank"))

# Mean fraction of reads assigned per rank, displayed as a table under the plot
GG2_assignment <- tapply(na.omit(fractionReadsAssigned.L$fractionReadsAssigned), fractionReadsAssigned.L$taxlevelf, mean)

(GG2_assignment_df <- data.frame(t(round(GG2_assignment, 4) * 100)))

GG2_assignment_df <- cbind(Kingdom = GG2_assignment_df$Kingdom, Phylum = GG2_assignment_df$Phylum,
    Class = GG2_assignment_df$Class, Order = GG2_assignment_df$Order, Family = GG2_assignment_df$Family,
    Genus = GG2_assignment_df$Genus, Species = GG2_assignment_df$Species)

GG2_table <- tableGrob(GG2_assignment_df)

(Assigned_GG2 / GG2_table) + plot_layout(heights = c(5, 1))

Many thanks again for your feedback!

Mark


That's okay! It's a hard question.

As an example, let's say dada2::addSpecies() was broken, and it just added s_colini to every feature.

This yields a species-level annotation of 100%.
However, most of these species level annotations are wrong!
How could this be measured experimentally?

Let's say addSpecies() added a species at random from within the known genus.
How could that be measured?

Hello!

What I meant was using a different type of the GTDB database. If you are downloading it with RESCRIPt, there is an option to switch the database type ('All' or 'SpeciesReps'). You can also download them from the GTDB website.

In my opinion (no tests performed!), the full SSU GTDB ('All') works better for tools such as BLAST with top-hit selection (check the percent identity to see whether the selection is trustworthy), while 'SpeciesReps' (species representatives only) works better with Naive Bayes classifiers.


Ah yes I understand what you mean now. Well I guess the only way to do that is by using mock communities :smile:
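Something along these lines, I imagine. This is just a sketch: mock_tax (the taxonomy table for the mock-community ASVs) and mock_expected (a data frame of the genera and species actually present in the mock) are hypothetical objects.

# Compare the classifier's calls for the mock ASVs against the known composition
called   <- as.data.frame(mock_tax)[, c("Genus", "Species")]
expected <- paste(mock_expected$Genus, mock_expected$Species)

is_called  <- !is.na(called$Species)
is_correct <- paste(called$Genus, called$Species) %in% expected

TP <- sum(is_called & is_correct)    # confidently correct (true positives)
FP <- sum(is_called & !is_correct)   # confidently incorrect (false positives)
c(TP = TP, FP = FP, unassigned = sum(!is_called))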

Nonetheless, for this dataset we can look at the bootstrap confidence for assigning each taxonomic level, which I have done today using the latest GTDB and Greengenes 2 classifiers (the values in the table are mean values at each taxonomic level):

At least in this dataset, it seems that not only are more reads assigned when using Greengenes 2, but there is also more confidence in the assignments.

Of course, the script used in my original post fixed a minimum bootstrap confidence of 80 for assigning a taxonomic level, which explains why fewer ASVs are kept when using the GTDB classifier compared to Greengenes 2.
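For anyone wanting to reproduce this, the bootstrap values come from running assignTaxonomy with outputBootstraps = TRUE. A minimal sketch (the reference file name is a placeholder):

library(dada2)

# Return the bootstrap support alongside the taxonomy
out <- assignTaxonomy(getSequences(seqtab), "reference_db.fa.gz",
                      minBoot = 80, tryRC = TRUE, multithread = TRUE,
                      outputBootstraps = TRUE)

taxtab <- out$tax    # taxonomy matrix (ranks as columns)
boots  <- out$boot   # bootstrap support (0-100) per ASV and rank

# Mean bootstrap confidence at each taxonomic rank
colMeans(boots)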


OK, I understand what you mean now. I have not invested the time (yet) to build my own classifier using RESCRIPt, but I might do so soon. Thanks for the tip.
