Dear QIIME2 community,
I have a dataset of full-length 16S reads (synthetic long-read technology). I assigned taxonomy (using dada2 in R) with three different classifiers: GTDB version 4.2, GTDB version 4.5, and Greengenes 2 (from September 2024). I then checked the fraction of reads that were assigned at each taxonomic level with each of the three classifiers, and I get the results shown below:
As expected, regardless of the classifier, we get very high taxonomic assignment all the way to the genus level. However, two interesting patterns emerge. First, the older version of GTDB (version 4.2) assigned significantly more sequences to the species level than the more recent one (version 4.5). Second, with the Greengenes 2 classifier, more ASVs were assigned from the kingdom to the genus level (but marginally fewer at the species level relative to the most recent GTDB classifier).
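To be explicit about what "fraction assigned at each taxonomic level" means, here is a minimal sketch of the calculation in plain Python. The actual analysis was done in R; the four-ASV table, the helper names `make_row` and `fraction_assigned`, and the taxon labels are made up for illustration. The sketch assumes that an unassigned rank is stored as a missing value and that each ASV counts once (weighting by read abundance would be a small change).

```python
RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]


def make_row(labels):
    """Pad a list of labels with None so that every rank has an entry."""
    padded = list(labels) + [None] * (len(RANKS) - len(labels))
    return dict(zip(RANKS, padded))


# Toy taxonomy table (hypothetical): one row per ASV, None = unassigned rank.
toy_taxa = [
    make_row(["Bacteria", "Bacillota", "Bacilli", "Lactobacillales",
              "Lactobacillaceae", "Lactobacillus", "Lactobacillus iners"]),
    make_row(["Bacteria", "Bacillota", "Bacilli", "Lactobacillales",
              "Lactobacillaceae", "Lactobacillus"]),
    make_row(["Bacteria", "Pseudomonadota"]),
    make_row(["Bacteria"]),
]


def fraction_assigned(taxa, ranks=RANKS):
    """Return {rank: fraction of ASVs with a non-missing label at that rank}."""
    n = len(taxa)
    return {
        rank: sum(1 for row in taxa if row.get(rank) is not None) / n
        for rank in ranks
    }


fractions = fraction_assigned(toy_taxa)
for rank in RANKS:
    print(f"{rank:8s} {fractions[rank]:.2f}")
```

Running the same function on the output of each classifier gives the per-rank values that the plots compare.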
I then repeated the above after filtering out any ASVs that were not assigned at the phylum level (within each classifier method) and removing any sequences classified as chloroplast or mitochondria (there were none of the latter). I get the following results (the number of ASVs in each plot title is the number of ASVs kept after filtering, from an initial dataset of 951 ASVs):
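In case the filtering step matters for the interpretation, this is roughly the logic, again sketched in plain Python rather than the R code actually used. The toy table, the helper names `make_row` and `keep_asv`, and the case-insensitive substring match on "chloroplast" and "mitochondria" are assumptions made for illustration; how organelle sequences are labelled differs between reference databases.

```python
RANKS = ["Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species"]
ORGANELLE_TERMS = ("chloroplast", "mitochondria")


def make_row(labels):
    """Pad a list of labels with None so that every rank has an entry."""
    padded = list(labels) + [None] * (len(RANKS) - len(labels))
    return dict(zip(RANKS, padded))


# Toy taxonomy table (hypothetical): one row per ASV, None = unassigned rank.
toy_taxa = [
    make_row(["Bacteria", "Bacillota", "Bacilli", "Lactobacillales",
              "Lactobacillaceae", "Lactobacillus", "Lactobacillus iners"]),
    make_row(["Bacteria", "Bacillota", "Bacilli", "Lactobacillales",
              "Lactobacillaceae", "Lactobacillus"]),
    make_row(["Bacteria"]),  # no phylum: dropped
    make_row(["Bacteria", "Cyanobacteria", "Cyanobacteriia",
              "Chloroplast"]),  # organelle label: dropped
]


def keep_asv(row):
    """Keep an ASV only if it has a phylum and no organelle label at any rank."""
    if row.get("Phylum") is None:
        return False
    labels = [value.lower() for value in row.values() if value is not None]
    return not any(term in label for label in labels for term in ORGANELLE_TERMS)


kept = [row for row in toy_taxa if keep_asv(row)]

# Fraction of the kept ASVs that still carry a species label.
species_fraction = sum(1 for row in kept if row["Species"] is not None) / len(kept)

print(f"{len(kept)} of {len(toy_taxa)} ASVs kept")
print(f"species-level fraction after filtering: {species_fraction:.2f}")
```

Because the filter is applied within each classifier, the denominator differs between classifiers, which is why the number of ASVs kept appears in each plot title.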
My questions are simple enough:
- Why do I get higher resolution of taxonomic assignment (overall, but particularly at the species level) with an older GTDB classifier than with a more recent version? Is it because, as databases grow, we have less confidence in species assignment? (In which case the newer classifier is more reliable despite assigning fewer ASVs at the species level, but in any case species assignments should be taken with a pinch of salt and we should focus on genus assignment?)
- Ultimately, the goal is to assess which classifier performs better for this study, but I have tried similar analyses on other datasets and I seem to get similar results (so far). Based on these results, which classifier would you prioritise? Greengenes 2 seems to perform better at higher taxonomic levels. Would you choose Greengenes 2 despite lower taxonomic assignment at the genus level (after filtering out ASVs not assigned at the phylum level)?
I am very curious to hear your insights and many thanks in advance!
All the best,
Mark