Is unrestricted taxonomic assignment better for estimating numbers of genera?

Jan_Kollar · September 25, 2023, 2:18pm

Hi all,

I'm analysing a small dataset of six samples with 113 ASVs. I'm assigning taxonomy through the Naive Bayes classifier. I wanted to compare (A) the results obtained with default settings with (B) the results obtained when forcing the classifier to give me the identification for the finest taxonomic level (i.e., switching off the sklearn's default confidence-based taxonomic level restriction using "--p-confidence 0" parameter). These are the exact commands I'm using:

qiime feature-classifier classify-sklearn
--i-classifier V4-classifier.qza
--i-reads filtered-rep-seqs-uganda.qza
--o-classification taxonomy-uganda-restricted.qza

qiime feature-classifier classify-sklearn
--i-classifier V4-classifier.qza
--i-reads filtered-rep-seqs-uganda.qza
--p-confidence 0
--o-classification taxonomy-uganda-non-restricted.qza

(I'm using QIIME2 v. 2022.8 in a conda environment through WSL2.)

It works well but I was at first confused by the fact that I'm receiving different numbers of genera (11 vs 14 for restricted vs unrestricted assignment) as I have expected identical results differing only in the taxonomic assignments themselves (please, see the attached picture; restricted left, unrestricted middle, what I'd expect on the right). Then, I've realized the obvious, that the difference is caused by the fact that, in the restricted taxonomic assignment, more ASVs (and thus also genera found in the unrestricted analysis) are merged within the incomplete assignments such as:
- k__Eukaryota;p__Bacillariophyta;c__Bacillariophyceae;;;__
- k__Eukaryota;p__Bacillariophyta;;;;
- k__Eukaryota;p__Bacillariophyta;c__Bacillariophyceae;o__Cocconeidales;; and so on

So my question follows. When I'm interested in the number of genera alone, isn't it actually more meaningful to extract it from the results based on the unrestricted assignment? I guess there is a good chance that, even if the taxon itself is incorrect, the number of genus-level differences (i.e. number of genera) would be estimated more accurately (leaving aside the general database-related issues) than in the default restricted analysis. What do you think? I'll be grateful for any and all ideas.

SoilRotifer · September 25, 2023, 7:19pm

Hi @Jan_Kollar,

I suggest reading through this post to get a better understanding of how the classifier works.

My off-the-cuff response would be no. The classifier is unsure of what these sequences are (at your confidence setting)..., it has no idea what genera these sequences belong to. Many of these unclassified genera might belong to the same genera, or they may belong to different genera.

That is, your results would be similar to taking the top BLAST hit. Which is not good, as the top hit (one genera) might be statistically the same as the 200th hit (another genera), and your genus-level taxonomic assignments would be arbitrary. Leading to inadvertent lumping or splitting of your ASVs by this spurious taxonomy. Then further conflated by the biases of the representatives in the database.

In this case, your estimate of the number of genera is suspect. Especially, given that the taxonomy of today, is not going to be the taxonomy of tomorrow.

One approach would be to summarize the number of ASVs per family, or class, or order, etc... Akin to the old genus / species ratios. To approximate some semblance of relatedness of the ASVs, you can even cluster them at 99, 98, or 97 % and use a ratio of OTUs per Family, etc.

gregcaporaso · September 25, 2023, 7:27pm

I was about to reply when @SoilRotifer did - my gut is also that genus counts would be less accurate, not more accurate, if you used a confidence threshold of 0.

Jan_Kollar · September 26, 2023, 9:05am

Hi @SoilRotifer and @gregcaporaso,

thank you very much for your prompt and thoughtful answers! I've read the suggested post (as well as the "original post" mentioned by Ben Kaehler there) - many thanks for it. I think I understand Mike's explanation and it did change my view. But your answers also raise one more question in my mind: why exactly do you think that the restricted assignment is more accurate in estimating the number of genera than the unrestricted one?

As I understand it now, the restricted assignment probably underestimates the "true" (see below) generic diversity (by lumping ASVs into these "catch-all" unassigned artificial genera), while the unrestricted can both overestimate or underestimate it. Do you consider the restricted assignment more accurate simply because it's more conservative then? Or because it's likely to be skewed in only one direction (i.e. underestimating), while the unrestricted can be skewed in both ways (i.e. both underestimating or overestimating)?

By "true" generic diversity (being aware that genus and higher taxa are arbitrary categories altogether) I mean the number of genera which are present in these samples based on the formal taxonomy of the group (unlike in bacteria, the formal taxonomy of my group is not primarily based on SSU rRNA). The formal taxonomy of my group (diatoms) is kind of special among microbes because it is still almost entirely based on morphology (not ideal, but that's another story), meaning that the taxonomic IDs within the custom reference sequence database (built from NCBI) used for the training of my classifier are also based on this morphology-based formal taxonomy.

I understand now that the best thing would probably be to ignore the genus counts altogether and focus on ASV counts only. Nevertheless, there is some notion in my community, that genera are somehow more stable taxonomic units and people... just want them In this regard, I like Mike's suggestions about genus/species ratio. Given how morphology-based our formal taxonomy is, here's how we've approached it.

In addition to metabarcoding, we've examined our samples under the light and electron microscope and identified 15 diatom genera there. Of course, there will be more but we have something to compare the V4 18S rRNA estimate to (and soon, we'll have also amplicon data from another gene, for which there is a taxonomically curated database of diatoms). For me, the question now is which of the alternative 18S-based genus count estimates (i.e. restricted gives 11 genus-level clusters with 6 identified confidently to the genus level under the default confidence threshold - see the picture above; unrestricted strategy gives 14 genera) is the most appropriate to use in such comparison. I'm starting to feel that there is no right answer to this, but in any case, counting only the confidently assigned genera (as one of my colleagues suggests, i.e. 6) seems to be definitely wrong.

SoilRotifer · September 26, 2023, 1:52pm

Hi @Jan_Kollar,

You bring up very good points. It could be that these hits happen to be "just right" for your data set.

This made me realize that I should have suggested that you try using the weighted classifiers, these are also available on the Data resources page. Here is a link to some environment specific weighted classifiers for several reference databases. More discussion can be found here.

To better explain these weighted classifiers, I often like to use the analogy of bird watching. That is, when you go bird watching you often use a local bird book, that highlights the birds known to frequent your area. Thus, when you try to identify these birds, you are referring to only to a small subset of all globally known birds. Which would make sense, as you'd not expect most birds in Africa to appear in North America. For example, there might be only one yellow bird species in your area, but globally there are likely hundreds... thousands? So, knowing that only one yellow bird is in your area, makes it easy to identify.

When we are using classifiers to identify our microbial sequences, we are essentially mapping to the global set of all known microbes. Which might not be the best thing to do in all cases, as many different taxa might have identical sequences (or nearly so) . If the taxa you are trying to disambiguate have identical sequence over the amplicon region, then the classifier might only return a family-level classification, as it is matching many global hits and taking a consensus taxonomy. But a weighted classifier (trained for a particular environment, like a local bird book), might have a better chance of returning a sensible hit to a genus level classification, because it was trained with information about which taxa (and their sequences) are most likely to be found in your given environment.

I am quite over simplifying, and the other moderators can provide more details on the nuts and bolts of how this works, assuming I explained correctly.

I will end-off saying: you appear to have much more knowledge of the taxonomy of your groups and know something about the system you are in. So, I think you can reasonably make the case, that the genera that you think are present might actually be present. Perhaps you can align these amplicons to some reference genera and see how they phylogenetically cluster? Just remember that the SSU rRNA gene is quite conserved. Even the SSU hypervariable regions are quite conserved compared to other marker genes.

You can also use RESCRIPt to curate your own reference database. For example, you might consider some of the other options for the qiime rescript dereplicate ... command compared to what is outlined in the tutorial.

This sounds like a really cool project! Reminds me of my micro-invertebrate days.

Jan_Kollar · September 27, 2023, 4:52pm

Hi @SoilRotifer,

Very interesting suggestion and an excellent explanation, will definitely try that, thank you!

Did that, helped with resolving some ASVs at genus level (and, interestingly, the unrestricted classification was correct there at the genus level as well! Plus BLAST also suggested the same as the top hit, so yay!).

Actually, I used your excellent tutorial while building my custom 18S diatom reference database, along with this one. Extremely useful, thank you! I'll check also the one you suggest.

Yeah, it's very small, very simple (relatively), and a lot of fun. With 113 ASVs, it's possible to basically pay attention to every one of them, which is kind of cool, especially compared to working with bacteria. Soon, I'll have to move to our global dataset with several hundreds of samples and I have some feeling that I'll look back to these humble beginnings with tears of pain in my eyes

Anyway, thank you very much for your time and ideas, it's a great help!

SoilRotifer · September 27, 2023, 7:25pm

No problem at all. Best wishes on your project moving forward!