Is unrestricted taxonomic assignment better for estimating numbers of genera?

Hi @Jan_Kollar,

You bring up very good points. It could be that these hits happen to be "just right" for your data set.

This made me realize that I should have suggested trying the weighted classifiers, which are also available on the Data resources page. Here is a link to some environment-specific weighted classifiers for several reference databases. More discussion can be found here.

To better explain these weighted classifiers, I often like to use the analogy of bird watching. When you go bird watching, you often use a local bird book that highlights the birds known to frequent your area. So when you try to identify a bird, you are referring to only a small subset of all globally known birds. This makes sense, as you would not expect most birds in Africa to appear in North America. For example, there might be only one yellow bird species in your area, but globally there are likely hundreds... thousands? Knowing that only one yellow bird occurs in your area makes it easy to identify.

When we use classifiers to identify our microbial sequences, we are essentially mapping to the global set of all known microbes. This might not be the best thing to do in all cases, as many different taxa might have identical (or nearly identical) sequences. If the taxa you are trying to disambiguate have identical sequences over the amplicon region, then the classifier might only return a family-level classification, because it is matching many global hits and taking a consensus taxonomy. But a weighted classifier (trained for a particular environment, like a local bird book) might have a better chance of returning a sensible genus-level classification, because it was trained with information about which taxa (and their sequences) are most likely to be found in your given environment.
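Here is a toy sketch of that idea in Python. It is not how the actual classifier is implemented; the sequence, the taxon names, the weights, and the 0.7 confidence cutoff are all made up for illustration. It only shows how environment-specific weights can break a tie between two genera that share an identical amplicon.

```python
from collections import defaultdict

# Hypothetical reference: two genera in the same family share one amplicon.
REFERENCE = {
    "ACGTACGT": ["Family_A;Genus_X", "Family_A;Genus_Y"],
}

def classify(seq, weights, confidence=0.7):
    """Return the deepest label whose share of the total weight meets `confidence`."""
    candidates = REFERENCE[seq]
    total = sum(weights[t] for t in candidates)
    posterior = {t: weights[t] / total for t in candidates}

    # Try for a genus-level call first.
    best = max(posterior, key=posterior.get)
    if posterior[best] >= confidence:
        return best

    # Otherwise pool the support at family level (the "consensus" fallback).
    family_support = defaultdict(float)
    for taxon, p in posterior.items():
        family_support[taxon.split(";")[0]] += p
    best_family = max(family_support, key=family_support.get)
    if family_support[best_family] >= confidence:
        return best_family
    return "Unassigned"

# "Global bird book": every taxon is equally likely.
uniform = {"Family_A;Genus_X": 1.0, "Family_A;Genus_Y": 1.0}
# "Local bird book": Genus_X is assumed to be far more common in this habitat.
weighted = {"Family_A;Genus_X": 9.0, "Family_A;Genus_Y": 1.0}

uniform_call = classify("ACGTACGT", uniform)
weighted_call = classify("ACGTACGT", weighted)
print(uniform_call)
print(weighted_call)
```

With uniform weights the two genera split the support evenly, so the call falls back to the family. With the habitat weights, one genus carries enough support to be reported.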

I am oversimplifying quite a bit, and the other moderators can provide more details on the nuts and bolts of how this works, assuming I explained it correctly. :slight_smile:

I will end by saying that you appear to have much more knowledge of the taxonomy of your groups, and you know something about the system you are working in. So I think you can reasonably make the case that the genera you think are present might actually be present. Perhaps you can align these amplicons to some reference genera and see how they cluster phylogenetically? Just remember that the SSU rRNA gene is quite conserved. Even the SSU hypervariable regions are quite conserved compared to other marker genes.

You can also use RESCRIPt to curate your own reference database. For example, you might consider some of the other options for the qiime rescript dereplicate ... command compared to what is outlined in the tutorial.
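To give a feel for why the dereplication choice matters, here is a small Python toy. It does not use RESCRIPt itself, and the labels are invented. It mimics three strategies I understand to be comparable to the dereplication modes (keep every unique taxonomy, collapse to the least common ancestor, or take the majority label) for a set of identical sequences. Please check the RESCRIPt documentation for the actual option names and behavior.

```python
from collections import Counter

def lca(taxonomies):
    """Keep only the leading ranks shared by every label."""
    shared = []
    for ranks in zip(*(t.split(";") for t in taxonomies)):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return ";".join(shared)

def majority(taxonomies):
    """Keep the most frequent label."""
    return Counter(taxonomies).most_common(1)[0][0]

# Hypothetical labels attached to three identical reference sequences.
labels = ["Family_A;Genus_X", "Family_A;Genus_X", "Family_A;Genus_Y"]

uniq_result = sorted(set(labels))   # one record per distinct taxonomy
lca_result = lca(labels)            # collapses to the shared family
majority_result = majority(labels)  # keeps the commonest genus

print(uniq_result)
print(lca_result)
print(majority_result)
```

Collapsing to the common ancestor discards the genus labels for these sequences before the classifier is ever trained, whereas the other two strategies keep genus-level information in the reference.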

This sounds like a really cool project! Reminds me of my micro-invertebrate days. :slight_smile:
