Taxonomic assignment at the species level

jwdebelius · August 13, 2021, 10:09pm

It sounds like a super cool project! There are a couple of issues to consider when we talk about species accuracy. I've made an expandable list because this is a relatively big post.

1. The limits of the biology

Species are kind of a weird theoretical concept in organisms that can undergo sexual reproduction, asexual reproduction, and just kind of randomly pick up DNA from friends, family and strangers. (Seriously, bacterial reproduction and sex is kinda weird, and HGT, horizontal gene transfer, is a pain.) So, even the fundamental idea of a "species" is challenging in bacteria. I'm going to link you to a wiki rabbit hole on the topic. This is further complicated by the fact that our artificial naming conventions don't actually relate to true molecular phylogeny. ( are and somehow not entirely terrifying.)

Related to this evolution/classification/biology problem is that there are "species" that are sometimes very interesting to researchers but can't actually be identified by 16S rRNA sequences. Sometimes, we can't even tell organisms apart at higher taxonomic levels, the Shigella/Escherichia problem being a classic point of frustration for fecal researchers.
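To make that concrete, here is a minimal sketch in Python. The sequences and taxon names are made-up toy data (not real 16S sequences), and the function names are hypothetical. It only illustrates the logic: if two named taxa share an identical sequence across the region you amplified, the best any method can do is report the whole group.

```python
from collections import defaultdict

# Toy, made-up reference "amplicons" (NOT real 16S data).
# taxon_a and taxon_b stand in for two named species whose
# amplified region happens to be identical.
reference = {
    "Genus_x taxon_a": "ACGTTGCAACGGTCA",
    "Genus_x taxon_b": "ACGTTGCAACGGTCA",
    "Genus_y taxon_c": "ACGTAGCAACGTTCA",
}

def group_identical(ref):
    """Group taxon names by their exact sequence."""
    groups = defaultdict(list)
    for name, seq in ref.items():
        groups[seq].append(name)
    return groups

def resolvable_label(query, ref):
    """Return every reference taxon whose sequence exactly matches the query."""
    return sorted(group_identical(ref).get(query, []))

# Two names come back for one sequence, so the "species" is ambiguous here.
hits = resolvable_label("ACGTTGCAACGGTCA", reference)
print(hits)
```

Real classifiers work on similarity rather than exact matches, but the same limit applies: the marker region has to differ between taxa before any algorithm can separate them.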

But... let's say that you want a name for your organisms because it sparks joy.

2. The limits of the database

I think it's probably worth reading the Species caveat in the RESCRIPt tutorial, but I'll reiterate a key point in terms of species and Silva: We don't know if Silva curates their species. I tried going through the Silva readme to see if I could find anything, but alas.

Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt

Related to this issue: species assignments are notoriously database dependent, and taxonomy gets renamed at random. (This is separate from the whole taxonomy ≠ phylogeny problem.)

3. Limits of your algorithm

My go-to paper on classification and quality is the Wang et al. 2007 RDP classifier paper, which is a key read for naive Bayesian classification. Specifically, I think it's worth looking at Figure 1, which describes the classification accuracy based on sequence length and taxonomic level. You'll note that they don't even describe species. Updated papers like Bokulich et al. about optimizing feature classifiers are worth a read. (It's the q2-feature-classifier paper!) The short version is that curation, or knowledge about your environment, can improve performance. You may want to look at clawback here:

Using q2-clawback to assemble taxonomic weights
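For intuition about what a naive Bayesian classifier does with a read, here is a toy sketch in Python. It is not the RDP or q2-feature-classifier implementation. It assumes the word-probability smoothing described in the Wang et al. paper, uses 4-mers instead of the paper's 8-mers so the toy data stays small, and leaves out the bootstrap confidence step. The sequences, genus names, and function names are all made up for illustration.

```python
import math

K = 4  # toy word size; the RDP paper uses 8-mers on real sequences

def kmers(seq, k=K):
    """Set of distinct overlapping words of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def word_counts(seqs):
    """For each word, count how many of the sequences contain it."""
    counts = {}
    for s in seqs:
        for w in kmers(s):
            counts[w] = counts.get(w, 0) + 1
    return counts

def train(training):
    """training: dict mapping genus name -> list of sequences."""
    all_seqs = [s for seqs in training.values() for s in seqs]
    overall = word_counts(all_seqs)
    per_genus = {g: (word_counts(seqs), len(seqs)) for g, seqs in training.items()}
    return overall, len(all_seqs), per_genus

def classify(query, model):
    """Return (best genus, log-score per genus) for a query sequence."""
    overall, n_total, per_genus = model
    scores = {}
    for genus, (counts, m_total) in per_genus.items():
        log_p = 0.0
        for w in kmers(query):
            # Word prior from the whole training set, then the
            # genus-specific conditional probability smoothed by it.
            prior = (overall.get(w, 0) + 0.5) / (n_total + 1)
            log_p += math.log((counts.get(w, 0) + prior) / (m_total + 1))
        scores[genus] = log_p
    return max(scores, key=scores.get), scores

# Made-up toy training data (NOT real 16S sequences).
training = {
    "Genus_x": ["ACGTTGCAACGGTCA", "ACGTTGCAACGGTCC"],
    "Genus_y": ["TTGACCGTAAGGCTA", "TTGACCGTAAGGCTT"],
}
model = train(training)
best, scores = classify("ACGTTGCAACGG", model)
print(best, scores)
```

In this toy setup a shorter query contributes fewer words, so the gap between the best and second-best log-scores shrinks. That is one way to think about why accuracy drops with sequence length, and why the finest taxonomic levels are the hardest to call.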

4. Implications

My last question is a largely theoretical one. Would Roseae rosa called Roseae ASV a4bc smell as sweet? Is there a distinct benefit in your analysis or your ecosystem in having the specific taxonomic label? Is enough known about your ecosystem for there to be specificity? (I have no idea about skin.)

It's probably also worth a search of the forum on this topic, because there's been a fair bit of discussion in the past:

https://forum.qiime2.org/search?q=species

Best,
Justine