Taxonomic assignment at the species level

vetalinesantana · August 13, 2021, 9:06pm

Hello everybody,

I've heard that taxonomic assignments at the species level are not accurate. Is that true? Does anyone have a scientific paper that discusses this?

I used the sklearn classifier with the UNITE and SILVA 138 databases for my studies on the skin microbiota of cats.

Thanks!

jwdebelius · August 13, 2021, 10:09pm

Hi @vetalinesantana,

It sounds like a super cool project! There are a couple of issues to consider here when we talk about species accuracy. I've made an expandable list because this is a relatively big post.

1. The limits of the biology

Species are kind of a weird theoretical concept in organisms that can undergo sexual reproduction, asexual reproduction, and just kind of randomly pick up DNA from friends, family and strangers. (Seriously, bacterial reproduction and sex are kinda weird, and horizontal gene transfer (HGT) is a pain.) So, even the fundamental idea of a "species" is challenging in bacteria. I'm going to link you to a wiki rabbit hole on the topic. This is further complicated by the fact that our artificial naming conventions don't actually relate to true molecular phylogeny. ( are and somehow not entirely terrifying.)

Related to this evolution/classification/biology problem is that "species" that are sometimes very interesting to researchers can't actually be identified by 16S rRNA sequences. Sometimes, we can't even tell organisms apart at higher taxonomic levels, the Shigella/Escherichia problem being a classic point of frustration for fecal researchers.
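To make that concrete, here's a minimal sketch of what happens when two references share an identical amplicon. Everything in it is an assumption for illustration: the sequences are made up (they are not real 16S data), the lineages are abbreviated, and `lowest_common_taxonomy` is a hypothetical helper, not part of QIIME 2 or RESCRIPt.

```python
from collections import defaultdict


def lowest_common_taxonomy(lineages):
    """Return the longest run of leading ranks shared by every lineage."""
    split = [[rank.strip() for rank in lineage.split(";")] for lineage in lineages]
    shared = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break  # first rank where the lineages disagree
        shared.append(ranks[0])
    return "; ".join(shared)


# Toy reference set: made-up sequences, abbreviated lineages.
references = {
    "ref1": ("ACGTACGTAA", "d__Bacteria; f__Enterobacteriaceae; g__Escherichia; s__coli"),
    "ref2": ("ACGTACGTAA", "d__Bacteria; f__Enterobacteriaceae; g__Shigella; s__flexneri"),
    "ref3": ("TTGACCGTAA", "d__Bacteria; f__Staphylococcaceae; g__Staphylococcus; s__aureus"),
}

# Group lineages by identical sequence, then keep only what they agree on.
by_sequence = defaultdict(list)
for sequence, lineage in references.values():
    by_sequence[sequence].append(lineage)

consensus = {seq: lowest_common_taxonomy(lins) for seq, lins in by_sequence.items()}

for seq, label in consensus.items():
    print(seq, "->", label)
```

In this toy set, the sequence shared by two references can only be labeled down to the family, while the unique sequence keeps its full lineage. No algorithm can recover a distinction that isn't in the sequence.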

But... let's say that you want a name for your organisms because it sparks joy.

2. The limits of the database

I think it's probably worth reading the Species caveat in the RESCRIPt tutorial, but I'll reiterate a key point in terms of species and Silva: We don't know if Silva curates their species. I tried going through the Silva readme to see if I could find anything, but alas.

Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt

Related to this issue: species assignments are notoriously database dependent, and taxonomy gets renamed at random. (This is separate from the whole taxonomy ≠ phylogeny problem.)
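If you want to compare labels across databases without leaning on the species rank, one option is to truncate each lineage at the genus. This is a rough sketch that assumes semicolon-delimited lineages with rank prefixes such as `g__` and `s__`; the `truncate_to_rank` helper and both example labels are hypothetical.

```python
def truncate_to_rank(lineage, rank_prefix="g__"):
    """Keep ranks up to and including the first one that starts with rank_prefix.

    If no rank carries the prefix, the lineage is returned unchanged
    (apart from whitespace normalization).
    """
    kept = []
    for rank in (part.strip() for part in lineage.split(";")):
        kept.append(rank)
        if rank.startswith(rank_prefix):
            break
    return "; ".join(kept)


# Hypothetical labels for the same feature from two different databases.
label_a = "d__Bacteria; f__Staphylococcaceae; g__Staphylococcus; s__Staphylococcus_felis"
label_b = "d__Bacteria; f__Staphylococcaceae; g__Staphylococcus; s__uncultured_bacterium"

print(label_a == label_b)                                      # species labels disagree
print(truncate_to_rank(label_a) == truncate_to_rank(label_b))  # genus labels agree
```

For whole feature tables, `qiime taxa collapse` with `--p-level` does a similar job inside QIIME 2.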

3. Limits of your algorithm

My go-to paper on classification and quality is the Wang et al. 2007 RDP classifier paper, which is a key read for naive Bayesian classification. Specifically, I think it's worth looking at Figure 1, which describes the classification accuracy based on sequence length and taxonomic level. You'll note that they don't even describe species. Updated papers like Bokulich et al. about optimizing feature classifiers are worth a read. (It's the q2-feature-classifier paper!) The short version is that curation, or knowledge about your environment, can improve performance. You may want to look at clawback here:

Using q2-clawback to assemble taxonomic weights
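On the classifier side, here's a simplified sketch of why a confidence threshold so often cuts off the species rank. It is not the actual q2-feature-classifier or RDP implementation: the `truncate_by_confidence` helper, the per-rank confidence values, and the 0.7 threshold are all assumptions chosen for illustration.

```python
def truncate_by_confidence(ranks, confidences, threshold=0.7):
    """Keep ranks from the top down until confidence drops below the threshold."""
    kept = []
    for rank, confidence in zip(ranks, confidences):
        if confidence < threshold:
            break  # everything below this rank is left unassigned
        kept.append(rank)
    return kept


# Hypothetical lineage and per-rank confidences for one sequence.
ranks = [
    "d__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Staphylococcales",
    "f__Staphylococcaceae", "g__Staphylococcus", "s__Staphylococcus_felis",
]
confidences = [0.99, 0.98, 0.95, 0.90, 0.85, 0.80, 0.40]

assigned = truncate_by_confidence(ranks, confidences)
print("; ".join(assigned))
```

With these made-up numbers, the assignment stops at the genus because the species-level confidence falls under the threshold. Confidence can only stay the same or drop as you go deeper, so the species rank is usually the first to go.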

4. Implications

My last question is a largely theoretical one. Would Roseae rosa called Roseae ASV a4bc smell as sweet? Is there a distinct benefit in your analysis or your ecosystem to having the specific taxonomic label? Is enough known about your ecosystem for there to be specificity? (I have no idea about skin.)

It's probably also worth a search of the forum on this topic, because there's been a fair bit of discussion in the past.

https://forum.qiime2.org/search?q=species

Best,
Justine

vetalinesantana · August 13, 2021, 10:52pm

Hi @jwdebelius,

Wow!!! Thank you so much for your explanations

I appreciate your expandable list!!

"species that are sometimes very interesting to researchers can't actually be identified by 16S rRNA sequences..." haha That's so true!! I was really disappointed because I didn't find sequences of various bacteria/fungi that I was sure I would find.

This whole discussion is SO interesting....I really need to go deep!!

Thank you for the articles' suggestions!! Certainly, I will take a look!

Best,
Aline