Incorrect reading with Classifiers

Hi, I was wondering if anyone could help or has experienced this. I have tried using both a self-trained and a pretrained classifier for bacterial samples, and neither is assigning a majority of the sequences in the standard. Is this normal, or could something be wrong with how the samples are processed?

Hi @hmarti,

Could you explain the issue a bit more? Is the problem that your standard isn't being assigned at all (like it's all labeled k__Bacteria), or that you can't get species-level resolution that matches the reference, or something else?

What kind of standard are you using? What 16S primers? How did you build your classifier/what classifier did you use?


Hi Justine,

The standard is being assigned, but a majority of the reads are simply k__Bacteria, with a couple unassigned, and some of those that are assigned are incorrect or get a classification not found in the standard at all. The prebuilt classifier was from the QIIME 2 data resources (Greengenes 13_8 99% OTUs full-length sequences). The standard being used is ZymoBIOMICS D6305, and the variable region is V4-5 with 515f/926r primers. For the self-built classifier I followed the Moving Pictures tutorial; however, it is the first time I've built one, so I am not super confident in it.


Hi @hmarti,

A couple thoughts from me (and @SoilRotifer behind the scenes :slightly_smiling_face:)

It sounds like there are two potential issues. The first one is that you have a lot of reads that are being under-assigned. Do you know whether your read orientation matches your database? You could try something like a vsearch classifier or RESCRIPt's orient-seqs before you build your nb classifier to check.

If the orientation doesn't match, it can also lead to wildly misassigned reads.
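If you want a quick sanity check outside of QIIME 2, here is a minimal sketch of the idea in plain Python. It is not what vsearch or RESCRIPt do internally; it just counts how many k-mers a read shares with a reference sequence in each orientation. The function names, the k-mer size, and the toy sequences are all made up for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def kmers(seq: str, k: int = 8) -> set:
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def guess_orientation(read: str, reference: str, k: int = 8) -> str:
    """Guess whether a read matches the reference as-is or reverse-complemented.

    Compares the number of k-mers shared with the reference in each
    orientation. Returns "forward", "reverse", or "ambiguous".
    """
    read = read.upper()
    ref_kmers = kmers(reference.upper(), k)
    forward_hits = len(kmers(read, k) & ref_kmers)
    reverse_hits = len(kmers(reverse_complement(read), k) & ref_kmers)
    if forward_hits == reverse_hits:
        return "ambiguous"
    return "forward" if forward_hits > reverse_hits else "reverse"


# Toy data (hypothetical, not real 16S sequence).
toy_reference = "ATGGCTAGCTTAGGCATCCGTAAGCTGACCTGAAGTCCATG"
toy_read = toy_reference[5:30]

print(guess_orientation(toy_read, toy_reference))
print(guess_orientation(reverse_complement(toy_read), toy_reference))
```

If most of your representative sequences come back "reverse" against a few reference sequences from your database, that would point to an orientation mismatch.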

But, I think even after you fix taxonomic assignment you'll still have this issue:

With a standard and a database like greengenes 13_8 you're up against a lot of challenges.

  1. Taxonomic annotations are relatively unstable and names change faster than databases, especially databases that are 10 years old. There's stuff that's contested in greengenes (denoted with [contested name]) that has been solved in newer databases.

  2. There are things that cannot be accurately classified to whatever level they're claiming. You're not going to get an E. coli annotation. You'll get a f__Enterobacteriaceae; g__ in greengenes or a "g__Escherichia-Shigella" in Silva, because biology and what we call things don't always line up. (My friend has a :dog: called "cat". It happens)

  3. Microbiome analysis is not without cross-sample contamination. Someone wrote a whole paper on it! What ends up in your sample depends on what's around it, and on whether any index switching happened. So, you might detect taxa from outside the community in your sample.

  4. There's also a very rich literature on reagent contamination, which can introduce things you don't expect. There's a 2014 paper by Salter et al that's sort of a horror story, and a lot of recent work on the correct handling of contamination.
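Given point 2, it can help to compare your results to the standard's expected composition at a coarser rank rather than at species level. Here is a rough sketch, assuming greengenes-style taxonomy strings separated by ";" with empty ranks written like "g__". The function name and the example strings are only illustrative, not taken from your data.

```python
def truncate_taxonomy(taxon: str, level: int) -> str:
    """Keep at most `level` ranks of a greengenes-style taxonomy string.

    Stops early at the first empty rank (for example "g__"), so the
    result is the deepest informative annotation at or above `level`.
    """
    kept = []
    for rank in taxon.split(";")[:level]:
        rank = rank.strip()
        # An empty rank such as "g__" carries no information; stop here.
        if not rank or rank.endswith("__"):
            break
        kept.append(rank)
    return "; ".join(kept) if kept else "Unassigned"


# Illustrative strings only.
examples = [
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
    "o__Enterobacteriales; f__Enterobacteriaceae; g__; s__",
    "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Lactobacillaceae; g__Lactobacillus; s__",
    "k__Bacteria",
]

for taxon in examples:
    # Level 6 corresponds to genus in a seven-rank k/p/c/o/f/g/s scheme.
    print(truncate_taxonomy(taxon, 6))
```

Tallying how many reads stop at k__Bacteria versus family or genus gives you a clearer picture of where the assignments are falling short.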

You may also want to check out this paper (again H/T @SoilRotifer). I just added it to my reading list, it looks awesome!


