High level of Kingdom only for Taxonomy

Hello, back again with another question.

I’ve been attempting to assign taxonomy to my samples; however, a fairly high proportion of features in certain samples are classified only to the kingdom level. I am studying two GIT sites: one is coming back with good taxonomic classification, while the other has a high level of Kingdom_Bacteria only, ~367 features out of 1978 (only 11 features were unassigned). There is a pretty clear split between the GIT sites in what is classified, and the kingdom-only features make up as much as 96% of some of my samples…
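To put a number on this per sample, here is a minimal Python sketch that computes the fraction of reads falling in kingdom-only features. It assumes the feature table and taxonomy have already been exported into plain dictionaries; the sample names, feature IDs, counts, and the helper names `is_kingdom_only` and `kingdom_only_fraction` are all hypothetical, and none of this is a QIIME 2 API.

```python
def is_kingdom_only(taxon: str) -> bool:
    """True if a semicolon-delimited taxonomy string has exactly one named rank.

    Assumes Greengenes/SILVA-style strings such as "k__Bacteria; p__Firmicutes".
    Empty ranks like "p__" count as missing; "Unassigned" is not kingdom-only.
    """
    if taxon.strip() == "Unassigned":
        return False
    named = 0
    for rank in taxon.split(";"):
        # Drop the rank prefix ("k__", "D_0__", ...) and keep the name, if any.
        name = rank.strip().split("__", 1)[-1]
        if name:
            named += 1
    return named == 1


def kingdom_only_fraction(table: dict, taxonomy: dict) -> dict:
    """Per-sample fraction of reads that belong to kingdom-only features.

    table:    {sample_id: {feature_id: read_count}}
    taxonomy: {feature_id: taxonomy string}
    """
    fractions = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        k_only = sum(n for fid, n in counts.items()
                     if is_kingdom_only(taxonomy[fid]))
        fractions[sample] = k_only / total if total else 0.0
    return fractions


# Hypothetical toy data, not taken from the thread.
taxonomy = {
    "f1": "k__Bacteria",
    "f2": "k__Bacteria; p__Firmicutes; c__Clostridia",
    "f3": "Unassigned",
    "f4": "k__Bacteria; p__; c__",
}
table = {
    "siteA_1": {"f1": 5, "f2": 95, "f3": 0, "f4": 0},
    "siteB_1": {"f1": 90, "f2": 4, "f3": 0, "f4": 6},
}

fractions = kingdom_only_fraction(table, taxonomy)
print(fractions)  # {'siteA_1': 0.05, 'siteB_1': 0.96}
```

Sorting samples by this fraction and comparing it against site, DNA yield, or PCR cycle number is one way to see whether the split follows a technical variable.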

Initially I used a classifier I trained on gg_13_8 (97%) with 515f/806r. I then tried SILVA and got similar results. I even tried the pre-trained gg classifier.

I attempted to use classify-consensus-blast and classify-consensus-vsearch with the default settings; however, they just replaced the bacteria-level assignments with unassigned. I feel that if I were using the wrong reference sequences/classifier, the bacteria-only assignments would not be so site specific. The sequence lengths are the same for these features and the others being classified.

I have been BLASTing these, and the majority are coming up as uncultured bacterium, or as uncultured organism when I exclude uncultured bacterium… and, weirdly, a few are coming up as fungi…

Any suggestions on how I might be approaching this wrong/how this could be fixed? I can’t imagine my samples would have that many unknown bacteria…

Thanks in advance,

Hi @GillSie,
I went through a similar issue some time ago, and it turned out to be a case of non-target amplicons, which I am convinced are a lot more common than people expect in intestinal tissues (compared to, say, fecal samples). If your situation is the same, and there is reason to think it is, that link can provide some guidance. In short: get rid of those reads. You can use a positive filter similar to what deblur uses by default.
The fact that this issue is occurring only in certain samples might be a biological clue. Is there anything those samples have in common? Perhaps the way their DNA was extracted? Maybe you had very low biomass or DNA yields? Did they go through extra PCR cycles, etc.? It could just be bad luck too, with contaminants taking over.
Lastly, what are the coverage and % identity values when you BLAST these? Uncultured bacterium hits are unfortunately not very reliable in BLAST, and I personally would exclude these from my analysis.
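The coverage and % identity check can be scripted once the BLAST results are saved in tabular form. The sketch below assumes the standard 12-column tabular output (`-outfmt 6`) and a separate dictionary of query lengths. The cutoffs (97% identity, 90% coverage), the function name `summarize_hits`, and the example rows are made up for illustration and are not recommended thresholds. Coverage here is computed per alignment from `qstart`/`qend`, so it ignores multiple alignments to the same subject.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def summarize_hits(blast6_text, query_lengths,
                   min_identity=97.0, min_coverage=90.0):
    """Report % identity and query coverage for each tabular BLAST hit.

    query_lengths maps each query ID to its sequence length in bases.
    The two cutoffs are hypothetical defaults chosen for this example.
    """
    reader = csv.DictReader(io.StringIO(blast6_text),
                            fieldnames=BLAST6_COLUMNS, delimiter="\t")
    summary = []
    for row in reader:
        qlen = query_lengths[row["qseqid"]]
        # Number of query bases spanned by this alignment.
        aligned = abs(int(row["qend"]) - int(row["qstart"])) + 1
        coverage = 100.0 * aligned / qlen
        identity = float(row["pident"])
        summary.append({
            "query": row["qseqid"],
            "subject": row["sseqid"],
            "identity": identity,
            "coverage": coverage,
            "trusted": identity >= min_identity and coverage >= min_coverage,
        })
    return summary


# Hypothetical example rows, not real BLAST results.
example = "\n".join([
    "feat1\tsubjA\t99.21\t253\t2\t0\t1\t253\t10\t262\t1e-130\t460",
    "feat2\tsubjB\t88.00\t100\t12\t0\t50\t149\t300\t399\t1e-10\t60",
])
hits = summarize_hits(example, {"feat1": 253, "feat2": 253})
for hit in hits:
    print(hit["query"], hit["identity"], round(hit["coverage"], 1), hit["trusted"])
```

A hit like the second one, where less than half of the query aligns at modest identity, says little about what the sequence is, whatever name the top hit carries.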


Hi @Mehrbod_Estaki,

I’m sincerely sorry for the delay. It’s been one of those weeks!

Ooooof, I was REALLY hoping this wasn’t going to be the answer; however, my situation looks very similar to the issue you linked. As in your case, these bacteria-only sequences make up as much as 98% of certain samples. Because I am comparing the two regions, I was hoping not to lose too many samples. Both regions of the GIT were extracted the same way, but this area did tend to have lower DNA yields. The only difference between your issue and my samples is that (so far) my host DNA has not popped up in BLAST. Although I am not studying a mouse or human host, so maybe that’s why it isn’t showing up.

When BLASTing the sequences that come up as fungi, I’m getting many with 94-100% coverage at E-values of around 1e-10 and lower. I have also gotten some bacteria at similar values. However, I have also found some contaminants here and there. I might try BLASTing a few more while excluding uncultured bacteria and uncultured organisms and see what results I get, but I believe your suggestion is the option I will most likely have to follow… it stinks, as it will mean I lose a good chunk of my samples for that GIT region. What are the chances my host/GIT region is novel enough that its microbial community is relatively unknown :crossed_fingers: … No, I know, wishful thinking! Like you, I’ll most likely need to filter and forget!

Thank you for the suggestion and I will keep you posted on how things turn out.


Hi @GillSie,

That is very possible. If this is a rare species that hasn’t been previously sequenced, you may not be getting any hits simply because there are no reference sequences. It’s hard to infer any more without knowing more about your specific samples.
I know it seems like a huge loss to have to remove these reads, but consider that you are removing junk, and this is far better than the alternative, which is to analyse junk and produce faulty results.
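As a rough illustration of what removing these reads costs, the Python sketch below drops every feature that lacks an assignment below kingdom level and then lists the samples that fall under a minimum read depth. This is a taxonomy-based stand-in, not the alignment-based positive filter mentioned earlier in the thread. The data structures, the function names, the toy counts, and the depth cutoff of 50 are all hypothetical.

```python
def named_ranks(taxon: str) -> int:
    """Count non-empty ranks in a semicolon-delimited taxonomy string.

    Empty ranks such as "p__" are ignored; "Unassigned" counts as zero.
    """
    if taxon.strip() == "Unassigned":
        return 0
    return sum(1 for rank in taxon.split(";")
               if rank.strip().split("__", 1)[-1])


def drop_shallow_features(table: dict, taxonomy: dict) -> dict:
    """Keep only features classified to at least two ranks (below kingdom).

    table:    {sample_id: {feature_id: read_count}}
    taxonomy: {feature_id: taxonomy string}; missing IDs count as Unassigned.
    """
    return {
        sample: {fid: n for fid, n in counts.items()
                 if named_ranks(taxonomy.get(fid, "Unassigned")) >= 2}
        for sample, counts in table.items()
    }


def samples_below_depth(table: dict, min_depth: int) -> list:
    """Sorted IDs of samples whose total read count is under min_depth."""
    return sorted(sample for sample, counts in table.items()
                  if sum(counts.values()) < min_depth)


# Hypothetical toy data, not taken from the thread.
taxonomy = {
    "f1": "k__Bacteria",
    "f2": "k__Bacteria; p__Firmicutes; c__Clostridia",
    "f4": "k__Bacteria; p__; c__",
}
table = {
    "siteA_1": {"f1": 5, "f2": 95},
    "siteB_1": {"f1": 90, "f2": 4, "f4": 6},
}

filtered = drop_shallow_features(table, taxonomy)
lost = samples_below_depth(filtered, min_depth=50)
print(filtered)  # {'siteA_1': {'f2': 95}, 'siteB_1': {'f2': 4}}
print(lost)      # ['siteB_1']
```

Running the depth check before committing to the filter shows up front which samples would no longer survive rarefaction or a minimum-depth cutoff.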

That really depends on your environment, what the literature has on the topic, and whether you expect vastly unknown bacteria. It is always possible that you have some novel species in your data, but how to determine whether you have novel taxa, and confirm them with culture methods etc., is a loaded discussion on its own.

Hi @Mehrbod_Estaki,

You’re probably right! I went with what you said and filtered them out, and it’s time to forget about them!

Thanks for all of your input as always.
