High relative abundance at family level. How can I increase the % at genus level?

katerina_nik · January 11, 2020, 1:58pm

Hello Qiimers,
I am currently using QIIME 2 on 16S paired-end data for the V3-V4 region. Everything works smoothly up to the point where I get the relative frequencies per group. I get 63% assigned only to a specific family (e.g. Enterobacteriaceae), while the rest are taxonomically assigned at the genus level. My question is whether this is normal. Is it possible that the taxonomic analysis cannot go deeper (to the genus level) for such a large share of the features? What would you suggest in that case?

Thanks a lot!

jwdebelius · January 11, 2020, 2:01pm

Hi @katerina_nik,

It's normal for Enterobacteriaceae to be assigned only at the family level. It contains genera, such as Escherichia and Shigella, that cannot be distinguished by 16S. However, depending on your sample type, I'd be slightly concerned about having so much of that family. In relatively healthy adult humans (no C. diff, no Crohn's, no food poisoning, not in the ICU), I would expect to see 1-5% Enterobacteriaceae. I'm less certain about babies, but I think it should also be in that range. I'd be cautious with other lab-associated animals. But I'm not sure about your environment, so it may or may not be expected or appropriate.

Best,
Justine

katerina_nik · January 13, 2020, 9:35pm

Hi Justine,

Thanks for the reply. I have checked other references related to my study, and although Enterobacteriaceae are expected to be found, their relative abundance is not as high as in my case. Is there any way I could increase the percentage of the Enterobacteriaceae assigned at the genus level?

Thank you once again!

Nicholas_Bokulich · January 13, 2020, 11:44pm

What kind of sample do you have? How was it collected and stored?

katerina_nik · January 14, 2020, 3:24pm

Hi Nicholas,

They are insect samples. They were collected in the field and stored in pure ethanol at -20 °C until use.

Nicholas_Bokulich · January 14, 2020, 5:41pm

Hi @katerina_nik,

Thanks for clarifying. Enterobacteriaceae often bloom when certain sample types (like feces) are stored at room temp for even a short time without some sort of stabilizer. Sounds like that is not the case with your samples, but I just wanted to check!

As @jwdebelius noted, it is difficult to differentiate genera in this family with lots of 16S regions because those regions are identical between some of these genera. So your observation is common: most taxa are classified at genus level but Enterobacteriaceae stick out like a sore thumb.

You might be able to do better with a bit of elbow grease. A few possibilities:

1. If you have prior information about species abundances in your sample type, you could give q2-clawback a spin; see the tutorial "Using q2-clawback to assemble taxonomic weights".
2. You could check out the classify-consensus-vsearch classifier to get a "second opinion" on the Enterobacteriaceae sequences. While this classifier usually performs no better than classify-sklearn, it is a bit easier to fiddle with the parameters to adjust things like the % identity threshold. It also has options for finding exact matches and for only considering top hits.
3. You could reduce the --p-confidence with classify-sklearn. This will increase recall and reduce precision, i.e., it reduces the risk of underclassification (what you have now!) but increases the risk of getting a false-positive genus or species classification.
4. You could create a custom database of Enterobacteriaceae species (i.e., grab an existing database and exclude all species that you know could not possibly exist in your insect specimens!). I am not a fan of this approach. I highly recommend approach #1 over making a custom database, since it utilizes more information instead of throwing out information and making dubious assumptions (in other words, "never say never").
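
To make the trade-off in option 3 concrete, here is a toy Python sketch of how a confidence threshold truncates a lineage. It is only an illustration of the idea, not the actual q2-feature-classifier code: the function name `truncate_lineage` and the per-rank confidence values are made up, and 0.7 is used as the strict threshold because that is, as far as I know, the classify-sklearn default.

```python
def truncate_lineage(levels, min_confidence):
    """Keep ranks from the top down until confidence drops below the threshold.

    levels: list of (label, confidence) pairs ordered from kingdom downward.
    """
    kept = []
    for label, confidence in levels:
        if confidence < min_confidence:
            break  # stop at the first rank that is not confident enough
        kept.append(label)
    return "; ".join(kept) if kept else "Unassigned"


# Hypothetical per-rank confidences for one feature (made-up numbers).
example = [
    ("k__Bacteria", 1.00),
    ("p__Proteobacteria", 0.99),
    ("c__Gammaproteobacteria", 0.99),
    ("o__Enterobacteriales", 0.98),
    ("f__Enterobacteriaceae", 0.95),
    ("g__Escherichia", 0.55),
]

strict = truncate_lineage(example, 0.7)   # stops at the family level
relaxed = truncate_lineage(example, 0.5)  # reaches the genus level
print(strict)
print(relaxed)
```

With the lower threshold the same feature gets a genus label, but that label rests on a confidence of only 0.55, which is exactly the precision you give up.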

katerina_nik · February 10, 2020, 7:40am

Thanks a lot for your time and suggestions @Nicholas_Bokulich, and apologies for the late reply. I decided to leave it as it is, but in the meantime I would like to know which genera are included in there. I searched the forum and found that I need to use my taxonomy file to identify the feature IDs that correspond to Enterobacteriaceae, then find these feature IDs in my representative sequences file and BLAST them. I did it for 5 of them, but I was wondering if this is the right way or the only way. It takes quite some time, since there are a lot of Enterobacteriaceae feature IDs and BLAST is slow when done one sequence at a time. Am I missing anything? Thank you once again!
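
The lookup described above can at least be batched instead of done one ID at a time. Below is a minimal Python sketch, assuming the taxonomy and representative sequences were exported with `qiime tools export` to a `taxonomy.tsv` (with `Feature ID` and `Taxon` columns) and a FASTA file. The function names and the tiny inline data are hypothetical and only stand in for the real files.

```python
import csv
import io


def ids_for_taxon(taxonomy_tsv, taxon="Enterobacteriaceae"):
    """Return the feature IDs whose Taxon string contains `taxon`."""
    reader = csv.DictReader(taxonomy_tsv, delimiter="\t")
    return {row["Feature ID"] for row in reader if taxon in row["Taxon"]}


def filter_fasta(fasta, keep_ids):
    """Return FASTA text containing only the records whose ID is in keep_ids."""
    out = []
    keep = False
    for line in fasta:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token after '>'.
            parts = line[1:].split()
            keep = bool(parts) and parts[0] in keep_ids
        if keep and line:
            out.append(line)
    return "\n".join(out) + ("\n" if out else "")


# Made-up stand-ins for the exported taxonomy.tsv and FASTA files.
taxonomy = (
    "Feature ID\tTaxon\tConfidence\n"
    "abc\tk__Bacteria; p__Proteobacteria; f__Enterobacteriaceae; g__\t0.99\n"
    "def\tk__Bacteria; p__Firmicutes\t0.98\n"
)
fasta = ">abc\nACGT\nTTGA\n>def\nGGCC\n"

ids = ids_for_taxon(io.StringIO(taxonomy))
subset = filter_fasta(io.StringIO(fasta), ids)
print(subset)
```

With real data you would pass open file handles instead of `io.StringIO`, and the resulting FASTA could go to BLAST as a single batch. I believe `qiime taxa filter-seqs` can do a similar filter inside QIIME 2.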

Nicholas_Bokulich · February 10, 2020, 3:39pm

Hi @katerina_nik,
I would not recommend that approach as a reliable or efficient way to identify these. See my recommendations above.