Statistical analysis - looking for core microbiome


Which statistical analysis is preferred for answering my hypothesis?

I have bacterial metagenomic samples from two different infection sites, approximately 25 from each group. I want to test whether there is a core microbiome that is shared by the two types of infection. Do you have an opinion on which (statistical) analysis would be best?


Hello Ruben,

You have probably already found the qiime feature-table core-features action.

I’m not sure I understand what you are trying to test. I think you can report “The following microbes appear in 90% of samples in my cohort” and “but only these microbes appear in 100% of the samples in my cohort,” without needing to report a p-value.
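As a rough illustration of that kind of report, here is a minimal sketch in plain Python of what a core-features style summary boils down to: for each feature, the fraction of samples in which it is observed, and then the features that reach a chosen fraction in both groups. This is not the QIIME 2 implementation; the table layout, the function names, and the toy taxa and counts are all assumptions made up for the example.

```python
def prevalence(table):
    """Fraction of samples in which each feature has a non-zero count.

    `table` maps sample ID -> {feature ID: count}. This layout is an
    assumption for the example, not a QIIME 2 file format.
    """
    features = set()
    for counts in table.values():
        features.update(counts)
    n_samples = len(table)
    return {
        f: sum(1 for counts in table.values() if counts.get(f, 0) > 0) / n_samples
        for f in features
    }


def core_features(table, min_fraction):
    """Features observed in at least `min_fraction` of the samples."""
    return {f for f, p in prevalence(table).items() if p >= min_fraction}


# Hypothetical toy data: four samples per infection site.
site_a = {
    "A1": {"Staph": 10, "Strep": 3},
    "A2": {"Staph": 7, "Strep": 1, "Pseudo": 2},
    "A3": {"Staph": 4, "Strep": 6},
    "A4": {"Staph": 9},
}
site_b = {
    "B1": {"Staph": 2, "Pseudo": 8, "Strep": 1},
    "B2": {"Staph": 5, "Pseudo": 4, "Strep": 1},
    "B3": {"Staph": 1, "Pseudo": 9, "Strep": 2},
    "B4": {"Staph": 3, "Pseudo": 2},
}

# Report the features that are "core" in both sites at two thresholds.
for fraction in (0.75, 1.0):
    shared = core_features(site_a, fraction) & core_features(site_b, fraction)
    print(f"core in both sites at {fraction:.0%}: {sorted(shared)}")
```

Lowering the threshold makes the shared list longer, which is why it helps to report the threshold alongside the list.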

Mehrbod discusses it more in this thread. Let’s see what he has to say!



Hi @Ruben,

You can use the core-features function, which lets you look for features that are present in at least a chosen percentage of samples and so see which ones are shared between groups.
There’s a similar thread with a bit more detail on the same topic here as well.
Good luck! :spades:

Thanks! I didn’t know about that function. I’m quite new to qiime2.

I’m actually testing two things:
First, do the two kinds of infection “share” the same bacteriology?
Second, is there a core microbiome (key pathogens) shared by both infections?

For the second question I agree a p-value is not necessary. It’s more as @colinbrislawn said: I will report which microbes are frequent in both infections.

For the first question, I’m trying to understand the PERMANOVA test and whether it is relevant for my material.


Hi @Ruben,

First off, my apologies to you and @colinbrislawn. I didn’t mean to repeat his post; I think we posted our replies at the exact same time, because I never saw his while I wrote mine. But I’m glad we had the same idea, and even the same thread, in mind :wink:

I’m not entirely sure I understand what you mean by “shared” bacteriology, and especially how that differs from your second question about a core microbiome. That being said, and please correct me if I’m wrong, I think you’re referring to the (dis)similarity of the communities between the two infection groups, which the PERMANOVA test you mentioned can certainly help with. In your case I would first compute a distance matrix with qiime diversity beta, ordinate it with qiime diversity pcoa, and visualize the resulting PCoA with the qiime emperor plot tool. The choice of distance metric really depends on the experiment and the question being asked. Here’s a good brief summary of the different ones available in qiime2 and what they measure. You can use the same distance matrices to run the PERMANOVA test, which would complement the PCoA figure nicely. Here’s a great explanation of how PERMANOVA tests work in general.

The key thing to remember with these types of analysis is that they depend on the whole community in each sample and are not univariate tests. That is, they are not going to tell you which individual microbes differ between the two groups. For that type of testing you want to look at something like the ANCOM or gneiss tools available in qiime2.
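To give a feel for what PERMANOVA does with a distance matrix, here is a minimal sketch in plain Python. It computes Bray-Curtis dissimilarities, the pseudo-F statistic from the usual between-group and within-group sums of squared distances, and a p-value by shuffling the group labels. It is a teaching sketch, not the code QIIME 2 runs (there the test is available through qiime diversity beta-group-significance); the function names, the choice of Bray-Curtis, and the toy counts are assumptions for illustration.

```python
import random


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors."""
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def distance_matrix(samples):
    """Square matrix of pairwise Bray-Curtis dissimilarities."""
    n = len(samples)
    return [[bray_curtis(samples[i], samples[j]) for j in range(n)] for i in range(n)]


def pseudo_f(dm, groups):
    """PERMANOVA pseudo-F from a square distance matrix and group labels."""
    n = len(groups)
    labels = sorted(set(groups))
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sum(dm[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    # Within-group sum of squares: the same quantity, one group at a time.
    ss_within = 0.0
    for label in labels:
        idx = [i for i, g in enumerate(groups) if g == label]
        pairs = sum(dm[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:])
        ss_within += pairs / len(idx)
    ss_among = ss_total - ss_within
    a = len(labels)
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(dm, groups, permutations=999, seed=0):
    """Return (pseudo-F, permutation p-value)."""
    rng = random.Random(seed)
    observed = pseudo_f(dm, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)  # break any link between labels and samples
        if pseudo_f(dm, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical toy counts: rows are samples, columns are three taxa.
counts = [
    [30, 5, 1], [28, 7, 2], [25, 6, 0], [32, 4, 3],      # infection site A
    [3, 20, 15], [5, 22, 12], [2, 18, 17], [4, 25, 10],  # infection site B
]
groups = ["A"] * 4 + ["B"] * 4

dm = distance_matrix(counts)
f_stat, p_value = permanova(dm, groups)
print(f"pseudo-F = {f_stat:.2f}, p = {p_value:.3f}")
```

Note that the result is a single statistic for the whole community; nothing in it points to a particular taxon, which is the reason for the univariate tools mentioned above.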
Another interesting approach, though I don’t think it is necessarily what you are looking for, nor is it currently available in qiime2, is to use a machine-learning decision-tree method such as random forest. In this scenario you train a model on a subset of your data and see whether it can assign an unknown sample to either the infectionA or the infectionB site. If the model classifies accurately, you can then identify the key bacteria that were important in deciding whether a sample belongs to infectionA or infectionB. A great tutorial on this method, if you fancy giving it a try, is available here. Though, as I mentioned, it sounds like you can get what you want with a simple PCoA + PERMANOVA test.

Let us know if that helps!

Edit: Actually, I just discovered that qiime2 does indeed have a type of supervised classification like the random forest in the link above: qiime sample-classifier classify-samples. Neat!


I agree, @Mehrbod_Estaki, PERMANOVA sounds like it would effectively answer the first question.

However, q2-sample-classifier would be a good way to answer questions 1 and 2 in a single command. It would also report how frequently samples from the two infection groups cross-classify, so unlike PERMANOVA + ANCOM it would give some information relevant to @Ruben’s specific question.
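To show what a cross-classification table looks like, here is a small sketch in plain Python. It deliberately uses a much simpler stand-in than q2-sample-classifier: a leave-one-out nearest-centroid classifier on relative abundances rather than a random forest. The function names and the toy counts are assumptions; only the idea of counting how often each true group is predicted as each group carries over.

```python
def relative_abundance(counts):
    """Scale one sample's counts so they sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def centroid(vectors):
    """Feature-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def leave_one_out_confusion(samples, labels):
    """Confusion counts as {(true label, predicted label): n}.

    Each sample is held out in turn, group centroids are computed from
    the remaining samples, and the held-out sample is assigned to the
    group with the nearest centroid.
    """
    profiles = [relative_abundance(s) for s in samples]
    classes = sorted(set(labels))
    confusion = {(t, p): 0 for t in classes for p in classes}
    for i, (profile, truth) in enumerate(zip(profiles, labels)):
        centroids = {}
        for c in classes:
            members = [p for j, (p, lab) in enumerate(zip(profiles, labels))
                       if lab == c and j != i]
            centroids[c] = centroid(members)
        predicted = min(classes, key=lambda c: squared_distance(profile, centroids[c]))
        confusion[(truth, predicted)] += 1
    return confusion


# Hypothetical toy counts; the last site-A sample resembles site B.
samples = [
    [30, 5, 1], [28, 7, 2], [25, 6, 0], [4, 20, 12],     # infection site A
    [3, 20, 15], [5, 22, 12], [2, 18, 17], [4, 25, 10],  # infection site B
]
labels = ["A"] * 4 + ["B"] * 4

confusion = leave_one_out_confusion(samples, labels)
for (truth, predicted), n in sorted(confusion.items()):
    print(f"true {truth} -> predicted {predicted}: {n}")
```

The off-diagonal entries (true A predicted as B, and the reverse) are the cross-classified samples; many of them would suggest the two infection sites have similar communities.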

@Ruben these methods are effectively going to show how different your sample groups are, and which taxa distinguish them. I would personally use all of the above:

  1. core-features (thanks @colinbrislawn!) to show which features are shared between groups. (A Venn diagram of shared features could also be relevant to your question, but QIIME2 does not have anything like that.)
  2. permanova to test whether groups’ microbiota are significantly different
  3. pcoa to visualize the similarity between samples
  4. sample-classifier to determine how well you can distinguish these groups, and which taxa best predict group membership (e.g., diagnostic features).

Thanks for your replies, @Mehrbod_Estaki and @Nicholas_Bokulich!

I didn’t know about the sample-classifier function, but maybe I’ll use it on my data!

We’ve had a discussion in our group about the Venn diagram. Say that some microbes are found in just one sample from each of the two communities; I’ll call them “lone OTUs”. In a Venn diagram the “lone OTUs” will be shown as possible core microbes, just like a microbe found in, let’s say, 20 out of 30 samples in each community. We would therefore argue for excluding microbes that are found in only 1 (or 2 or 3?) samples when calculating the Venn diagram.
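The effect described here can be sketched in a few lines of plain Python: build the Venn regions from the features seen in each group, once with no filter and once requiring a feature to appear in at least two samples of a group. The table layout, function names, and toy OTU labels are assumptions made for this example. In QIIME 2 itself, a comparable prevalence filter exists as the --p-min-samples option of qiime feature-table filter-features.

```python
def features_in_group(table, min_samples=1):
    """Features seen in at least `min_samples` samples of one group.

    `table` maps sample ID -> {feature ID: count} (assumed layout).
    """
    seen = {}
    for counts in table.values():
        for feature, count in counts.items():
            if count > 0:
                seen[feature] = seen.get(feature, 0) + 1
    return {f for f, n in seen.items() if n >= min_samples}


def venn_regions(group_a, group_b, min_samples=1):
    """The three regions of a two-set Venn diagram."""
    a = features_in_group(group_a, min_samples)
    b = features_in_group(group_b, min_samples)
    return {"only_a": a - b, "shared": a & b, "only_b": b - a}


# Hypothetical toy data: OTU_lone occurs in one sample of each community.
community_1 = {
    "S1": {"OTU_1": 12, "OTU_lone": 1},
    "S2": {"OTU_1": 9, "OTU_2": 4},
    "S3": {"OTU_1": 15, "OTU_2": 2},
}
community_2 = {
    "T1": {"OTU_1": 7},
    "T2": {"OTU_1": 11, "OTU_lone": 2},
    "T3": {"OTU_1": 5, "OTU_3": 6},
}

print("no filter:     ", sorted(venn_regions(community_1, community_2)["shared"]))
print("min 2 samples: ", sorted(venn_regions(community_1, community_2, 2)["shared"]))
```

Without a filter the lone OTU sits in the shared region next to OTU_1, even though it was seen once per community; with the two-sample minimum it drops out.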

Is this thinking transferable to UniFrac and PERMANOVA analysis? Or will removing the “lone” OTUs make the analysis invalid? I’d guess that weighted UniFrac, and a PERMANOVA on weighted UniFrac distances, will reduce the impact of microbes found in just 1 or 2 samples. But in an unweighted UniFrac analysis, is it possible that these “lone OTUs” are given too much weight?

Thanks for your opinion on this matter!



Yes, I believe that would be the standard goal of a Venn diagram: you only want to show the “core” microbiota present in all samples in that group. You could use something like a network plot if you really want to dissect these relationships. Such methods are not yet implemented in QIIME2, so you would need to use qiime1 or other external software.

No, it does not. You should not remove any of these OTUs prior to using these methods, unless you have a good reason for excluding them (e.g., if low-abundance or “lone” OTUs are suspected to be contaminants).

A “lone” OTU would carry that extra weight only for the individual sample it occurs in. This may make that sample look more or less like its own group or other groups, depending on the phylogenetic placement of that OTU. So removing these OTUs will have unpredictable effects.

That said, you could always remove those OTUs and see what happens — I do not think that it would “invalidate” the results, but it would be an unconventional analysis and you would need to very carefully interpret the effects by comparing these results to the “normal” analysis.
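A comparison of that kind could be sketched as follows in plain Python. Because UniFrac needs a phylogenetic tree, this example swaps in Jaccard (presence/absence) and Bray-Curtis (abundance-based) as non-phylogenetic stand-ins for unweighted and weighted UniFrac; that substitution, the function names, and the toy counts are all assumptions for illustration.

```python
def jaccard(u, v):
    """Presence/absence (unweighted) dissimilarity between two samples."""
    present_u = {i for i, c in enumerate(u) if c > 0}
    present_v = {i for i, c in enumerate(v) if c > 0}
    union = present_u | present_v
    if not union:
        return 0.0
    return 1 - len(present_u & present_v) / len(union)


def bray_curtis(u, v):
    """Abundance-based (weighted) dissimilarity between two samples."""
    total = sum(u) + sum(v)
    return sum(abs(a - b) for a, b in zip(u, v)) / total if total else 0.0


def drop_rare_features(samples, min_samples):
    """Remove feature columns seen in fewer than `min_samples` samples."""
    n_features = len(samples[0])
    keep = [k for k in range(n_features)
            if sum(1 for s in samples if s[k] > 0) >= min_samples]
    return [[s[k] for k in keep] for s in samples]


# Hypothetical toy counts: the last two columns are "lone" features,
# each present in a single sample at very low abundance.
samples = [
    [50, 30, 1, 0],
    [48, 32, 0, 0],
    [45, 35, 0, 1],
]
filtered = drop_rare_features(samples, min_samples=2)

# Compare the first two samples before and after removing lone features.
for name, metric in (("Jaccard", jaccard), ("Bray-Curtis", bray_curtis)):
    before = metric(samples[0], samples[1])
    after = metric(filtered[0], filtered[1])
    print(f"{name}: before = {before:.3f}, after = {after:.3f}")
```

In this toy case the presence/absence distance changes a lot when the lone feature is removed while the abundance-based distance barely moves, which is the kind of difference you would want to report if you run both the filtered and the unfiltered analysis.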

I hope that helps!

