Why don't people use PCA or PCoA for analysing microbial co-occurrence

markgonzalo · January 17, 2024, 11:29am

I think it is reasonable to use PCA or PCoA to analyse microbial co-occurrence, where the abundances across different sampling sites are used as the features. But I see no one doing so. Can someone tell me why that is?

jwdebelius · January 17, 2024, 3:54pm

Hi @markgonzalo,

Welcome to the :qiime2: forum!

I want to break this down into a couple of issues!

So, first the title:

They sort of do! I'm one of them (see figure 3). I think this factor-type analysis is less prevalent for a couple of reasons I'll get into below. But, people do it.

QIIME even has a plugin, qurro, to help you construct factor ratios (ALRs) from gemelli rPCAs and CTF data.

But, I think there are some other issues in the proposition here!

Terminology

I think some of the question here is the definition of co-occurrence. When I think of co-occurrence, I think of constructing a network to look at the relationships between the features. I think of how features cluster in an ordination space as more of a factor analysis. So, this might just be a terminology question! Because of my background and associations, I'm going to call the analysis a "factor analysis" rather than co-occurrence, if that's okay.

Ordination type

It's reasonable to use PCA because PCA has the ability to map features and samples directly into the same ordination space. (I recommend the rPCA described above.) PCA takes the feature table as an input, and so it's able to map the features.
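To make the "same ordination space" point concrete, here is a minimal sketch, assuming a plain SVD-based PCA on CLR-transformed counts in NumPy. This is not the rPCA from gemelli, and the toy table, the pseudocount of 1, and the function names are all made up for illustration.

```python
import numpy as np

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of a samples x features count table.

    The pseudocount is an illustrative assumption to avoid log(0).
    """
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)

def pca_biplot_coords(table, n_components=2):
    """Return (sample_scores, feature_loadings) on the same PCA axes."""
    centered = table - table.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    sample_scores = u[:, :n_components] * s[:n_components]  # one row per sample
    feature_loadings = vt[:n_components].T                  # one row per feature
    return sample_scores, feature_loadings

# Hypothetical toy table: 4 samples (rows) x 3 features (columns).
counts = np.array([
    [120, 30, 5],
    [100, 45, 8],
    [10, 5, 90],
    [15, 2, 110],
])

scores, loadings = pca_biplot_coords(clr(counts))
print(scores.shape, loadings.shape)
```

Because the loadings come out of the same decomposition as the sample scores, each feature gets coordinates on the same axes as the samples, which is what lets you draw them together.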

PCoA is an ordination based on distances. The feature information is lost in that distance calculation, and so it's much harder to map back. You could do something like construct a biplot, but I don't know that it's the same.
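A rough sketch of why the features drop out, assuming Bray-Curtis distances and a classical (eigendecomposition-based) PCoA written directly in NumPy. The toy table and function names are hypothetical, and this is not how any particular plugin implements it.

```python
import numpy as np

def bray_curtis(table):
    """Pairwise Bray-Curtis distances between the rows (samples) of a table."""
    table = np.asarray(table, dtype=float)
    n = table.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dist[i, j] = np.abs(table[i] - table[j]).sum() / (table[i] + table[j]).sum()
    return dist

def pcoa(dist, n_components=2):
    """Classical PCoA: note that the only input is the sample x sample matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Non-Euclidean distances can give small negative eigenvalues; clip them.
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))

# Hypothetical toy table: 4 samples (rows) x 3 features (columns).
counts = np.array([
    [120, 30, 5],
    [100, 45, 8],
    [10, 5, 90],
    [15, 2, 110],
])

dist = bray_curtis(counts)  # 4 x 4: the feature axis is already gone here
coords = pcoa(dist)         # sample coordinates only
print(dist.shape, coords.shape)
```

By the time `pcoa` runs, all it sees is a samples-by-samples matrix, so there are no feature loadings to read off the way there are in PCA.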

Personally, if I want features and samples in the same ordination, I stick to PCA because that's how it's set up.

Compositionality

You know how microbiome datasets are compositional and this is generally not optional? One of the solutions we have to use for compositionality is a log ratio. This means that our factor loadings are often log-ratios (frequently additive log ratios) instead of something a little simpler.
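For a feel of what an additive log ratio looks like, here is a small sketch in plain Python. The three-taxon sample and the choice of reference taxon are invented for illustration, and it assumes every count is positive (real tables need some handling of zeros).

```python
import math

def alr(sample, reference_index):
    """Additive log ratio: log of each part over a chosen reference part.

    Assumes all values are strictly positive.
    """
    ref = sample[reference_index]
    return [math.log(x / ref) for i, x in enumerate(sample) if i != reference_index]

# Hypothetical sample with three taxa; the last one is used as the reference.
counts = [120, 30, 5]
relative = [c / sum(counts) for c in counts]  # same sample as proportions

alr_counts = alr(counts, reference_index=2)
alr_relative = alr(relative, reference_index=2)
print(alr_counts)
print(alr_relative)
```

The two results match because the total cancels in each ratio, which is why log ratios are a natural fit for compositional data: they do not depend on sequencing depth.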

In my experience, lots of people dislike polymicrobial ALRs because they find the idea of the log ratio a little hard to work with. I don't think that's actually a major limitation, but it's a caveat to be aware of.

Everyone wants the one microbe :microbe: to rule them all :ring:

I think there are a number of cultural reasons that most people don't like polymicrobial solutions. So, I think there's a reticence in this area to do factor loading or other multi-microbe statistics. Maybe it's the legacy of Koch's postulates, maybe it's something else.

Again, you can totally publish them. It just won't satisfy everything.

Best,
Justine