Gemelli distance matrix vs. subject trajectory

kam · August 8, 2023, 1:06pm

Dear community,

I wanted to express my gratitude for the valuable insights provided by @cmartino, the main author, in response to my previous inquiries about Gemelli. While his assistance has significantly enhanced my understanding of this powerful tool, I must admit that I'm still grappling with certain aspects, especially in light of a different paper (Applications and Comparison of Dimensionality Reduction Methods for Microbiome Data - PMC) authored by some of the same individuals.

Specifically, my confusion revolves around comprehending the distinctions and applications of the distance_matrix and state_subject_ordination artifacts. In the benchmark outlined in the CTF paper (Code Ocean), section 2.0.0-ECAM - 2.0 ordination, it appears that standard PCoA dimension reduction was employed on the traditional beta-diversity dissimilarity matrix to elucidate variations along Axis 1. Conversely, the state subject ordination was used to visualize Axis 1 differences in the subject biplot. Subsequently, in section 2.1, the distance matrix was loaded to facilitate PERMANOVA computation, if my interpretation is accurate.

My queries revolve around the following points:

1. Can the distance matrix artifact be used like distance matrices derived from traditional dissimilarity metrics, both for visualization through PCoA (to visualize each sample) and for PERMANOVA? Or is its primary use limited to PERMANOVA analysis?
2. Apart from visualizing the PC1 trajectory, what are the statistical applications of the state subject artifacts?
3. Is the proportion of variance explained by PC1 (and PC2) in the state subject ordination artifacts derived from the subject biplot?

My intent is to gain a clearer perspective on how to proceed. Typically, a microbiome workflow involves generating a distance matrix, followed by PCoA visualization and subsequent statistical analysis. However, in this context, I'm unsure if the same approach should be taken with the distance matrix, or if the focus should solely be on using the biplot to visualize individual subjects.

Thank you so much!

gregcaporaso · August 8, 2023, 9:50pm

@cmartino, could you help with this question?

cmartino · August 9, 2023, 2:35pm

Hi @kam,

Thanks for reaching out. RPCA and CTF do not produce their distance matrices the same way traditional beta-diversity methods do. The distance matrix is calculated from the ordination itself, specifically as the pairwise Euclidean distances between the rows of U (the sample/subject/time loadings). Since the data are transformed with the robust centered log-ratio transform before that, we call it the Aitchison distance rather than Euclidean. With that background, here are direct answers to your three questions above.

1. No, I would not suggest running PCoA on the distance matrix. As you guessed, we really only produce it to assist in statistical evaluations like PERMANOVA.
2. You can also view PC2 through PCN (depending on the number of components you specified in the command). Beyond that, its purpose is to give you a view of the differences in dynamics across groups that contribute the most variance to the dataset in question.
3. Yes, that is correct. You may also want to read this previous answer about interpreting those proportions explained. The question there was about RPCA, but the answer also applies to CTF.
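
To make the point about the distance matrix concrete, here is a minimal sketch of pairwise Euclidean distances between the rows of a loading matrix. It is only an illustration, not the gemelli source: the matrix `U` below is filled with made-up numbers, and any scaling of the loadings that the tool may apply before this step is not shown.

```python
import numpy as np

# Hypothetical loadings U: 5 subjects (rows) x 3 components (columns).
# These are random placeholder values, not output from CTF or RPCA.
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))

# Pairwise Euclidean distance between every pair of rows of U.
diff = U[:, None, :] - U[None, :, :]      # shape (5, 5, 3)
dist = np.sqrt((diff ** 2).sum(axis=-1))  # shape (5, 5)

# The result is a symmetric matrix with zeros on the diagonal.
print(dist.round(3))
```

A matrix like `dist` is what a PERMANOVA would take as input, together with a grouping variable for the rows.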

Even though the steps occur in a different order, your analysis and interpretation workflow can be the same. CTF just outputs everything you need at once. I hope that helps.

Thanks,

Cameron

kam · August 10, 2023, 6:11pm

Thanks!

One final question regarding this point (subject trajectory):

In addition to visualizing the differences between the groups of interest, would it be suitable to use PC1 for statistical tests of these differences? For instance, could we use the Mann-Whitney test to assess differences between groups at each time point, or include the PC1 values in an LME model?

cmartino · August 10, 2023, 9:45pm

Hi @kam,

Good question. I don't know definitively whether testing on PC values is valid. I have seen some people do it, which does not necessarily make it right. It is much better, in my opinion, to use the feature loadings along the PC where you see the separation by phenotype to choose a log-ratio. The log-ratio can definitely be used in more traditional statistical evaluations like LME, and it should also help validate that the trends seen in the dimensionality reduction can be recapitulated with the data. This is done in the tutorial using Qurro.
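
As a rough sketch of that idea (not the tutorial's code or Qurro's implementation), the snippet below ranks features by a PC1 loading, takes the top and bottom `k` as numerator and denominator, and computes one log-ratio per sample. The count table, the loadings, and `k` are all made-up values chosen for illustration.

```python
import numpy as np

# Hypothetical count table: 4 samples (rows) x 4 features (columns).
counts = np.array([
    [10,  5,  1,  2],
    [20, 10,  2,  1],
    [ 1,  2, 10, 20],
    [ 2,  1,  5, 10],
])

# Hypothetical PC1 feature loadings, one value per feature.
pc1_loadings = np.array([0.8, 0.5, -0.4, -0.9])

k = 2                                # features per side (assumed choice)
order = np.argsort(pc1_loadings)     # ascending by loading
denominator = order[:k]              # most negative loadings
numerator = order[-k:]               # most positive loadings

# Log-ratio of summed counts per sample. Samples with a zero sum on
# either side would need to be dropped first; none occur in this toy table.
num_sum = counts[:, numerator].sum(axis=1)
den_sum = counts[:, denominator].sum(axis=1)
log_ratio = np.log(num_sum) - np.log(den_sum)

print(log_ratio.round(3))
```

The per-sample `log_ratio` values could then go into a conventional test, for example `scipy.stats.mannwhitneyu` at a single time point or a mixed model in statsmodels across time points.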

Thanks again for using CTF!

Cameron