After reading the great paper in Nature Biotech (with my limited math understanding), I have some questions about interpreting a PERMANOVA test based on a CTF distance matrix. I'm specifically interested in testing whether the microbiome has changed in two groups (treatment and placebo) over three time points (0, 1, 2).
Typically, without ctf, I would use adonis with a group*time interaction to test whether there have been changes between groups over time, or I could test whether there have been changes within each group over time. However, I'm unsure if I should use the same approach when using ctf. Since my design assumes no differences at baseline, unlike the Crohn's tutorial, would it be more appropriate to use all samples from all time points and test whether those samples differ from each other given the subject vector?
In other words, if I run a PERMANOVA on a gemelli distance matrix of samples from three time points, would it be appropriate to conclude that the test output reflects the difference between treatment and placebo, "taking into account" the influence of time?
Would love to get some thoughts on this potentially exciting and useful tool.
CTF does indeed take account of the repeated measures but it is not truly a time series tool since time order does not matter. I don't believe that adonis/PERMANOVA currently takes the structure of repeated measures into account (assumes all samples are independent).
So there are two ways I do this kind of test. (a) PERMANOVA at each time point between subjects by phenotype group. From this, you can basically infer the same info as the interaction term (i.e., are all time points significant or just certain ones) but you are missing some of the benefits of a true LME type of test. (b) You can use the q2-longitudinal plugin's distance from baseline with the LME model on those distances.
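To make option (b) concrete, here is a minimal sketch of the "distance from baseline" idea in plain Python. It is not the q2-longitudinal implementation; the function name, the sample ids, and the toy distances are all hypothetical, and the distance lookup is assumed to be a simple dictionary keyed by sample pairs. The resulting per-subject distances are what would then go into an LME model as the response.

```python
def distance_from_baseline(dist, metadata, baseline=0):
    """Return {(subject, time): distance to that subject's baseline sample}.

    dist: dict mapping frozenset({sample_a, sample_b}) -> distance
    metadata: dict mapping sample id -> (subject id, time point)
    """
    # Locate each subject's baseline sample.
    baseline_sample = {
        subject: sample
        for sample, (subject, time) in metadata.items()
        if time == baseline
    }
    result = {}
    for sample, (subject, time) in metadata.items():
        # Skip the baseline sample itself and subjects without a baseline.
        if time == baseline or subject not in baseline_sample:
            continue
        pair = frozenset((sample, baseline_sample[subject]))
        result[(subject, time)] = dist[pair]
    return result


# Toy data (hypothetical sample ids and distances, for illustration only).
metadata = {
    "s1_t0": ("s1", 0), "s1_t1": ("s1", 1), "s1_t2": ("s1", 2),
    "s2_t0": ("s2", 0), "s2_t1": ("s2", 1), "s2_t2": ("s2", 2),
}
dist = {
    frozenset(("s1_t0", "s1_t1")): 0.2,
    frozenset(("s1_t0", "s1_t2")): 0.5,
    frozenset(("s2_t0", "s2_t1")): 0.1,
    frozenset(("s2_t0", "s2_t2")): 0.3,
}
baseline_distances = distance_from_baseline(dist, metadata)
```

Each subject contributes one value per follow-up time point, so the repeated-measures structure is carried into the downstream model through the subject id rather than being ignored.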
I believe there may be a PERMANOVA/adonis version out there that can take into account repeated measures (see here). But I personally do not have much experience with it.
Do you mean that after performing CTF, I could perform PERMANOVA at each time point between subjects by phenotype group using the distance matrix obtained from CTF? Or, do you mean that I can use a non-tensor method (e.g., RPCA) at each time point (cross-sectional) to infer conclusions?
In the Nature Biotech paper, you compared CTF F statistics from PERMANOVA against those of other methods. As I understand it, for all methods you used samples from all time points and tested whether phenotypes differ from each other "regardless" of time, without taking the time interaction effect into account. Is that a fair conclusion? I think this preprint (https://www.biorxiv.org/content/10.1101/2023.01.13.523991v1.full, Fig. 1) has also used the same approach.
Although this approach (if I interpreted correctly) draws different conclusions from time interaction, it is still very valuable.
Sorry for my many questions, but this is really a great tool and I am happy for the opportunity to hear the author's opinion on this subject.
Do you mean that after performing CTF, I could perform PERMANOVA at each time point between subjects by phenotype group using the distance matrix obtained from CTF? Or, do you mean that I can use a non-tensor method (e.g., RPCA) at each time point (cross-sectional) to infer conclusions?
After CTF, separate the distance matrix by time point and then run PERMANOVA on each one by phenotype to see if it is significant at that time point. CTF would be preferable to RPCA which would assume all the samples are independent (which they aren't in repeated measures).
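As a rough illustration of that workflow, the sketch below splits a set of pairwise distances by time point and computes a one-factor PERMANOVA pseudo-F with a label-permutation p-value at each one. It is written in plain Python so the steps are visible; for real analyses an established implementation (for example scikit-bio or vegan) is the better choice. The function names, sample ids, and toy coordinates are hypothetical, and the toy distances are just absolute differences on a line, not CTF output.

```python
import random
from itertools import combinations


def pseudo_f(dist, samples, groups):
    """One-factor PERMANOVA pseudo-F computed from pairwise distances."""
    n = len(samples)
    labels = sorted(set(groups[s] for s in samples))
    a = len(labels)
    # Total sum of squares from all pairwise squared distances.
    ss_total = sum(dist[frozenset(p)] ** 2 for p in combinations(samples, 2)) / n
    # Within-group sum of squares, one term per group.
    ss_within = 0.0
    for label in labels:
        members = [s for s in samples if groups[s] == label]
        pairs = combinations(members, 2)
        ss_within += sum(dist[frozenset(p)] ** 2 for p in pairs) / len(members)
    ss_between = ss_total - ss_within
    return (ss_between / (a - 1)) / (ss_within / (n - a))


def permanova(dist, samples, groups, permutations=999, seed=0):
    """Return (pseudo-F, permutation p-value) for one set of samples."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, samples, groups)
    labels = [groups[s] for s in samples]
    hits = 0
    for _ in range(permutations):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if pseudo_f(dist, samples, dict(zip(samples, shuffled))) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


def permanova_by_time(dist, metadata, permutations=999, seed=0):
    """metadata maps sample id -> (group, time); test groups within each time."""
    results = {}
    for time in sorted({t for _, t in metadata.values()}):
        samples = [s for s, (_, t) in metadata.items() if t == time]
        groups = {s: metadata[s][0] for s in samples}
        results[time] = permanova(dist, samples, groups, permutations, seed)
    return results


# Toy data: no group separation at time 0, clear separation at time 1.
coords = {
    ("treatment", 0): [0.0, 0.1, 0.2],
    ("placebo", 0): [0.05, 0.15, 0.25],
    ("treatment", 1): [0.0, 0.1, 0.2],
    ("placebo", 1): [1.0, 1.1, 1.2],
}
metadata, position = {}, {}
for (group, time), values in coords.items():
    for i, value in enumerate(values):
        sample = f"{group}_{i}_t{time}"
        metadata[sample] = (group, time)
        position[sample] = value
dist = {
    frozenset((a, b)): abs(position[a] - position[b])
    for a, b in combinations(position, 2)
}
results = permanova_by_time(dist, metadata, permutations=199)
```

Because each time point contains one sample per subject, the samples within a single test are independent of each other, which is why the plain permutation of labels is reasonable here. Running one test per time point does mean several p-values, so some multiple-testing correction is worth considering.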
In the Nature Biotech paper, you compared CTF F statistics from PERMANOVA against those of other methods. As I understand it, for all methods you used samples from all time points and tested whether phenotypes differ from each other "regardless" of time, without taking the time interaction effect into account. Is that a fair conclusion? I think this preprint (https://www.biorxiv.org/content/10.1101/2023.01.13.523991v1.full, Fig. 1) has also used the same approach.
Indeed, that way you can test whether there is significance across all time points without accounting for the subject. But in the benchmarking for the original paper, we separated the distances by time point to see the difference at each one. For example, in our CodeOcean here in 2.0.0-ecam/2.1-PERMANOVA.ipynb (cell #3) we split the distances by month and tested for birth mode differences at each month.
Although this approach (if I interpreted correctly) draws different conclusions from time interaction, it is still very valuable.
Correct. Interpretation is key. Testing the beta-div significance is an important step because it will basically tell you if there are any microbe log-ratios that separate by your phenotype of interest. At the end of the day, the final test is to use the feature rankings along an axis of subject separation to choose log-ratios, which can go directly into an LME model as the response variable (see here for example).
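For anyone who wants to see the log-ratio step spelled out, here is a minimal sketch in plain Python. It assumes you already have feature loadings along the PC axis of interest and a table of raw counts; the function names, taxon names, loadings, and counts are hypothetical and only for illustration. Samples where either the numerator or the denominator sums to zero are dropped, since the log-ratio is undefined there.

```python
import math


def pick_features(loadings, n):
    """Return (numerator, denominator): the n highest and n lowest ranked features."""
    ranked = sorted(loadings, key=loadings.get)
    return ranked[-n:], ranked[:n]


def log_ratio(counts, numerator, denominator):
    """Per-sample log(sum of numerator counts / sum of denominator counts)."""
    ratios = {}
    for sample, features in counts.items():
        num = sum(features.get(f, 0) for f in numerator)
        den = sum(features.get(f, 0) for f in denominator)
        # The log-ratio is undefined when either side is zero, so skip the sample.
        if num > 0 and den > 0:
            ratios[sample] = math.log(num / den)
    return ratios


# Toy loadings along one PC axis and toy counts (hypothetical values).
loadings = {"taxon_a": 0.9, "taxon_b": 0.4, "taxon_c": -0.3, "taxon_d": -0.8}
counts = {
    "sample_1": {"taxon_a": 40, "taxon_b": 10, "taxon_c": 5, "taxon_d": 5},
    "sample_2": {"taxon_a": 5, "taxon_b": 5, "taxon_c": 20, "taxon_d": 30},
    "sample_3": {"taxon_a": 12, "taxon_b": 3, "taxon_c": 0, "taxon_d": 0},
}
numerator, denominator = pick_features(loadings, n=2)
ratios = log_ratio(counts, numerator, denominator)
```

Summing more features on each side reduces the number of samples dropped for zeros, at the cost of a less specific ratio. The per-sample values in `ratios` are what would be used as the response variable in the LME model.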
Don't hesitate to ask more questions if you have them, thanks for using the tool!
When I compare phenotype group by axis using the output distance matrix, I get some separation by phenotype in PC1. However, when I use the SampleTrajectory output, the separation is along PC2, which has almost the same numbers as PC1 from the distance matrix. My question is, if I want to use feature rankings from the FeatureTrajectory output to choose log ratios, should I choose PC1 or PC2?
Also, for plotting trajectories (similar to https://www.sciencedirect.com/science/article/pii/S2666634021002038), would you recommend using the axis values of the distance matrix or of the SampleTrajectory? Is there a way to extract the proportion explained by each axis from the SampleTrajectory?
Great question! There are two types of outputs from CTF that you are describing and could use for the log-ratios:
Subject level subject_biplot.qza (this is what is used in the tutorial)
This is a biplot where the samples (dots) in the plot are subjects and the separation is based on how they differ in their microbial dynamics across time (or whatever the repeated measure is; it could also be space). The arrows in this biplot are the features (microbes in this case) that best separate the subjects. Since this is a compositional biplot, you want the ratio of arrows pointing in different directions.
For most purposes, there are a few groups of subjects based on phenotype that will separate best by some PC axis in the subject biplot. It is that axis that you want to use in Qurro and for ranking features to choose in log-ratios.
Sometimes the time/space component is more complicated, like in infant development, so you may want to really dig into the dynamics. The state_subject_ordination can show you the trajectory over time across each PC (you can view this in a Q2 volatility plot). Moreover, the ratio of features that best separates the groups across time may not be static due to microbial successions over time. In this case, find the PC that separates your subjects by the given phenotype of interest in the volatility plot. Then the state_feature_ordination will give a ranking of all the features along each PC axis at each time point (while still taking into account the repeated measures), which can be used to choose your log-ratios. We used this kind of approach in the infant seeding paper you mentioned and in the ECAM dataset in the original paper. You can see this done in both the ECAM and DIABIMMUNE CodeOcean notebooks in 2.0-ordination-plotting.ipynb in the fourth cell (we also look at how many features we need to sum in the numerator and denominator so we don't lose samples). To do this you may want the .qza as a CSV/TSV file, which you can export like so with the Python API:
import qiime2 as q2
import pandas as pd
# Load each trajectory artifact as a DataFrame and write it out as CSV
feature_trajectory = q2.Artifact.load('path/to/state_feature_ordination.qza').view(pd.DataFrame)
feature_trajectory.to_csv('path/to/state_feature_ordination.csv')
subject_trajectory = q2.Artifact.load('path/to/state_subject_ordination.qza').view(pd.DataFrame)
subject_trajectory.to_csv('path/to/state_subject_ordination.csv')
I would like to add another question, if that's possible.
We said one could run PERMANOVA on the CTF distance matrix to test whether there is a difference between groups across all samples (e.g., across all time points). Would it be necessary to include "time" as a variable in the adonis function, or is its influence already taken into account during CTF?