Hi! I am running QIIME2-2023.2 and am looking to analyze paired data. For example, let's say I have 16S data obtained from fecal and intestinal samples from 10 mice at a given timepoint. I cannot use PERMANOVA as the data is not independent.
I looked into doing pairwise-distances based on the below post, but I am running into the same problem where I do not have a group column to compare them.
I assume pairwise-differences would also not be ideal since it compares whether the value of a specific metric changed significantly between pairs of paired samples. What would be the best way to go about this kind of analysis? The goal is to determine if there is a correlation between intestinal and fecal microbiomes within each subject as well as globally (Across all samples).
What might make sense is for you to use mantel here, as I think you were getting at here.
It sounds like you have samples from the same subjects at multiple body sites. If you want to know if the distances between subjects are correlated across body sites (i.e., if two individuals differ in their fecal microbiome, do they also differ in their intestinal microbiome), mantel will get you that information.
Use qiime feature-table group to group samples in each of the "body site feature tables" (i.e., those resulting from step 1) by subject. In effect, this just renames the samples in each feature table to the subject identifiers.
Run qiime diversity beta on each body site feature table to generate a distance matrix between the samples in that table.
@gregcaporaso: I have a follow-up question. Let's say I am looking at three body sites i.e. tumor, intestines, feces. If I skip step 2 in the above steps you mentioned, would the mantel test give me group-wise correlation? I am interested in finding out combined trends/correlation across body-sites.
Secondly, what should be the preferred beta-diversity metric for these? Jaccard/Bray Curtis/Correlation?
I ran a Mantel Test using Jaccard Distance Matrices for intestines and feces and got the following results:
Hi @macrobiome, The group step is how you're able to map the paired samples across body sites so you won't be able to use mantel if you skip that step. I'm not sure what you mean by group-wise correlation, but this workflow should be giving you correlation across body sites, accounting for the fact that your samples are not independent of one another (i.e., you have multiple samples from the same individuals). If this isn't what you're looking for, can you clarify what you mean by group-wise correlation?
I would recommend starting with unweighted unifrac, if possible. Bray-Curtis and/or weighted unifrac would be good next options. Jaccard doesn't always work great, so it's possible that you'll see a significant correlation with one of the other metrics, even though you don't see one with Jaccard.
Hi @gregcaporaso, thank you! Correlation between body sites is what I meant.
I did not see unweighted/weighted unifrac as options in any of the diversity pipelines. Is creating body site specific metadata tables and rep-seqs and then running qiime phylogeny and qiime diversity core-metrics-phylogenetic the fastest way to get the unifrac distance matrices?
I ran bray curtis through qiime diversity beta and qiime diversity mantel. I do not see any correlation:
Yes, the easiest way to use unifrac is qiime diversity core-metrics-phylogenetic.
I faced a roadblock in this as I needed to generate rep-seqs for each body site but I used the grouped table as input to do so. This subsequently led to problems down the road while running core-metrics-phylogenetic as it was not recognizing the subject column so I had to replace all the sample IDs with the respective subjects to run core-metrics-phylogenetic (Hopefully this was the right thing to do).
Here are the results, however the p values are still greater than 0.05. Does a high p value mean there is no correlation between the two sites after factoring in individual subjects? Based on my (limited) understanding, the null hypothesis in Mantel Test is that distance matrix 1 is independent of distance matrix 2, therefore in this case, null hypothesis is not rejected and the distance matrices are independent. (Side Question: What if p value was less than 0.05? Does that mean there is some dependence (correlation) between distance matrices?)
You shouldn't need to generate rep-seqs on a per-body-site basis - rather you use the rep-seqs from the whole study to build your tree, and then use that tree for each of the body-site-specific analyses. That should be a better approach than per-body-site trees, as you'll know that differences in the tree aren't impacting the diversity metric. So you should be able to do that, and then the split/group workflow I described above.
Your interpretation of the p-values is right here. This can effectively be interpreted as any correlation test result (e.g., Pearson or Spearman).
In your first figure above (for unweighted UniFrac) you seem to be getting close, suggesting that if the sample size were larger you might see a correlation. I'm not sure offhand why the sample size would be 6 in that table - I would expect it to match the number of points (15, by my count). Is that the right table for that plot? If you don't mind, would you share the .qzv file that you're looking at so I can take a peek?
Thanks for your reply. In the results I shared above, I generated separated rep-seqs on a per-body-site basis because I needed to run core-metrics-phylogenetic (To get the unifrac distance matrices) which needed a rooted-tree.qza file. Would I be able to just use a body-site specific metadata, run core-metrics-phylogenetic to get body-site specific rooted-tree files? I thought it might give me an error saying that the rep-seqs and metadata don't align.
And for the 6 samples, I think there was an issue in my metadata when I generated body specific rep-seqs and it skipped one of the samples, I can fix that and reshare my results based on your thoughts on the above question.
Apologies for the slow reply, I lost track of this discussion. My mistake!
What I would recommend doing is generating your feature table, rep seqs, and phylogenetic tree for all of the data combined, and then splitting the feature table by body site for your core-metrics-phylogenetic runs. For each of the per-body-site runs, there will be extra feature ids in the tree, but that's ok - QIIME 2 allows for extra ids in trees, metadata, and representative sequences with respect to what shows up in your feature table. (The opposite - feature ids that show up in the table but not the tree - is not allowed.)
The reason I recommend doing it this way, rather than creating per-body-site trees, is that this approach removes variability across trees as a variable that can impact results. Because they all use the same tree, that is not a variable in your analysis.