pairwise differences on Axis 3

nathaniel_hubert · March 22, 2023, 5:25pm

Hello,
I have samples of two types (A/B), before and after a treatment (pre/post).

(There is no reason to hypothesize that A would have changed more than B, or in a different direction; both were subject to the same treatment and likely had different starting communities.)

I ran pairwise-differences on my PCoA axes to see if there were any predictable directional differences in pre/post. Axis 3 was significant.

I generated a biplot and there are clues as to which taxa may be driving this, but as I found in the forum, the most important taxa are determined by their vector magnitude on PC1.

I ran ANCOM-BC with pre/post and A/B as interactive and cumulative independent variables and have a list of significant taxa. I can run pairwise differences on the abundances or relative abundances of these taxa and generate boxplots. (It would probably be better to run ANCOM-BC2 in R, where a paired analysis is possible. I may try this, but it is intimidating for a beginner.)

I was hoping there may be a more straightforward way to find which taxa are driving the difference between pre/post, and possibly contributing to interactions between pre/post and A/B.

I was thinking it may be a good idea to run Spearman on Axis 3 vs taxonomic (relative) abundances. Is there a good way to do that in QIIME?

Is there a way to generate a biplot (and list of important taxa) using Axis 3 as the axis of importance?

Do you have any other suggestions for how to find the significantly important taxa and visualize their impact? The sample classifier heatmaps are interesting and may be of use, but I would imagine RandomForest methods aren't as robust as ANCOM-BC and pairwise differences.

*Edit:
I also tried longitudinal feature-volatility, and only one taxon was in the important features result, with importance = 1. When I set feature-count to 10, the result was a different single taxon, again with importance = 1. Not sure if I am doing something wrong here:

qiime longitudinal feature-volatility \
  --i-table filtered_table.qza \
  --m-metadata-file QIIME_map.txt \
  --p-state-column pre_post \
  --p-individual-id-column pair \
  --p-feature-count 'all' \
  --output-dir volatility_032223

Any suggestions are greatly appreciated! Thank you, Nate

jwdebelius · March 23, 2023, 12:12am

Hi @nathaniel_hubert,

I think rather than the biplot and correlation, you might want to look into either compositional tensor factorization (CTF) in Gemelli or rPCA in DEICODE. The advantage of these over traditional metrics is that the features are embedded in the ordination, so you can use them to figure out which side of the ordination features are associated with... or even to build ALRs if you're into amalgamated microbial statistics. I happen to really like amalgamated values.
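To make the log-ratio idea a bit more concrete, here is a minimal sketch, not output from Gemelli or DEICODE. It assumes you already have one loading per feature on the ordination axis of interest and raw counts per sample. The loadings, counts, taxon names, and the helper names `split_by_loading` and `amalgamated_log_ratio` are all hypothetical.

```python
import math

def split_by_loading(loadings, n):
    """Return the n features with the highest and the n with the lowest loading."""
    ranked = sorted(loadings, key=loadings.get)  # ascending by loading
    return ranked[-n:], ranked[:n]

def amalgamated_log_ratio(sample_counts, numerator, denominator):
    """ln(sum of numerator counts / sum of denominator counts) for one sample.

    Returns None when either sum is zero, because the log ratio is undefined.
    """
    top = sum(sample_counts.get(f, 0) for f in numerator)
    bottom = sum(sample_counts.get(f, 0) for f in denominator)
    if top == 0 or bottom == 0:
        return None
    return math.log(top / bottom)

# Hypothetical feature loadings on one ordination axis (made-up numbers).
loadings = {"taxon_a": 0.8, "taxon_b": 0.5, "taxon_c": -0.1,
            "taxon_d": -0.6, "taxon_e": -0.9}

# Hypothetical raw counts for two samples from the same individual.
counts = {
    "subject1_pre": {"taxon_a": 10, "taxon_b": 30, "taxon_d": 40, "taxon_e": 40},
    "subject1_post": {"taxon_a": 60, "taxon_b": 20, "taxon_d": 15, "taxon_e": 5},
}

numerator, denominator = split_by_loading(loadings, n=2)
log_ratios = {sample: amalgamated_log_ratio(c, numerator, denominator)
              for sample, c in counts.items()}

for sample, value in log_ratios.items():
    print(sample, value)
```

Each sample ends up with a single number that can be compared pre vs. post within an individual. How many features to put on each side (`n=2` here) is an arbitrary choice for the illustration.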

If it's helpful or interesting, I'll mention that I recently did something similar and our preprint is out (the manuscript is still under review). I used Gemelli to look for directional changes between tissue types between two survival groups. Swap tissue for time point and survival group for treatment, and I think it's similar.

Best,
Justine

nathaniel_hubert · March 24, 2023, 12:20am

Thank you @jwdebelius !
I am looking into those approaches now.
Very cool paper, and cool approach.

I haven't heard of these methods, but this tutorial looks very straightforward:

https://github.com/biocore/gemelli/blob/master/ipynb/tutorials/IBD-Tutorial-QIIME2-CLI.md

Repeated-measures experimental designs (e.g. time series) are a valid and powerful method to control for inter-individual variation. However, conventional dimensionality reduction methods cannot account for the high correlation of each subject to itself at a later time point. This inherent correlation structure can cause subject grouping to confound or even outweigh important phenotype groupings. To address this we will use Compositional Tensor Factorization (CTF), which we provide in the software package [gemelli](https://github.com/biocore/gemelli). CTF can account for repeated measures, compositionality, and sparsity in microbiome data.

In this tutorial we use _gemelli_ to perform CTF on a time series dataset comparing Crohn's and control subjects over a period of 25 weeks published in [Vázquez-Baeza et al](https://gut.bmj.com/content/67/9/1743). First we will download the processed data originally from [here](https://qiita.ucsd.edu/study/description/2538#). This data can be downloaded with the following links:

* **Table** (table.qza) | [download](https://github.com/biocore/gemelli/tree/master/ipynb/tutorials/IBD-2538/data/table.qza)
* **Rarefied Table** (rarefied-table.qza) | [download](https://github.com/biocore/gemelli/tree/master/ipynb/tutorials/IBD-2538/data/rarefied-table.qza)
* **Sample Metadata** (metadata.tsv) | [download](https://github.com/biocore/gemelli/tree/master/ipynb/tutorials/IBD-2538/data/metadata.tsv)
* **Feature Metadata** (taxonomy.qza) | [download](https://github.com/biocore/gemelli/tree/master/ipynb/tutorials/IBD-2538/data/taxonomy.qza)
* **Tree** (sepp-insertion-tree.qza) | [download](https://github.com/biocore/gemelli/tree/master/ipynb/tutorials/IBD-2538/data/sepp-insertion-tree.qza)

**Note**: This tutorial assumes you have installed [QIIME2](https://qiime2.org/) using one of the procedures in the [install documents](https://docs.qiime2.org/2020.2/install/). This tutorial also assumes you have installed [Qurro](https://github.com/biocore/qurro), [DEICODE](https://github.com/biocore/DEICODE), and [gemelli](https://github.com/biocore/gemelli).

First, we will make a tutorial directory and download the data above and move the files to the `IBD-2538/data` directory:

```bash
mkdir IBD-2538
```
```bash
# move downloaded data here
mkdir IBD-2538/data
```

This file has been truncated.

Just curious, is the method I proposed appropriate? Can it be done in QIIME (i.e., determine which taxa correlate with Axis 3)?

I am also wondering if there is an error in the feature-volatility command I shared. I tested it with states that differ predictably, and again there was only one taxon in the resulting important features.

Thank you very much! Nate

jwdebelius · March 24, 2023, 5:31pm

Hi @nathaniel_hubert,

I never want to say "never do this" because there are reasons you might end up doing it. However, the canonical and most recommended solution is a biplot, which places features and samples in the same space. You essentially get a feature loading out, which tells you about each feature's position in PCoA space. There should be a biplot method in q2-diversity that you could apply, so that would also be an option.
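If you do end up correlating taxa with Axis 3 outside of QIIME, the calculation itself is small. Below is a minimal sketch in plain Python, assuming you have exported the Axis 3 coordinates and a relative-abundance table with samples in the same order. The numbers and taxon names are made up, and the helper names are hypothetical. It only computes Spearman's rho; in practice a library routine such as `scipy.stats.spearmanr` would also give p-values, which would still need multiple-testing correction.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; NaN if either vector is constant."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical Axis 3 coordinates for six samples (made-up numbers).
axis3 = [-0.30, -0.12, -0.05, 0.04, 0.15, 0.28]

# Hypothetical relative abundances of three taxa in the same six samples.
rel_abundance = {
    "taxon_a": [0.01, 0.02, 0.04, 0.05, 0.09, 0.20],  # rises along the axis
    "taxon_b": [0.30, 0.22, 0.25, 0.10, 0.08, 0.02],  # mostly falls
    "taxon_c": [0.05, 0.05, 0.05, 0.05, 0.05, 0.05],  # constant
}

rho = {taxon: spearman(axis3, values) for taxon, values in rel_abundance.items()}
for taxon, value in rho.items():
    print(taxon, value)
```

Keep in mind that relative abundances are compositional, so these correlations are not independent of each other, which is part of why the biplot is the more standard route.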

I've not run feature-volatility, so I'm not sure. But there are lots of reasons you might only get 1 organism:

* You might not have enough statistical power.
* Your difference might be due to a variety of organisms across samples, and you need to look at an aggregate statistic rather than trying to find a single bug to blame.
* You might need to filter your data more stringently because you're paying too much of a correction penalty for sparse data.
* There might only be 1 interesting organism.
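To illustrate the correction-penalty point, here is a minimal, generic sketch. It is not a description of what feature-volatility does internally. It assumes a per-feature test that yields p-values and uses Benjamini-Hochberg adjustment; the counts, p-values, taxon names, and the `min_samples=3` threshold are all made up for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Walk from the largest p-value down so a running minimum keeps
    # the adjusted values monotone.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # rank of p_values[i] among ascending p-values
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def prevalence_filter(table, min_samples):
    """Keep features observed (count > 0) in at least min_samples samples."""
    return {feature: counts for feature, counts in table.items()
            if sum(1 for c in counts if c > 0) >= min_samples}

# Hypothetical counts: one list of six samples per feature.
table = {
    "taxon_a": [12, 30, 25, 8, 40, 19],
    "taxon_b": [0, 0, 3, 0, 0, 0],
    "taxon_c": [0, 1, 0, 0, 0, 0],
    "taxon_d": [7, 0, 9, 4, 0, 11],
}
kept = prevalence_filter(table, min_samples=3)

# Hypothetical raw p-values from some per-feature test (made-up numbers).
raw_p = {"taxon_a": 0.02, "taxon_b": 0.60, "taxon_c": 0.75, "taxon_d": 0.30}

adjusted_all = dict(zip(raw_p, benjamini_hochberg(list(raw_p.values()))))
kept_p = {feature: raw_p[feature] for feature in kept}
adjusted_kept = dict(zip(kept_p, benjamini_hochberg(list(kept_p.values()))))

print("all features: ", adjusted_all)
print("after filter: ", adjusted_kept)
```

The same raw p-value for `taxon_a` is adjusted less harshly once the two very sparse features are no longer counted as tests. The filter here depends only on prevalence, not on the grouping being tested, which is what keeps it from biasing the result.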

Best,
Justine

nathaniel_hubert · March 27, 2023, 2:43pm

Thank you, Justine!
I really appreciate your time and guidance.
Will let you know how it goes.
Nate