Hi @MicrobeManager,
I'm going to warn you up front, 'cause this will probably be another long answer.
So, for each data set, to get to the PCoA plot, you've got like 7 steps. Some of them are defined processes - there's no randomness to them, so if you put the same thing in, you'll get the same thing out every time. Some are sample independent - filtering samples won't make a difference to the results. Some are feature independent - filtering features won't change the results.
| step | defined process | sample independent | feature independent | computational cost |
|---|---|---|---|---|
| Import into QIIME 2 via a manifest | | | -- | |
| Primer trimming | | | -- | |
| Denoising with DADA2 | | -- | | |
| Computing a tree(?) | | | | |
| Rarefying your feature table | | | -- | |
| Computing distances | | | | |
| Computing the PCoA dimensionality reduction | | | | |
You'll notice that there are a couple of undefined processes in the table. DADA2 and rarefaction both have an element of randomness in their algorithms that can change the result.
With DADA2, the error model gets trained on a subset of sequences that are selected at random from the data. We hope that the data is stable enough to tolerate this random element and that it doesn't make a major change to our results (it usually doesn't). The chimera filtering in DADA2 is also dataset dependent, and this can be influenced by the samples/sequences you put in. So, if you're changing your sequences, you're changing your results slightly. ...You also might be burning computer time, which isn't so bad with smaller studies but gets to be a pain when you get to several hundred or several thousand samples. (It's easy to filter the tables, though! Check out the table filtering section in the data filtering tutorial and the q2-feature-table plugin for more filtering functions.)
You're also introducing some randomness into your process when you rarefy. Here, you essentially draw sequences at random out of your total population until you hit a certain depth. You select at random, so you can potentially get a different distribution across multiple rarefaction iterations. (I can't seem to find the plot I'm looking for, but there was an example in the old emperor docs.) I think you might be able to solve some of the stochasticity by averaging over multiple rarefaction rounds, or you can keep it to one random step by rarefying once and filtering afterwards.
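To make the random draw concrete, here's a minimal pure-Python sketch of rarefying a single sample. It's not how QIIME 2 implements it; the `rarefy` helper, the `ASV_*` names, and the counts are all made up for illustration.

```python
import random

def rarefy(counts, depth, seed=None):
    """Subsample one sample's feature counts to `depth` reads, without replacement."""
    if sum(counts.values()) < depth:
        raise ValueError("sample has fewer reads than the requested depth")
    # Expand the counts into one label per read, then draw `depth` reads at random.
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    drawn = random.Random(seed).sample(reads, depth)
    return {feature: drawn.count(feature) for feature in counts}

# Hypothetical sample: 1000 reads spread over four features.
sample = {"ASV_1": 500, "ASV_2": 300, "ASV_3": 150, "ASV_4": 50}

# Different seeds stand in for separate rarefaction runs; the counts can differ.
for seed in (1, 2, 3):
    print(seed, rarefy(sample, depth=100, seed=seed))

# Averaging over many rounds smooths out the run-to-run variation.
rounds = [rarefy(sample, depth=100, seed=s) for s in range(50)]
mean_counts = {f: sum(r[f] for r in rounds) / len(rounds) for f in sample}
print(mean_counts)
```

Every run sums to the same depth, but how the reads are split across features depends on the draw. Fixing the seed makes a run repeatable, which is the same idea as rarefying once and reusing that table.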
In general, rarefaction is relatively quick, but it scales with your sample size and depth. I'm currently working with about 500 metagenomic samples, and it takes 20 minutes to rarefy my full table to 1M sequences. While it gives me the excuse of "can't work, data processing", it's not terribly efficient if I want to make quick changes.
Next, you've got your distance calculation. This isn't random: if you put in the same table and use the same metric, you'll get the same distances out. Your distances are always calculated pairwise, and the distance that gets calculated depends only on that pair of samples. (It's slightly more complicated, but basically, the distance between LA and San Diego is the same whether we measure the distance from San Diego to San Francisco or not.)
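Here's a small sketch of that pairwise idea, using a hand-rolled Bray-Curtis on a toy table. The sample IDs and counts are invented, and this only covers dropping samples with a non-phylogenetic metric; filtering features, or using a tree-based metric, is where the "slightly more complicated" part comes in.

```python
from itertools import combinations

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))

def distance_matrix(table):
    """All pairwise distances for a {sample_id: counts} table."""
    return {
        frozenset(pair): bray_curtis(table[pair[0]], table[pair[1]])
        for pair in combinations(table, 2)
    }

# Hypothetical feature table: three samples, four features.
table = {
    "LA": [10, 0, 5, 5],
    "San_Diego": [8, 2, 6, 4],
    "San_Francisco": [0, 12, 3, 5],
}

full = distance_matrix(table)

# Option 1: drop a sample from the table and recompute every distance.
subset_table = {k: v for k, v in table.items() if k != "San_Francisco"}
recomputed = distance_matrix(subset_table)

# Option 2: keep the full matrix and just filter out the pairs we no longer want.
keep = set(subset_table)
filtered = {pair: d for pair, d in full.items() if pair <= keep}

# Both routes give the same LA / San_Diego distance.
print(recomputed == filtered)
```

Since each distance only looks at its own pair, filtering the matrix gets you to the same place as recomputing, without paying for the calculation again.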
The challenge with pairwise distances is that even though you get identical values every time, they take time to compute. There's been a lot of work in the past 5 or so years trying to speed them up for bigger datasets, but ultimately, these take time. Plus, you can always filter your distance matrix (see the tutorial here).
Now we calculate your PCoA based on the samples in your distance matrix and twist/stretch/flip the data to make the ordination work.
...But this ordination is a product of the samples in my distance matrix, which is the product of a stochastic rarefaction, which is the product of a stochastic process in DADA2.
Small changes in the upstream process can lead to sign flipping or slight changes in position, which could lead to flipping.
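As a rough illustration of why a flipped axis isn't a different result, the sketch below mirrors one axis of a made-up two-axis ordination and checks that the distances between samples are unchanged. The coordinates and the `flip_axis` helper are hypothetical, not something from QIIME 2 or emperor.

```python
import math

def pairwise_distances(coords):
    """Euclidean distance between every pair of samples in ordination space."""
    ids = sorted(coords)
    return {
        (a, b): math.dist(coords[a], coords[b])
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
    }

def flip_axis(coords, axis):
    """Mirror every point along one axis by negating that coordinate."""
    return {
        sample: tuple(-v if i == axis else v for i, v in enumerate(point))
        for sample, point in coords.items()
    }

# Hypothetical PCoA coordinates (axis 1, axis 2) for three samples.
ordination = {"s1": (0.30, -0.10), "s2": (-0.20, 0.25), "s3": (-0.10, -0.15)}

flipped = flip_axis(ordination, axis=0)

before = pairwise_distances(ordination)
after = pairwise_distances(flipped)

# The plot looks mirrored, but the relationships between samples are the same.
print(all(math.isclose(before[pair], after[pair]) for pair in before))
```

The same `flip_axis` call would need to be applied to the feature coordinates in a biplot, so the arrows stay lined up with the samples.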
Yep! I'd do it on import and then forget about it. If you're ever doing a biplot, just make sure you flip your features the same way!
Best,
Justine