Analyzing subset of microbiome data

fquerdasi · April 30, 2021, 6:37pm

Hi!

I am relatively new to microbiome data analysis and I was wondering if someone with more experience could provide some advice.

I am collaborating with another research group who has been collecting 16S data on children at multiple ages. They have sent me a phylogenetic tree, an OTU table, and a metadata file for a dataset which includes samples at Age 1 and samples at Age 2.

My question is, if I am interested in analyzing data from these two ages separately, will having them in one dataset which combines the two ages and one phylogenetic tree produce different diversity results than if I were to separately generate phylogenetic trees and then diversity metrics for the sample at each age?

Thank you in advance for any insight you can provide!

jwdebelius · April 30, 2021, 10:30pm

Hi @fquerdasi,

Welcome to the QIIME 2 forum!

I think the short answer is "possibly, but probably not enough to matter". (At least not in my practice.) If we break down the steps in your distance calculation:

Building the phylogeny. This is dictated by the available sequences if you go de novo (i.e. MAFFT), partially dictated if you use fragment insertion, and ignored with closed-reference OTUs. I don't know how much benchmarking has been done around exactly where things end up in which trees based on the sequences, but the general assumption is that they all generally end up in the same place. Or at least close enough for diversity-level estimation, which is what you're doing here.
Rarefaction. I'm assuming you plan to use something like UniFrac distance if you're asking about phylogeny. You'll need to do normalization and rarefaction. Because rarefaction is a stochastic process, you may see small variation in tables and distances based on which random version you get. This issue is more pronounced when you focus on unweighted metrics. You can avoid the rarefaction problem with Aitchison distance, or rPCA (which is a slightly different discussion).
Distances themselves. Once you've calculated the distance, however you've calculated it, the relationship between the two points shouldn't change. Theoretically, the distance between my house and my favorite burrito place is fixed. It doesn't care about the distance between my house and the best place in town to get ramen. (Although there is a slightly confusing relationship between the three distances.) Or, to put it another way, the distances between cities in California depend only on the cities in California and don't care whether or not your atlas also has information about Arizona.
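To make the last point concrete, here is a minimal sketch with made-up counts. It uses Bray-Curtis as a stand-in because it needs no tree (UniFrac would also need the phylogeny), and the sample names and helper functions are hypothetical, not part of QIIME 2. Assuming the feature table is held fixed, computing distances on the combined table and then filtering to Age 1 gives the same numbers as computing on the Age 1 samples alone.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def distance_matrix(table):
    """All pairwise distances, keyed by (sample_id, sample_id)."""
    ids = sorted(table)
    return {(i, j): bray_curtis(table[i], table[j]) for i in ids for j in ids}


def filter_distances(dm, keep):
    """Keep only the pairs where both samples are in `keep`."""
    keep = set(keep)
    return {(i, j): d for (i, j), d in dm.items() if i in keep and j in keep}


# Hypothetical toy counts: four samples, four features.
table = {
    "childA_age1": [10, 0, 5, 5],
    "childB_age1": [8, 2, 6, 4],
    "childA_age2": [2, 9, 4, 5],
    "childB_age2": [1, 12, 3, 4],
}
age1 = ["childA_age1", "childB_age1"]

# Route 1: one combined distance matrix, filtered afterwards.
combined_then_filtered = filter_distances(distance_matrix(table), age1)
# Route 2: distances computed on the Age 1 samples only.
computed_separately = distance_matrix({s: table[s] for s in age1})

print(combined_then_filtered == computed_separately)
```

The comparison only holds because both routes start from the same counts. If each age group were rarefied separately, the tables themselves could differ, which is the stochastic variation described in the rarefaction step.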

My approach tends to be building one tree, rarefying once, and then calculating my distances and filtering. Some of this is sheer practicality: building my table and distance matrix are often the most computationally intensive steps in processing my microbiome data, and I have to pay for supercomputer time. If I only do it once and then filter, it costs less. It also means that I'm dealing with fewer sources of stochastic variation in my analysis, which can be nice for reproducibility, especially because QIIME 2 doesn't have a seed for rarefaction.
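As a rough illustration of why rarefying once helps reproducibility, here is a toy rarefaction written in plain Python. It is a sketch of subsampling without replacement, not QIIME 2's implementation; the `rarefy` function, its `seed` argument, and the counts are all hypothetical. Every call draws a random subsample, so repeated runs only agree when the random state is identical, which is what reusing a single rarefied table effectively guarantees.

```python
import random


def rarefy(counts, depth, seed):
    """Subsample `depth` reads without replacement from one sample's counts."""
    if sum(counts) < depth:
        raise ValueError("sample has fewer reads than the requested depth")
    rng = random.Random(seed)
    # Expand the counts into one entry per read, labelled by feature index.
    reads = [feature for feature, n in enumerate(counts) for _ in range(n)]
    picked = rng.sample(reads, depth)
    rarefied = [0] * len(counts)
    for feature in picked:
        rarefied[feature] += 1
    return rarefied


sample = [50, 30, 15, 5]  # made-up read counts for four features

first = rarefy(sample, depth=40, seed=1)
again = rarefy(sample, depth=40, seed=1)
other = rarefy(sample, depth=40, seed=2)

print(first == again)          # identical random state, identical table
print(sum(first), sum(other))  # both tables sit at the requested depth
print(first, other)            # a different seed may give different counts
```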

There are, however, certain caveats.

If your feature table changes, you have to recalculate the distances.
If you filter your samples, your stats and visualizations need to be updated. Ordinations use the available samples to drive clustering, so if you hide samples, you're going to misrepresent your data.
The same holds for the DEICODE rPCA technique: if you filter your data, you'll have to recalculate.
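The ordination caveat can be sketched with the centering step that principal coordinates analysis (PCoA) applies before extracting axes. This is an illustrative toy, assuming a classical PCoA with Gower centering; the `gower_center` helper and the distance values are made up. The pairwise distances between the first two samples are unchanged by dropping the third, but the centered matrix that the ordination is built from is not, so the ordination has to be recomputed on the filtered set.

```python
def gower_center(dm):
    """Double-center the squared distances, the first step of classical PCoA."""
    n = len(dm)
    a = [[-0.5 * d * d for d in row] for row in dm]
    row_means = [sum(row) / n for row in a]
    grand_mean = sum(row_means) / n
    # The matrix is symmetric, so column means equal row means.
    return [
        [a[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


# Made-up symmetric distance matrix for three samples.
full = [
    [0.0, 0.2, 0.8],
    [0.2, 0.0, 0.7],
    [0.8, 0.7, 0.0],
]
# The same matrix with the third sample filtered out.
subset = [row[:2] for row in full[:2]]

centered_full = gower_center(full)
centered_subset = gower_center(subset)

print(full[0][1] == subset[0][1])                  # the distance itself is unchanged
print(centered_full[0][1], centered_subset[0][1])  # the centered values differ
```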

Best,
Justine

fquerdasi · May 4, 2021, 4:53pm

Thank you so much, Justine, for that thorough and incredibly helpful answer!