Beta diversity changes on subset of samples

jjmmii · June 11, 2018, 7:37am

Hello, I noticed that after subsetting my samples and re-running qiime diversity core-metrics-phylogenetic, the exported distance matrix (e.g. weighted UniFrac) gives a different distance for a given pair of samples than the one calculated without subsetting. I thought about this and believe it’s because beta diversity = gamma diversity / alpha diversity, so any change in the number of samples (or even just using different samples) changes gamma, and hence beta. Could someone please confirm this? Thank you.

Mehrbod_Estaki · June 12, 2018, 3:58pm

Hi @jjmmii,

While in theory your definition of beta diversity is not incorrect, in practice the distance between 2 samples should be the same regardless of the presence or absence of other samples (i.e. the 2 subsets in your case), because each pairwise distance is calculated from those 2 samples alone. Are you by chance using different sampling depths between your comparisons? That can for sure result in different numbers. Or perhaps you are using different trees in your phylogenetic-based metrics?
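To illustrate why a pairwise distance should not depend on which other samples are in the table, here is a minimal Python sketch. It uses Bray-Curtis as a simple stand-in for weighted UniFrac (which additionally needs the tree), and the sample IDs and counts are made up for illustration; it is not how QIIME 2 computes its matrices internally.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two equal-length count vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


def distance_matrix(table):
    """All pairwise distances for a {sample_id: counts} table."""
    ids = sorted(table)
    return {(i, j): bray_curtis(table[i], table[j]) for i in ids for j in ids}


# Hypothetical feature counts (same features, same order, in every sample).
full = {
    "S1": [10, 0, 5, 5],
    "S2": [4, 6, 0, 10],
    "S3": [0, 0, 20, 0],
    "S4": [5, 5, 5, 5],
}
# Subset that keeps only two of the four samples.
subset = {sample_id: full[sample_id] for sample_id in ("S1", "S2")}

d_full = distance_matrix(full)[("S1", "S2")]
d_sub = distance_matrix(subset)[("S1", "S2")]
print(d_full, d_sub)  # both values come from S1 and S2 only
```

Because each entry is computed from just the two count vectors involved, dropping S3 and S4 leaves the S1 to S2 distance unchanged, as long as the counts going in are the same.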

jjmmii · June 13, 2018, 4:34am

Hello @Mehrbod_Estaki, thanks for your quick reply. I would hope so, but no, I ensured the rooted-tree.qza and sampling depth were identical for both the entire set and the subset. Here are my commands and results:
For the entire dataset:
qiime diversity core-metrics-phylogenetic --i-table table_pick.qza --m-metadata-file metadata_including_health_and_grouping.tsv --i-phylogeny rooted-tree.qza --output-dir core-metrics-results --p-sampling-depth 18348
For the subset:
qiime diversity core-metrics-phylogenetic --i-table table_pick_freqFilt100_WG.qza --m-metadata-file metadata_including_health_and_grouping.tsv --i-phylogeny rooted-tree.qza --output-dir core-metrics-results-WG --p-sampling-depth 18348
Commands were run in the same directory. To further prove rooted-tree.qza is the same file, I have attached my two weighted UniFrac .qza files (101.2 KB and 290.9 KB). You can check in their provenance that it's the same.
But the beta diversity values are really different. Take samples W4-PIS10006 and 1-PIS10001 as an example. In the subset:

[image not preserved]

But in the entire dataset:

[image not preserved]
I wonder if anyone else has observed this on their own data? May I please ask you to try this on your data?

Thanks.
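For anyone who wants to repeat this check, the comparison between two exported distance matrices can be sketched in Python as below. It assumes each export is a tab-separated file whose header row lists the sample IDs after an empty first cell and whose rows each start with a sample ID; the file names, sample IDs, and values in the demo are hypothetical and are not the ones from this thread.

```python
import csv
import os
import tempfile


def read_distance_matrix(path):
    """Read a square TSV distance matrix into {(row_id, col_id): distance}."""
    with open(path, newline="") as handle:
        rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    col_ids = rows[0][1:]  # header row: empty first cell, then sample IDs
    return {
        (row[0], col_id): float(value)
        for row in rows[1:]
        for col_id, value in zip(col_ids, row[1:])
    }


def compare(full_path, subset_path):
    """Absolute difference for every sample pair present in both matrices."""
    full = read_distance_matrix(full_path)
    subset = read_distance_matrix(subset_path)
    return {pair: abs(full[pair] - subset[pair]) for pair in set(full) & set(subset)}


def write_tsv(path, ids, matrix):
    """Helper for the demo only: write a matrix in the assumed TSV layout."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow([""] + ids)
        for sample_id, row in zip(ids, matrix):
            writer.writerow([sample_id] + row)


# Demo with made-up matrices written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    full_path = os.path.join(tmp, "full.tsv")
    subset_path = os.path.join(tmp, "subset.tsv")
    write_tsv(full_path, ["A", "B", "C"],
              [[0.0, 0.25, 0.5], [0.25, 0.0, 0.75], [0.5, 0.75, 0.0]])
    write_tsv(subset_path, ["A", "B"], [[0.0, 0.26], [0.26, 0.0]])
    diffs = compare(full_path, subset_path)

for pair in sorted(diffs):
    print(pair, diffs[pair])
```

Listing the differences for all shared pairs, rather than a single pair, makes it easier to see whether the discrepancy is systematic or just small noise.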

Mehrbod_Estaki · June 13, 2018, 5:17am

Hi @jjmmii,

Thanks for providing your artifacts! It looks to me as though the distances are very close between the two runs; for instance, in your example above the difference is only about 0.0013! If I had to guess, I would think that this small difference arises from the random nature of subsampling (rarefying to the sampling depth). I couldn’t find anything related to a seed in the QIIME 2 source code on GitHub, but perhaps someone more familiar with the source code can confirm this.
It is an interesting point though, and it may be a good idea to either fix the seed used for random number generation or expose a parameter that lets the user select the seed. This would ensure exact reproducibility.
In the meantime though, you can carry on with your analyses without any worries. I very much doubt that differences this small will alter your results in any way.
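As a rough illustration of the point about seeds, the Python sketch below rarefies two made-up samples to a fixed depth and computes a distance between them. It uses Bray-Curtis instead of weighted UniFrac, the counts and depth are hypothetical, and the subsampling routine is a generic one, not QIIME 2's actual implementation.

```python
import random


def rarefy(counts, depth, rng):
    """Draw `depth` reads without replacement from a vector of feature counts."""
    reads = [feature for feature, n in enumerate(counts) for _ in range(n)]
    rarefied = [0] * len(counts)
    for feature in rng.sample(reads, depth):
        rarefied[feature] += 1
    return rarefied


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two equal-length count vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


# Hypothetical samples with 1000 reads each, rarefied to 200 reads.
sample_a = [500, 300, 150, 50]
sample_b = [200, 400, 100, 300]
depth = 200


def rarefied_distance(seed):
    """Distance between the two samples after seeded rarefaction."""
    rng = random.Random(seed)
    return bray_curtis(rarefy(sample_a, depth, rng), rarefy(sample_b, depth, rng))


same_1 = rarefied_distance(42)
same_2 = rarefied_distance(42)
runs = [rarefied_distance(seed) for seed in range(10)]

print(same_1 == same_2)       # identical seed, identical distance
print(min(runs), max(runs))   # different seeds give slightly different distances
```

With a fixed seed the two runs agree exactly, while different seeds give distances that differ slightly from run to run, which is the kind of small discrepancy discussed above.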
