Assessing only taxonomic diversity for each sample?

ojholland · May 2, 2019, 7:29am

Hi there,

I'm really new to using QIIME2 and bioinformatics in general. I am currently assessing a small population of samples and I have followed the moving pictures tutorial as the basis of my assessment. I have generated all artifacts up to the bar chart visualisation step. I want to know if there is a way I can see how many unique assigned taxonomic OTUs there are in each individual sample from a pure diversity standpoint; a taxonomic diversity assessment for each sample. For example:

Sample 1
OTU1
OTU2
OTU3

Sample 2
OTU1
OTU3
OTU4

Sample 3
OTU3
OTU5
OTU6

and so on..

I want to do this so I can then take the taxonomic diversity from each sample and calculate how many samples are required to capture 95% of the variability.

Thanks very much.

jwdebelius · May 2, 2019, 7:50am

Hi @ojholland,

Welcome!

But, also, whoa boy, this is a loaded question because it sounds like a power question, and... power is a complex problem, particularly in microbiome research... (Also one I happen to really like.) As such, I've moved it over into a general discussion topic vs user support, because it's a lot more theoretical there.

This sounds a lot like something you might want to tackle with beta diversity. It's not a measure of purely unique ASVs/OTUs/whatever, but it is a measure of shared or unshared features. Jaccard would address this question nicely: the Jaccard similarity is the number of OTUs shared by two samples divided by the total number of OTUs found in either sample, and the Jaccard distance is one minus that.
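As a minimal sketch of what Jaccard measures, here it is computed by hand in plain Python on the three example samples from the question. The function name `jaccard_distance` and the dictionary layout are illustrative assumptions, not QIIME 2 API; in practice the distance matrix would come from the QIIME 2 diversity plugin.

```python
def jaccard_distance(sample_a, sample_b):
    """Jaccard distance between two samples given as collections of feature IDs."""
    a, b = set(sample_a), set(sample_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty samples count as identical.
        return 0.0
    # 1 - (features shared by both) / (features found in either)
    return 1.0 - len(a & b) / len(union)


# Presence/absence data copied from the example in the question.
samples = {
    "Sample 1": {"OTU1", "OTU2", "OTU3"},
    "Sample 2": {"OTU1", "OTU3", "OTU4"},
    "Sample 3": {"OTU3", "OTU5", "OTU6"},
}

names = sorted(samples)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        distance = jaccard_distance(samples[first], samples[second])
        print(f"{first} vs {second}: {distance:.2f}")
```

Samples 1 and 2 share two of four OTUs, so they end up closer to each other than either is to Sample 3, which shares only OTU3 with them.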

You've got a potential problem here: using a dataset to measure this is hard because microbiome data is inherently sparse, particularly in free-living organisms. Additionally, what you observe depends on sequencing depth and the number of samples. If I've got a feature that is present in 10% of my samples at an abundance of 1 in 5000 sequences, and I sequence 10 samples to 1000 sequences/sample, I may not see the feature... or I might only see it in 1 sample. If I've got 100 samples that I've sequenced to 10000 sequences/sample, I may actually see the feature in, like, 5 or 10 samples. For this to work, your experimental parameters have to be pretty fixed.
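A back-of-the-envelope sketch of that arithmetic, under some loud assumptions: reads are treated as independent draws, the feature sits at exactly 1 in 5000 sequences wherever it is present, and the second scenario is read as 100 samples at 10000 sequences/sample. The function names are hypothetical helpers, not part of any package.

```python
def detection_probability(relative_abundance, depth):
    """Chance of drawing at least one read of a feature in one sample.

    Assumes each of the `depth` reads is an independent draw.
    """
    return 1.0 - (1.0 - relative_abundance) ** depth


def expected_samples_with_feature(n_samples, prevalence, relative_abundance, depth):
    """Expected number of samples in which the feature is actually observed."""
    carriers = n_samples * prevalence  # samples that truly contain the feature
    return carriers * detection_probability(relative_abundance, depth)


# Scenario 1: 10 samples at 1000 sequences/sample.
small = expected_samples_with_feature(10, 0.10, 1 / 5000, 1000)
# Scenario 2 (assumed reading): 100 samples at 10000 sequences/sample.
large = expected_samples_with_feature(100, 0.10, 1 / 5000, 10000)

print(f"10 samples at depth 1000: about {small:.2f} samples show the feature")
print(f"100 samples at depth 10000: about {large:.2f} samples show the feature")
```

Under these assumptions the small design expects well under one sample with the feature, while the large one expects it in most of the roughly ten samples that carry it.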

From the standpoint of calculating statistical power, Kelly et al. developed a method to address statistical power with beta diversity. Your actual mileage may vary (it's only really implemented for one metric), and IMO, it tends to underestimate power for real experiments, but if you need a power calculation, that's my recommendation.
Worth noting that this is implemented in R, not QIIME 2, but qiime2R is a brilliant package which we are all lucky to have, and it will get your data over nicely.

If you just want to capture alpha diversity, you can (mostly) model it with a standard power calculation, with a non-parametric penalty. In my experience, it's asymptotically normal for unweighted metrics, but best to penalize anyway.
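One way to sketch "a standard power calculation with a non-parametric penalty" is the normal-approximation sample size for comparing two group means, inflated by a fixed percentage. The 15% inflation used here is a common rule of thumb for rank-based tests and is an assumption of this sketch, not something specified in this thread; the effect size (Cohen's d) is a hypothetical input you would have to estimate from pilot data.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.8, nonparametric_penalty=0.15):
    """Approximate samples per group for a two-sided, two-group comparison.

    Uses the normal approximation n = 2 * ((z_alpha + z_beta) / d)^2 and then
    inflates the result by `nonparametric_penalty` (assumed rule of thumb).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n_parametric = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    return math.ceil(n_parametric * (1 + nonparametric_penalty))


# Hypothetical effect sizes, purely for illustration.
for d in (0.5, 0.8, 1.2):
    print(f"d = {d}: about {samples_per_group(d)} samples per group")
```

The normal approximation runs slightly low compared with a t-based calculation at small sample sizes, so treat the numbers as a floor rather than a target.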

In terms of feature-based power... there's not a great formal power calculation, but just assume that you will commit type II error. Current thought is that OTU counts follow a power law distribution, and again, what you see will be a function of depth, sample size, and technique. ANCOM/Phylofactor/PhILR/Gneiss don't really have power calculations because the partitions are non-independent, among a whole bunch of other problems, so... But, also just make the assumption based on GWAS studies, which require hundreds of thousands of samples to detect SNPs in an at least somewhat common genetic background (although my resident geneticist would be laughing at this oversimplification).

Best,
Justine

ojholland · May 27, 2019, 3:09am

Thanks for your comprehensive reply @jwdebelius!

I neglected to mention that I am not working with microbiomes for this project. Instead I am focusing on the 23S gene region as part of a dietary study of a herbivorous gastropod using gut contents. The reason I want to calculate how many samples are required to capture 95% of the variability is because I want to calculate the minimum number of samples required to perform comprehensive taxonomic analysis.

I produced a pairwise Jaccard distance matrix, however I don't think this is what I'm after. After a bit of fiddling around I realised the general information I needed was in my feature table, which I collapsed and used vegan in R to do some basic diversity analyses. This has produced plots with enough detail to outline an approximate level of sampling required for my work.
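For anyone landing here later, this is a rough Python sketch of the kind of analysis described above: a sample-based taxon accumulation curve built by pooling samples in random order. It is not the vegan code that was actually run, the function names are hypothetical, and the toy data are the three example samples from the first post rather than real results. Note that "95%" here means 95% of the taxa observed across all samples, not of the true diversity.

```python
import random


def accumulation_curve(samples, n_permutations=200, seed=0):
    """Mean number of distinct taxa seen after pooling 1..N samples in random order."""
    rng = random.Random(seed)
    taxa_sets = [set(taxa) for taxa in samples.values()]
    totals = [0] * len(taxa_sets)
    for _ in range(n_permutations):
        rng.shuffle(taxa_sets)
        seen = set()
        for position, taxa in enumerate(taxa_sets):
            seen |= taxa
            totals[position] += len(seen)
    return [total / n_permutations for total in totals]


def samples_needed(curve, fraction=0.95):
    """Smallest number of samples whose mean richness reaches `fraction` of the observed total."""
    target = fraction * curve[-1]
    for n_samples, richness in enumerate(curve, start=1):
        if richness >= target:
            return n_samples
    return len(curve)


# Toy presence/absence data taken from the example in the first post.
samples = {
    "Sample 1": {"OTU1", "OTU2", "OTU3"},
    "Sample 2": {"OTU1", "OTU3", "OTU4"},
    "Sample 3": {"OTU3", "OTU5", "OTU6"},
}

curve = accumulation_curve(samples)
print("Mean taxa after 1..N samples:", [round(value, 2) for value in curve])
print("Samples needed for 95% of observed taxa:", samples_needed(curve))
```

With only three samples the curve cannot level off, so the answer is simply all of them; the approach only becomes informative once the curve flattens well before the last sample.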

Thank you for your help anyway. I have another question but I'm going to open a new thread for it.

jwdebelius · May 27, 2019, 4:39pm

Hi @ojholland,

I'm glad you found a solution!

Best,
Justine
