Question about rarefied data in Qiime2

eburchard · March 6, 2019, 6:35pm

Hello all,

I wanted to know if there is any difference, in terms of statistical outcome, between simply setting a sampling depth with “--p-sampling-depth” when running “core-metrics-phylogenetic” and first creating a rarefied table with “feature-table rarefy”, then using that table as input for “core-metrics-phylogenetic”. Up until this point I have been doing the latter, as that is how I was taught, but I’ve often wondered what the point was when you can set the sampling depth in the “core-metrics-phylogenetic” command anyway. Now I’m a bit concerned that I’ve been skewing my data in some way by doing this. Have I been?

Thanks,

Erik

ebolyen · March 6, 2019, 6:53pm

Hi @eburchard,

The core-metrics-phylogenetic pipeline (and its non-phylogenetic counterpart) runs rarefy internally, which is why the parameter is there. Assuming you used the same depth both times, nothing interesting will change: the sub-sampling draws without replacement, so at that depth it ends up pulling every read out of your already-rarefied table into a new table with the same counts.
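To make the same-depth case concrete, here is a minimal sketch in plain Python. It is not QIIME 2's implementation: the `rarefy` helper, the feature names, and the counts are all made up for illustration, and it only assumes uniform sampling of reads without replacement.

```python
import random
from collections import Counter

def rarefy(counts, depth, seed=None):
    """Hypothetical helper: subsample one sample's counts to `depth` reads,
    drawing without replacement."""
    # Expand the counts into one list entry per read.
    reads = [feature for feature, n in counts.items() for _ in range(n)]
    if depth > len(reads):
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)
    # random.sample draws without replacement.
    return Counter(rng.sample(reads, depth))

# Made-up sample with 100 reads in total.
sample = {"featureA": 60, "featureB": 30, "featureC": 10}

once = rarefy(sample, 50, seed=1)   # first rarefaction to depth 50
twice = rarefy(once, 50, seed=2)    # second rarefaction, same depth, new seed

# Drawing 50 reads without replacement from a 50-read table takes every read,
# so the second table has the same counts as the first.
print(once == twice)  # True
```

The second call uses a different seed on purpose: the result does not depend on the random draw, because there is nothing left to leave out.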

If your sampling depth value is lower than the depth of your originally rarefied table, you may want to re-run your analysis. In the spirit of subsampling, I don’t think subsampling twice will matter; however, I am NOT a statistician, so I cannot speak to what actually happens to things like mean and variance in that case. What I do know is that there is a lot of valid contention around the notion of subsampling sequencing data, so I don’t think anyone will be particularly impressed to hear it was done twice with different values.
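For what it's worth, the simplest version of the different-depths case can be checked exactly on a toy sample. The sketch below is only an illustration under stated assumptions: uniform sampling of reads without replacement, a single made-up six-read sample, and hypothetical helper names. It is not QIIME 2 code, and it does not settle whether rarefying (once or twice) is appropriate for a given analysis.

```python
from fractions import Fraction
from itertools import combinations

def subsample_distribution(reads, depth):
    """Exact probability of each resulting count table when `depth` reads
    are drawn uniformly without replacement from `reads`."""
    subsets = list(combinations(range(len(reads)), depth))
    p = Fraction(1, len(subsets))  # every subset of reads is equally likely
    dist = {}
    for idx in subsets:
        table = tuple(sorted(reads[i] for i in idx))
        dist[table] = dist.get(table, 0) + p
    return dist

def two_stage_distribution(reads, first_depth, second_depth):
    """Exact distribution after subsampling to `first_depth`, then
    subsampling that result again to `second_depth`."""
    first = list(combinations(range(len(reads)), first_depth))
    dist = {}
    for idx in first:
        kept = [reads[i] for i in idx]
        for table, p in subsample_distribution(kept, second_depth).items():
            dist[table] = dist.get(table, 0) + p / len(first)
    return dist

# Made-up sample: six reads across three features.
reads = ["A", "A", "A", "B", "B", "C"]

direct = subsample_distribution(reads, 2)        # straight to depth 2
staged = two_stage_distribution(reads, 4, 2)     # depth 4 first, then depth 2

print(direct == staged)  # True
```

Because the probabilities are exact fractions rather than simulated frequencies, the comparison is an equality check and not a tolerance check. On this toy input the two routes give the same distribution of tables at the lower depth.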

I think the third option, and the one that might fit you best, is to not use the core-metrics pipelines at all. You can run the individual diversity calculations yourself via diversity alpha/beta[-phylogenetic], which gives you a lot more control. All core-metrics does is string together rarefy with some generic diversity computations; you can certainly do that yourself and not waste time on measures/metrics that aren’t of interest to you.

Hope that’s useful. There are a lot of different strategies for normalization, and rarefying is one of the simpler ones.

eburchard · March 6, 2019, 6:58pm

Thanks! That is the answer I was hoping for, since I used the same subsampling value twice! Phew…

Also, I will definitely go with option three from now on, as we only use a few of the computations generated in that pipeline anyway.

Cheers!
