Partially subsample or not?

jwdebelius · April 15, 2019, 12:30pm

Welcome! Differences in sequencing depth are always a fun problem and, fortunately or unfortunately, a common one!

This is difficult and something you need to consider carefully in your experiment, particularly if you're working with low-biomass samples. Salter et al. ran into one of my greatest fears with this experimental design: their technical variation got misinterpreted as biological variation and led to incorrect results. So, one of the first questions is whether or not you think the biological variation is large enough to swamp the technical variation. MBQC might also be a good resource to look at.

Assuming that you're okay with that...

I'm a big proponent of filtering your data based on abundance in two aspects. First, you should be dropping out low-abundance samples. My absolute base threshold is 1K reads/sample, but if there's a natural break in your data or other split point, I'd recommend using that. Obviously, you want to optimize for the number of samples retained while minimizing issues due to low sequencing depth. This tends to be closely related to rarefaction, a normalization technique based on subsampling that's frequently used with diversity methods. Weiss et al., 2017 is a nice summary of when to use rarefaction. However, if you're interested in rarefaction-less approaches, you should also look into q2-breakaway, which provides rarefaction-less alpha diversity, and q2-deicode, which is based on Aitchison distance and is therefore also rarefaction-less.
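To make the depth filter and its link to rarefaction concrete, here is a minimal pure-Python sketch. It is not how QIIME 2 implements either step; the function name `filter_and_rarefy`, the dict-of-dicts table layout, the toy counts, and the choice to rarefy to exactly the filtering threshold are all assumptions made for illustration.

```python
import random
from collections import Counter


def filter_and_rarefy(table, min_depth=1000, seed=0):
    """Drop samples below min_depth, then subsample the rest to min_depth.

    `table` maps sample ID -> {feature ID: read count}. The layout and the
    function name are hypothetical, chosen only for this sketch.
    """
    rng = random.Random(seed)
    rarefied = {}
    for sample, counts in table.items():
        # Step 1: discard samples whose total read count is too low.
        if sum(counts.values()) < min_depth:
            continue
        # Step 2: expand counts to one entry per read and draw min_depth
        # reads without replacement (this is the rarefaction step).
        reads = [feature for feature, n in counts.items() for _ in range(n)]
        drawn = rng.sample(reads, min_depth)
        rarefied[sample] = dict(Counter(drawn))
    return rarefied


# Toy feature table (made-up counts): one shallow sample, two usable ones.
table = {
    "low": {"a": 300, "b": 200},
    "mid": {"a": 600, "b": 400},
    "deep": {"a": 4000, "b": 900, "c": 100},
}

rarefied = filter_and_rarefy(table, min_depth=1000)
for sample, counts in rarefied.items():
    print(sample, sum(counts.values()))
```

The "low" sample is dropped, and every retained sample ends up with the same total number of reads. A higher threshold gives more reads per sample but keeps fewer samples, which is the trade-off described above.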

Next, your compositionally aware techniques should be able to handle the differences in sampling depth because they're based on relative frequencies. I'd also look into q2-perc-norm as a cross-run normalisation method to feed into your compositional metrics.
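As a small illustration of why relative-frequency methods are insensitive to depth, the sketch below applies a centered log-ratio (CLR) transform to the same community "sequenced" at two depths and computes the Aitchison distance (the Euclidean distance between CLR vectors). The counts are made up, and the example assumes there are no zero counts; real data needs a pseudocount or another zero-handling strategy, and real samples at different depths also differ by sampling noise.

```python
import math


def clr(counts):
    """Centered log-ratio transform of a vector of strictly positive counts."""
    logs = [math.log(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


def aitchison_distance(counts_a, counts_b):
    """Euclidean distance between the CLR-transformed vectors."""
    return math.dist(clr(counts_a), clr(counts_b))


# Same proportions (10% / 20% / 70%) at a 10-fold difference in depth.
shallow = [10, 20, 70]
deep = [100, 200, 700]
# A sample with genuinely different proportions, for comparison.
other = [50, 30, 20]

same_community = aitchison_distance(shallow, deep)
different_community = aitchison_distance(shallow, other)
print(same_community, different_community)
```

The first distance is zero up to floating-point error because multiplying every count by a constant shifts all the logs by the same amount, which the centering removes. The second distance is clearly positive because the proportions themselves differ.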

Finally, you might consider bringing sequencing depth into a multivariate model. Although Adonis is usually used with rarefied data, it is multivariate and useful for beta diversity. Rarefaction-based alpha diversity can be passed into a linear regression model if (a) your dataset is big enough to satisfy your assumption of asymptotic normality of residuals or (b) you run a permutative regression. (I'm not sure about breakaway and how you'd propagate the error through; I'm sure it's possible but I'm equally sure I don't necessarily want to try.) For feature-based analysis, gneiss, phylofactor, and PhILR all have nice implementations for compositionally aware multivariate models. (Only gneiss is implemented in QIIME 2, but I'm on a phylofactor kick recently because the unrooted tree is just so nice.)

Best,
Justine