Partially subsample or not?

Hello,

I am writing to you, because I would really appreciate your opinion on an issue that I have with the data I’m currently analyzing.

I will try to describe my dilemma as succintly as possible:

  • I have NGS amplicon data for three groups.

  • The data was sequenced on two different runs.

  • Unfortunately, the sequencing facility did not randomize the samples as instructed and all samples from group 1 were sequenced in the first run, and the group2 and group3 were sequenced together in the second.

  • The number of reads in group 1 is much higher (100000+ compared to 10 000-20 000) than in other two groups.

  • I would like to use methods that take compositionality into account (like in phylofactor, or codaseq package) and also do not want to filter data by abundance or anything.

  • However, I am a bit worried by much higher number of reads in the group 1.

So my dilemma is:

  1. use the data as is

  2. subsample the group 1 samples to random (various) depths within the range of other two groups to account for any possible systematic variation due to the number of reads.

I do not have anybody in my vicinity to discuss this problem with, and I would really like to have some opinions on this before I start the main analysis.

Thanks in advance,

M

Hi @Maebh_Maebh,

Welcome! Difference in sequencing depth are always a fun problem, but fortunately or unfortunately a common one!

This is difficult and something you need to consider carefully in your experiment, particularly if you're working with low biomass samples. Salter et al ran into one of my greatest fears with this experimental design: their technical variation got mis-interepreted as biological variation and lead to incorrect results. So, one of the first questions is whether or not you think the biological variation might swamp thing. MBQC might also be a good resource to look at.

Assuming that you're okay with that...

I'm a big proponent of filtering your data based on abundance in two aspects. First, you should be dropping out low abundance samples. My absolute base threshold is 1K reads/sample, but if there's a natural break in your data or other split point, I'd recommend using that. Obviously, you want to optimze for the number of samples retained while minimizing issues due to low sequencing depth. This tends to be closely related to rarefaction, a normalization technique based on subsampling that's frequently used with diversity methods. Weiss et al, 2017 is a nice summary of when to uses rarefaction. However, if you're interested in rarefaction-less approaches, you should also look into q2-breakaway, which is rarefaction-less alpha diversity and q2-deicode, which is based on Aitchison distance and therefore also rarifaction-less.

Next, your compositionally aware techniques should be able ot handle the differences in sampling depth because they're based on relative frequencies. I'd also look into q2-perc-norm as a cross-run normalisation method to feed into your compositional metrics.

Finally, you might consider bringing sequencing depth into a multivariate model. Although Adonis is usually used with rarified data, it is multivariate and useful for beta diversity. Rarefaction-based alpha diversity can be passed into a linear regression model if (a) your dataset is big enough to satisfy your assumption of asymptotic normality of residuals or (b) you run a permeative regression. (I'm not sure about breakaway and how you'd propagate the error through; Im sure its possible but Im equally sure I don't necessarily want to try.) For feature-based analysis, gneiss, phylofactor, and phILR all have nice implementations for compositionally-aware multivariate models. (Only Gneiss is implemented in qiime2, but Im on a phylofactor kick recently because the unrooted tree is just so nice).

Best,
Justine

5 Likes

Thank you, Justine, for such a detailed and well argued answer. And super quick!

BlockquoteThis is difficult and something you need to consider carefully in your experiment, particularly if you’re working with low biomass samples.

It is acutally two types of tissues - one is low biomass, the other isn't (but the problem is the same for both, they just sequenced everything from group1 in one run).

Maybe I should also clarify that I'm interested in beta diversity here (I'm happy with subsampling and classical approach for alpha).

Interestingly, the results for the low-biomass tissue look the (more or less) same, regardless if I use Bray-Curtis and subsampled data, Aitchison on full dataset or applying my option 1.
It is actually for the high-biomass tissue where the results differ: subsampled (all samples, minimum depth) or abundance-filtered dataset don't distinguish between the group 1 and the rest, whereas the opposite is the case for the unsubsampled data set (although, this is true for both my options from the original question).

BlockquoteI’m a big proponent of filtering your data based on abundance in two aspects. First, you should be dropping out low abundance samples. My absolute base threshold is 1K reads/sample, but if there’s a natural break in your data or other split point, I’d recommend using that.

I always remove low abundance samples, and I'm still well above your absolute threshold :).

I also understand that compositional methods should be fine with differences in sequencing depths and rare OTUs as long as there is not systematic variation in them.

And it seems from my data that there is. Because the difference between the group 1 and two other disappears after subsampling or abundance filtering (which I would simply like to avoid, cos it adds another layer of complexity - do you filter by abundance, relative abundance, mean abundance, per sample, per dataset....huh :D).

Anyway, thanks for a great answer and links, going to read a bit now :).

M

Clarification definitely helps!

So first, is "don't distinguish" statistical or in PCoA? Because there's a lot of variation that can be hidden between ordination space and the underlying data. (It's a good day in my life when I can get PCs 1, 2, and 3 to explain more than 20% of my variation.) If this is in ordination space, I think this should re-enforce the need for normalisation in your beta diversity measurement, whether you like it or not ¯\_(ツ)_/¯?

If it were my data, I'd probably rarify and have done with it because I like phylogenetic metrics and there isn't a compositionally-aware phylogenetic metric AFAIK. But, I think you might like breakaway that I linked above, since it does normalization on a per-plate basis.

BlockquoteSo first, is “don’t distinguish” statistical or in PCoA? Because there’s a lot of variation that can be hidden between ordination space and the underlying data.

It's both. Rarefying with Bray-Curtis ends up with all groups being the same, use of full dataset with aitchison clearly sets apart one group.

However, I decided to go with filtering :). Maybe something like this..

I really like that normalization thing that you've linked, however my groups are confounded with plates, unfortunately. I mean, at least for comparing the tissues (which are very different), one could do that, using one tissue as control on both plates. However, I don't know if it would make sense to use that particular normalization for comparisons within tissues as well (plate one: group1 tissues 1 and 2; plate two: groups 2+3, tissues 1 and 2).