Are outliers driving my significant differences in abundance?

Stephan_Bitterwolf · May 3, 2024, 1:22pm

Hi all,

I am comparing the impact of herbicide and pharmaceutical pollution on coral microbiomes. I ran a huge experiment a few years ago, snap froze the corals, extracted the DNA, and sequenced the V4 region. Then I imported the data with Dada2 and have been doing differential abundance analysis using DESEQ2.

Deseq found taxa significantly different between my treatments. However, when I look at the normalized data the difference is sometimes driven by a single replicate (out of 7). Is it truly significantly different then? It seems like if the difference is due to a single high value, I should just omit that sample or something of that nature.

Here is an example where DESEQ finds that the abundance of ASV2530 is significantly greater in the Atrazine treatment. However, this seems to be due to a single value and not multiple such as in the A+E treatment.

I would love to hear if you have experienced something like this before and how you have dealt with it. Thanks!

Ps. I will also be running other differential abundance tests like ANCOM-II and ALDEX

timanix · May 4, 2024, 6:02am

Hello!
Let's start from the fact that Deseq2 is actually not the best choice for amplicon data! So I would not invest more efforts in that direction.
You already wrote that you are going to perform Ancombc2 and Aldex2 tests. Both packages are great and should handle amplicon counts better than Deseq2.
Regarding outliers, IMHO there is a very thin line between removing outliers and data manipulation. One should have really good reasons to delete outliers from the dataset.
Instead, I prefer to filter input tables with feature counts based on relative abundance and prevalence. For example, I filter table with absolute abundances, removing features with less than 1% or 0.5% relative abundance and that are found in less than 2-3 samples. Then I trust the output from Ancombc2, Aldex2 or LEfSe.

Best,

jwdebelius · May 8, 2024, 4:34pm

One more thought, if it's okay with both of you, @timanix and @Stephan_Bitterwolf!

I'm going to ECHO @timanix's position on compositional data analysis, with a slight caveat.

It looks like the "outliers" you're showing are outliers because they have reads mapped to the feature. So, the problem doesn't appear to be "outliers" as much as "present" (>10 reads) vs "absent" (0 reads). I don't know what the underlying biological mechanism might be for your ASV and treatments, but I can also come up with plausible biological/ecological mechanisms where a treatment might regulate either the presence of an organism or the observed presence.

I think applying a filter and seeing fi you retain ASV2530 would be worth while. I tend to pre-filter my data before I run several parallel analyses since its unfair to expect differential abundance tests to give me comparable results if they don't all start from the same set of features. I often the use composite filter in q2-feature-table, which defines a "presence" threshold along with a requirement that the sample be present in at least some portion of samples. (I usually use 1/rare depth in 10% under the assumption that my rarefaction depth is linked to the depth of my shallowest sample, accounting for personal neuroses, so I can detect the feature in my shallowest sample.) Based on your sample size, I'm not sure what would survive that type of filtering.

A second thought might be to consider just modeling the data (and thsi ASV in particular) as present/absent based on a cutoff of say 10 normalized or higher. It potentially shifts the type of modeling you need to do; I don't think we have any good options for prevalence-based differential abundance in QIIME 2. Its a slightly different set of assumptions from an abundance-based method, but might provide insight in a case like this. I'd recommend discussing with a biostatistican, if you've got one handy.

Best,
Justine