Adonis and ANCOM with rare features

jwdebelius · December 16, 2020, 11:13pm

I'm going to disrupt some of the order and jump around.

Metadata is one of the hardest parts of any microbiome project. I swear, I have multiple projects where I've spent more time and effort formatting metadata than I have actually doing analysis. Keemei may help some, it's a good first check. If you've already validated there, I would suggest moving into R or python with pandas to do a more in-depth check. I think metadata validation might be a good discussion as a separate topic.

hsapers:

Yes - my sample size is quite low - it’s a very difficult environmental system to access and it just isn’t possible to increase sample size other than to keep going back and use different time points as replicates. The problem is the source community isn’t stable enough to assume these are replicates. Unless I consider that a between sample comparison? If I treated each time point like a different subject in a gut microbiome experiment - it’s the same source community, but variation over time leads to slight differences - perhaps analogous on how there are slight differences in the starting gut microbiomes of different subjects?

This is the basic set up (with slight variations in design between deployments because something always breaks in the field):

Slow: 2x glass substrate; 2x crush substrate (all containers plastic)
Fast: 2x glass substrate (glass container); 2x crush substrate (glass container); 2x crush substrate (plastic-lined glass container)

After indicating that there is not a significance difference between 2x crush substrate (glass container) and 2x crush substrate (plastic-lined glass container) in the fast condition, these were pooled into the same conditions such that:

Slow: 2x glass substrate; 2x crush substrate (container material irrelevant)
Fast: 2x glass substrate; 4x crush substrate (container material irrelevant)

Thank you for the description! It really helps to visualize the design. Although I think one caveat to consider in your combination if you used a single time point is that your permutative p-value is bounded by the number of possible permutations and there are 4 ways to permute 4 samples, giving you 24 possible permutations, for a minimum p-value of 0.04.

As I conceptualize the design and try to find an analogy that's maybe easier to find a model for, analyzes that work with nested animal designs might help you. So, for rodents, we have to account for a cage effect because mice are coprophagic little bastards (and rats are coprophagic sweethearts). So, as you have 2-4 independent environmental systems in each group, and within each system, you have 5 nested timepoints. So, you need modeling that lets you account for repeated measurements in your system or some way to nest the data.

Permdisp has some issues with severely unbalanced designs - like 3 - 5 times or more samples in one group than the other; but it will help. Although, having seen your design, I think the unbalanced aspect is less of a problem than your number of samples.

Yes, that's correct.

I think if you want to get the deep implementation, it can be helpful to go read the code? From my practical perspective i tend to inter the equation as (A + B + A:B) (so, A + B + interaction) and move on from there. That's somewhat aligned with what you see in Figure 7 (page 40), at least based on my reading.

In R, a / indicates a nested variable. So, you can nest blocks based on your treatment. I may have mis interpreted your replicate - I was assuming technical replicates from the same site/time point verses multiple parallel systems the same timepoint. (So, I assume the replicate here was dependent verses independent.) If you have multiple nested factors (like mice in a cage), nesting can help address some of the variance there. So, this may be less relevant given more information about your design.

Coming back to 4...

At your current sample size, or my understanding of your current sample size, changing the pseudocount and transformation isn't going to solve your core problem, which is that compared to your multiple correction penalty, your sample size is too small. I can give you lots of studies where they can't detect individual features as differences (and plenty more where I dont trust the features they found because of their sample size and methodology).

If you want to take your time point into difference abundance, you might look at this recent thread about building nested models using LMEs (accounting for your temporal variation). My recent experience with this is that they also tend to be relatively low power and you may not detect anything significantly different.

I think, in general, microbiome differential abundance has swung more conservative.

Best,
Justine