how to analyze this crap, or more specifically, two piles of differently sized craps.

devonorourke · July 30, 2020, 12:47pm

Hi everyone,

Let's talk about bat poop!

For those of you not regularly knee deep in guano, know that bat poop looks basically just like little mouse or rat poops. I mention this because we generated sequence data from guano in one of two ways:

A single pellet was collected and DNA was extracted from that single pellet.
A bunch of pellets, maybe 4-5 per "pool", were smushed together (technical term), and DNA was extracted from that pool.

In both cases, we amplified a marker gene, COI, from every sample, and now I'm left with a pile of sequence data from these samples. We use this COI marker gene to investigate what the bats are eating, since these primers primarily target arthropod COI.

The design of this experiment is pretty simple. Samples were collected:

from two separate locations (Egner and Hickory)
monthly, over three different collection months (June, July, September)

With that in mind, I'm curious about how to proceed with a basic alpha diversity analysis. It seems pretty clear to me from the picture below that the observed richness, as well as Shannon's or Simpson's diversity values, differ when I separate the pooled from the single sample types. Note that the x axis represents the two different sites, and each point represents a sample's alpha diversity value:
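For anyone less familiar with the three metrics named above, here is a minimal sketch of how they can be computed from one sample's feature counts. The counts are made up, the function name `alpha_diversity` is hypothetical, and this is not the pipeline used to produce the values in this thread. Tools differ in the log base used for Shannon and in which form of Simpson's index they report, so treat the choices below as assumptions.

```python
import math

def alpha_diversity(counts):
    """Return (observed richness, Shannon index, Simpson index) for one sample.

    counts: per-feature read counts (one entry per OTU/ASV).
    Assumes the sample has at least one read.
    """
    present = [c for c in counts if c > 0]
    total = sum(present)
    props = [c / total for c in present]
    richness = len(present)                          # number of features observed
    shannon = -sum(p * math.log(p) for p in props)   # natural log (an assumption)
    simpson = 1.0 - sum(p * p for p in props)        # Gini-Simpson form, 1 - sum(p^2)
    return richness, shannon, simpson

# Hypothetical counts, not data from this study.
even_sample = [25, 25, 25, 25]
skewed_sample = [97, 1, 1, 1]

print("even:  ", alpha_diversity(even_sample))
print("skewed:", alpha_diversity(skewed_sample))
```

Both hypothetical samples have the same richness, but the skewed one scores lower on Shannon and Simpson, which is why richness and the evenness-weighted metrics can disagree about which factors look significant.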

What's curious to me are the ANOVA results. Let's say I try to model the data where all of it is grouped together regardless of whether the sample came from a single or pooled form (what I'm calling a BatchType in the model below):

alpha_diversity ~ Site * Month * BatchType

In this case, I observe significant main effects for Site, Month, and BatchType. I also see a significant Month:BatchType interaction. The plot above would suggest to me that this is due to those pooled samples from HB (i.e. the Hickory site).
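As a rough illustration of what a Month:BatchType interaction means in practice, the sketch below compares the pooled-minus-single gap in mean richness within each month. The numbers are invented, the helper name `pooled_minus_single_by_month` is hypothetical, and this is only a descriptive check on cell means, not a replacement for the ANOVA.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (month, batch_type, richness) values, invented for illustration.
rows = [
    ("June", "single", 10), ("June", "single", 12),
    ("June", "pooled", 15), ("June", "pooled", 17),
    ("July", "single", 11), ("July", "single", 13),
    ("July", "pooled", 26), ("July", "pooled", 28),
]

def pooled_minus_single_by_month(rows):
    """Mean richness gap (pooled - single) within each month."""
    groups = defaultdict(list)
    for month, batch_type, value in rows:
        groups[(month, batch_type)].append(value)
    months = sorted({month for month, _ in groups})
    return {
        m: mean(groups[(m, "pooled")]) - mean(groups[(m, "single")])
        for m in months
    }

gaps = pooled_minus_single_by_month(rows)
print(gaps)
```

If the gap were about the same in every month, BatchType would behave like a simple additive offset. A gap that changes from month to month, as in these made-up numbers, is the pattern an interaction term picks up.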

But grouping these data and controlling for the effect of BatchType isn't something I really want to bother with. Would it be smarter to treat these data as two completely independent studies? In other words, would it be more appropriate to model them separately like:

analysis of single samples:
alpha_diversity_single ~ Site * Month

analysis of pooled samples:
alpha_diversity_pooled ~ Site * Month

I've seen posts before mentioning the need to control for effects like sequencing platform or sequencing run, and I understand the merit of trying to control for that. In my case, I expect, and observe, higher diversity in the rarefied data from my pooled samples.

Indeed, when I split up the data and analyze the single and pooled samples separately, I see different significant main effects from the initial model where I include BatchType, as well as between the two models where these data are split:

alpha_diversity_single ~ Site * Month

significant effect for Month and Site for Observed OTUs (richness), but...
NO significant effects for Month or Site for Shannon's/Simpson's

alpha_diversity_pooled ~ Site * Month

significant effect for Site but NOT Month for Observed OTUs (richness), yet ...
significant effects for Month or Site for Shannon's/Simpson's

(compared with the original combined):
alpha_diversity ~ Site * Month * BatchType

significant effect for Month and BatchType for Observed OTUs
significant effect for Site and Month and BatchType for Shannon's and Simpson's

Splitting the samples into their BatchType groups, versus keeping them in one big group and controlling for BatchType as a main effect, paints two different pictures in my mind, but both appear to point to the pooled samples as driving the differences in alpha diversity.

I'd love to hear others' thoughts on how they might tackle this problem. Thanks for your help with all this crap.

Nicholas_Bokulich · August 6, 2020, 6:48pm

Hey @devonorourke,

Since nobody else has answered your question, I'll give my two cents.

Sampling method clearly exerts a strong bias here. That is interesting in and of itself, as it suggests to me that either (a) the pooled samples may come from different species (for which we'd expect higher beta diversity to drive higher alpha diversity on pooling), (b) there is a high degree of inter-individual variation, which I find slightly surprising (knowing only a little about the social behaviors of bats), or (c) pooling incorporates more exogenous contaminants.
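To make the link between inter-individual variation and pooled alpha diversity concrete, here is a toy sketch using made-up prey sets per pellet. The taxon labels and the helper `pooled_richness` are placeholders, and the sketch deliberately ignores read depth and rarefaction, so it only shows the direction of the effect.

```python
def pooled_richness(pellets):
    """Richness of a pool: number of distinct prey taxa across all pellets."""
    return len(set().union(*pellets))

# Hypothetical prey taxa per pellet (placeholder labels, not real detections).
similar_diets = [
    {"moth_A", "beetle_B"},
    {"moth_A", "beetle_B"},
    {"moth_A", "beetle_B"},
]
distinct_diets = [
    {"moth_A", "beetle_B"},
    {"fly_C", "moth_D"},
    {"beetle_E", "fly_F"},
]

for label, pellets in [("similar", similar_diets), ("distinct", distinct_diets)]:
    mean_single = sum(len(p) for p in pellets) / len(pellets)
    print(label, "| mean single-pellet richness:", mean_single,
          "| pooled richness:", pooled_richness(pellets))
```

When the pellets share the same prey, pooling adds nothing to richness. When each pellet carries different prey (high beta diversity among pellets), the pool is far richer than any single pellet, which matches explanations (a) and (b).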

Bottom line is that such a strong bias would limit your ability to use pooled and single samples in the same analysis. You should either analyze single or pooled, but don't mix them in a single test.

Why you see higher alpha diversity in HB in pooled samples is indeed interesting... maybe you have more mixed-species colonies in HB? And hence higher inter-individual variation (since the population includes members of different species)? Or just higher inter-individual variation for some reason (more diverse diets?). Cool stuff!

devonorourke · August 6, 2020, 7:09pm

Thanks @Nicholas_Bokulich,
It's really a mess when you think about how these collections are generally conducted: you either sample a single piece of guano from a giant pile and process it as an individual sample, or you sample a bunch of guano pellets from a giant pile and treat them as a pooled sample. In both cases, there's a risk of contamination from adjacent turds touching the guano you collect, right? In fact, I drove myself crazy with this experiment trying to figure out the proportion of bat COI sequences in a given sample, because it turns out the single samples had mixtures of multiple bat species more often than pooled samples!

The plot below is a bit odd to consider at first. The guano samples are faceted into two horizontal charts: single guano samples on top and pooled guano samples on the bottom. There were three species of bats in the study. The color scheme indicates whether a particular sample contained just that single bat species (colored green) or more than one bat species (colored brown). I'm realizing now, years later, that a brown dot is always double counted, so be aware of that (for example, if both M. sodalis and M. lucifugus were detected in a single sample, there would be a brown dot in each of those species columns).
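The double counting is easy to see with a tiny tally. The sample IDs and detections below are invented purely to show the arithmetic; only the two species names come from the example above.

```python
from collections import Counter

# Hypothetical detections: sample ID -> set of bat species found in that sample.
detections = {
    "s1": {"M. sodalis"},
    "s2": {"M. sodalis", "M. lucifugus"},
    "s3": {"M. lucifugus"},
    "s4": {"M. sodalis", "M. lucifugus"},
}

dots_per_species = Counter()   # dots drawn in each species column
mixed_dots = 0                 # brown dots drawn across all columns
for species_set in detections.values():
    for species in species_set:
        dots_per_species[species] += 1
    if len(species_set) > 1:
        mixed_dots += len(species_set)

# Each mixed sample is one sample, however many columns it appears in.
mixed_samples = sum(1 for s in detections.values() if len(s) > 1)

print("samples:", len(detections), "| dots drawn:", sum(dots_per_species.values()))
print("mixed samples:", mixed_samples, "| brown dots:", mixed_dots)
```

So the number of brown dots overstates the number of mixed samples, and counting unique sample IDs is the safer way to compare the single and pooled panels.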

I didn't circle back to determine whether most of these mixed samples came from the HB site, though that might indeed be the case.

My judgement was to move forward analyzing only the single samples, as there were almost 200 of those single samples, compared to just about 70 pooled samples. Kind of a bummer to throw away interesting data, but for the purposes of this study, simpler and more straightforward was the better option.

Thanks for your considerations!

Nicholas_Bokulich · August 6, 2020, 7:16pm

Agreed, it is a bummer, but it seems like the safest approach.