Making sense of diversity estimates

devonorourke · March 21, 2019, 8:40pm

What would you make of a dataset that, when you run said OTU table through Vegan to calculate distances using defaults (Method==Bray, Binary = FALSE) and depending on whether you filter based on per-sample ASV abundances, you get very different pictures?

For example:
Data is processed with Cutadapt and DADA2 defaults, then...
Situation-1. incorporate all reads of any abundance > 1
Situation-2. Require per-sample ASVs abundance > 100
Now calculate distances, and run metaMDS...

The distances for (1) all look super close to 0. The NMDS plot looks like a dart board from a professional dart player.

The distances for (2) are varied, and the NMDS plot is scattered as if I was throwing said darts after too many gin and tonics.

My suspicion initially was that there is pervasive low level contamination, and that by filtering low abundance data, I'm getting rid of the common stuff pushing down my distances. But the Bray test is incorporating abundance information, so I'm not convinced that's what's going on.

Perhaps it's worth trying a few other distance metrics? Something like Morisita-Horn weighs the dominant (per abundance) ASVs more heavily, but short of that, I'm not sure how else to explain what I'm observing.

Thanks for your thoughts!

jwdebelius · March 21, 2019, 10:21pm

Hey @devonorourke,

Okay, many thoughts, please bear with me.

First, have you rarified your tables before calculating distance? Per Weiss 2017, you need to rarify for beta diversity. If you're not working on rarefied tables, then sequencing depth is going to affect your observed diversity, especially with something like Bray-Curtis distance.

Second, I have questions about your filtering criteria.

Im not sure if per-sample abundance > 100 means that it has to have 100 reads in all samples (in which case, you're dropping a LOT of ASVs and should check your numbers), if it means it has to have > 100 counts total, or some composite.

I'm all for filtering out low abundance or low noise signals. I might check the distribution of my ASVs first, to see where things get cut off. Don't have one readily available, but I sometimes plot the prevalence (fraction of samples where the ASV is present) vs the non-zero average or non-zero sum and then try different filtering criteria to see how many ASVs I keep. It's semi stupid and not always the most effective, but it at least gives a sense of whether there are a lot (or a few) low abundance, high prevalence features, or if they all follow a similar distribution.

So, look at your filtering criteria? And, if Im reading this wrong, ignore this, too!

A Procrustes plot might help with the diagnostics, as well, since that should tell you how well the points map back to each other. It will let you determine which point in your NMDS 1 is an outlier, and once you identify that, I suspect you'll move back to your G&T player .

You can also always run a Mantel to look at the correlation between the two distance matrices, again, helping you determine if your filtering is removing low frequency noise or destroying semi-expected patterns.

You could try other metrics, but I think you need to sort out rarefaction, filtering, and potential outlier(s) first.

Good luck!

devonorourke · March 22, 2019, 12:26am

Thanks @jwdebelius - glad a pro jumped into the discussion right away!

Yes, data are rarefied. DADA2-filtered total abundances ranged between about 10-100k; I rarefied down to just under 10k to keep the majority of samples. I've done some internal comparisons of bat guano COI data with and without rarefying data, and couldn't agree more about the need to rarefy.

The 100-read thing... This is the equivalent of passing --p-min-frequency 100 in the filter features function in QIIME2 (I just do it in R directly). I'm saying I convert any element of the ASV table to 0 if that element is less than 100.
It's extreme and not something I do normally. I've read plenty of convincing threads in this forum about why dropping out low abundances with a flat integer is wishful thinking a best, but in this case, it's me trying to apply an extreme measure to try to understand what's going on with the data. Not saying it's the right way, just saying it's one way.

For what it's worth, there are about 5000 ASVs in this dataset (of which there are just under 300 total samples). The abundance filtering drops that value down to just over 2000 ASVs, which surprised me - I thought it would have dropped that down to a few hundred, not a few thousand. Another way to think of it: the unfiltered ASV table contained just under 20,000 elements with some data (ie an ASV observed in some sample with at least 1 read); after applying that minimum read abundance filter, there are still over 8,000 observations. Sure, it's tossing out lots of low-abundance info, but it's not tossing like 90% of it.

Adding to the issue: this isn't microbiome data, it's COI data from bat guano - these amplicons represent the arthropods in the bat diet, not the microbiome in the bat poop. The expectation that a lot of low abundance data should be there isn't necessarily true. It's a whole new frontier really - arthropod COI research has generally only discussed things in presence/absence contexts, so I can't speak to what normal looks like. Unfortunately his project didn't include any positive controls so I don't have a good way to evaluate what that low-level noise looks like (I have done this extensively in other projects, and it something in the neighborhood of 5 to 50 reads seems pretty normal for most of our MiSeq runs). So I hear you on how to think about filtering data in creative ways - I think I've tried plotting data like these until I've exhausted every single ggplot parameter...

One way I look at it is thinking about it in terms of ASVs occurrence of detection and read depth. Colors highlighted in red are those ASVs that made it past the cut; those in black are those that don't. As you would expect, most of these ASVs that don't make the cut don't occur in many samples, though some outliers up the y-axis are ones I tend to focus on as potential contaminants...

Another not particularly revelatory plot: the distribution of read abundances between samples and (negative) controls... it's dominated by low abundance reads. Applying a minimum 100 read filter to the data just shifts the distribution for either sample type:

Thanks for the idea about Mantel test - makes total sense; I haven't ever run Procrustes before so will give that a shot too.

jwdebelius · March 22, 2019, 10:42am

Hi @devonorourke,

Mastery might just be a matter of procrastination on everything else Im supposed to be doing. Although, Tony Walters (a legend on the QIIME 1 forum) got his God status from procrastinating on his dissertation, so it might be a long tradition.

I think the issue is your filtering, which is more of a data transformation step than an ASV filter.

What I'm guessing is that you have R-code which is something like this (please correct me):

t_unrare_filt <- copy(table_unrare)
t_unrare_filt[t_unrare_filt < 100] <- 0
t_unrare_filt <- t_unrare_filt[rowSum(t_unrare_filt > 0),]

Where I'm assuming the standard biom format of observations as rows and samples as columns. (Please note that I didnt actually try to run it, so it may be kinda wrong).

The --p-min-freqency is running a different function, which would be something like this in R:

t_unrare_filt <- table_unrare[rowsum(table_unrare) > 100,]

You're zero-inflating your data on a per-sample basis, doing some sort of transformation that I think isn't want you want. You could do something like the average needs to be more than 100, or something. This isn't really filtering, it's a transformation step.

Im guessing based on the distribution of your data that you do have a lot of flow frequency things that are not contaminants. (Thanks for the beautiful plots!) Your contaminant highlighed sequences make sense! But, you want to filter at a per-feature level, rather than making your data, because your contaminants should be consistent across all samples and therefore should have some characteristics.

At some point, in another thread, I really want to pick your brain more about this. For now, just... cool!

devonorourke · March 22, 2019, 10:49am

The R transformation...

I take the native .qza feature table and import it into R with the qiime2R package.
I then transform that into a matrix, and then transform again into a long-format data.frame, where every row represents an element of the initial matrix (so each row is single abundance value on a per-sample, per-ASV basis).

Here's the little function:

tableimport.function <- function(table){
  featuretable <- read_qza(table)
  mat.tmp <- featuretable$data
  df.tmp <- as.data.frame(mat.tmp)
  df.tmp$OTUid <- rownames(df.tmp)
  tmp <- melt(df.tmp, id = "OTUid") %>% filter(value != 0)
}

The whole requiring 100 reads filtering thing is applied at that last line - the filter(value ...) parameter. I think, yes, it's just like you suggested - I'm applying this on a per-sample basis. I change that to filter(value >= 100) for my minimum read abundances used in this discussion.And I agree that it absolutely results in a more zero-inflated matrix for the sake of distance measures.

Oh, I should add that this transformation is happening on non-rarefied data. I take this filtered dataset and then rarefy, and then pass that rarefied data into the distance estimates.

I'll look into thinking about an average weighted filtering step next and let you know what I find. Thanks again for the ideas.

Oh, and I'm always available to chat about bat crap. Or bird crap. Or even ancient mud samples. My boss throws a lot of crap my way so long as there are suspected bugs in there...

jwdebelius · March 22, 2019, 10:45pm

Ohh! That's a shiny R function. But, it's also not filtering your data in the way that you're intending and is, essentially, destroying a lot of patterns. I understand you're working in a new area, but it also just seems... not great for looking for patterns?

Right now, my main question is about species assignments and the like! But, I'll probably bug you about that next week when Ive finished procrastinating my current project into its next stage and I have more things to procrastinate on!

devonorourke · March 22, 2019, 10:54pm

Thanks @jwdebelius,
I can see thanks to your comments that my filtering strategy didn't reflect my goal.
I should add that I learned today that by dropping a single sample (of just under 300), the NMDS ordination all of a sudden started, well, looking normal. Don't mean to presuppose the answer - just mean to say that the distances went from basically all 0 (and one sample 0.99) to scattered across the two axes.
Apparently there is one really strange sample, and I'll now spend this weekend trying to figure out what is up with that outlier.

Long forward to a crappy discussion about guano any time!

jwdebelius · March 23, 2019, 8:15pm

@devonorourke,

I love those outliers! I hope you find something good about it. (Although I will also admit to having looked at my data more than once and just said, "Its an outlier, not sure why, but it is and look, when its gone, life makes sense, so...).

jwdebelius · November 28, 2022, 8:55pm

An off-topic reply has been split into a new topic: Why doesn't QIIME 2 have NMDS?

Please keep replies on-topic in the future.