Seeking alternatives to rarefaction for alpha diversity analysis

Dear all,

I have several samples with very different sampling depths. During the rarefaction step I would either lose about 25% of my samples or I would need to use a sampling depth of 1000 to retain roughly 90% of them. I’m wondering if there is another way to calculate alpha diversity without losing so many samples?

Hi @asmaamorsi,
All diversity metrics should be generated after applying an even sampling depth so that varying sequencing depths don't affect your diversity estimates.

q2-boots is a new QIIME 2 plugin that performs rarefaction (resampling n times) instead of rarefying (subsampling once). Because q2-boots resamples repeatedly rather than relying on a single subsample, you can get away with a lower sampling depth and retain more samples.

6 Likes

Hi @asmaamorsi and @cherman2,

Can I jump in too?

There are no hard and fast rules around sequencing depth, but 1,000 reads is a depth I work with a lot. Is there a reason you think it’s too low for your work? Did you start with a lot more reads and lose some in processing? What is your depth distribution, and what do your rarefaction curves look like?

I personally wouldn’t have an issue with a 1K rarefaction depth as an analyst, supervisor, reviewer, or editor, if due diligence was done to show it was the happy medium for the data set.

Best,
Justine

4 Likes

I'll just add that combining these two suggestions - using a sampling depth of 1000 with the q2-boots core-metrics command - could be a good way to go (pending the due diligence that @jwdebelius is suggesting).
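
For example, something along these lines. Treat this as a sketch: q2-boots is new, so the parameter names here are my assumptions and should be checked against `qiime boots --help`.

```
# Sketch: run core diversity metrics at a 1,000-read depth, resampling the
# table 10 times and averaging, instead of rarefying once.
# Parameter names are assumptions; verify with `qiime boots --help`.
qiime boots core-metrics \
  --i-table table.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-sampling-depth 1000 \
  --p-n 10 \
  --p-no-replacement \
  --output-dir boots-core-metrics-1000
```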

2 Likes

Very timely discussion, as we just got back some reviewer comments. They had an issue with our ~6k reads across 60-70 libraries. I'm working on a response, so I'm curious if you would expand a bit on what you would include as your “due diligence”. Thanks!

1 Like

Hi @bsteve1120,

I hope I'm interpreting this correctly: your reviewer is concerned you have 6,000 reads for 60 samples, as in ~100 reads/sample? I don't blame them, I would be annoyed, too.

Let's start with the fact that there is an approximately fixed number of reads in a sequencing run. So, imagine we have a sequencing run with 1,000,000 reads. The average number of reads per sample (if my pools are at equal concentration) is going to be that 1,000,000 reads divided by the number of samples. So, I could spend all 1M on a single sample. (This feels like a waste of money to me, but YMMV.) I could put on 10 samples and get 100,000 reads each. I could do 100 and get 10,000 reads each, etc. I'm assuming that you didn't multiplex into oblivion, but if you did, there's still a maximum number of reads available on a run.

In terms of checking depth, I recommend following the reads across steps.

In q2-demux, there's a summarize function that will tell you how many reads there are. Start with your demultiplexed data and look at every processing step. How many reads do you have at the beginning? We can't rescue data that isn't there. If you trimmed primers, how many do you lose? When you denoise, what do your denoising stats look like? Are you seeing big drops in numbers in any of those steps?
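
In command-line terms, that tracking might look something like this (the artifact file names are just placeholders for your own data):

```
# Per-sample read counts right after demultiplexing
qiime demux summarize \
  --i-data demux.qza \
  --o-visualization demux-summary.qzv

# If you trim primers (e.g., with q2-cutadapt), summarize the trimmed reads
# too, so you can see how many were lost at that step
qiime demux summarize \
  --i-data trimmed-seqs.qza \
  --o-visualization trimmed-summary.qzv

# Denoising stats from DADA2: reads passing filtering, denoising, merging,
# and chimera removal, per sample
qiime metadata tabulate \
  --m-input-file denoising-stats.qza \
  --o-visualization denoising-stats.qzv
```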

IIRC, you should be able to concatenate the demux summaries and maybe the DADA2 summary into a single tabular file. I'm relatively visual, so sometimes I will plot the average proportion of reads lost at each step, so I can see which step, if any, is giving me a big drop-off.
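
If you want the numbers outside of QIIME 2 for that kind of plot, `qiime tools export` will give you the underlying tables (the exported file names depend on the artifact type):

```
# Export the DADA2 denoising stats to a plain TSV (typically stats.tsv)
qiime tools export \
  --input-path denoising-stats.qza \
  --output-path denoising-stats-export
# The TSV can then be joined with the per-sample counts from the demux
# summaries (in a spreadsheet, R, or pandas) to plot the proportion of
# reads surviving each step.
```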

I would predict :crystal_ball: that you're losing reads either at primer trimming, in quality filtering, or at read joining. The loss could also come from taxonomic annotation, if that's a filter you use. (For example, dropping any read that doesn't have at least phylum-level annotation.) Depending on where your reads are getting lost, there are different solutions to look at.

If I misinterpreted, I'm happy to talk about the other case.

Best,
Justine

Hi Justine,

No, sorry, I wasn’t clear. That was 60-70 libraries, rarefied to 6,300 reads per library. I was just curious to hear what you would consider due diligence to ensure that your choice of a cutoff was appropriate.

I would have a hard time justifying 100 reads as a cutoff too!

Thanks,

Brad

Hi @bsteve1120,

I'm glad! Sorry, it is the end of my year and I have burned through some of the optimism I entered with. That's a lot easier to do due diligence on.

I tend to think of your rarefaction depth as a balance between sample retention (managing type I error) and feature retention (type II error). I tend to approach it in three ways. First, look at your feature table summary. If you pick different depths, how many samples are you retaining? Have you lost 50% of your samples at 6.3K? (If yes, go shallower.) Do you only drop 1-2 more samples going to 10K? If so, are those worth it? You could potentially include this as a table or curve in your response to reviewers to show that your depth is a sweet spot for what you want to do.
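
For that first check, the feature table summary has an interactive sample detail tab with a sampling depth slider that shows exactly how many samples you keep at each depth (file names below are placeholders):

```
qiime feature-table summarize \
  --i-table table.qza \
  --m-sample-metadata-file sample-metadata.tsv \
  --o-visualization table-summary.qzv
# Open table-summary.qzv (e.g., at https://view.qiime2.org) and use the
# "Interactive Sample Detail" tab to compare retention at 6.3K vs 10K.
```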

Second, look at your alpha rarefaction curves. I would use Shannon here. Observed features and other richness metrics tend to keep gaining richness as depth increases because denoising isn't always as perfect as we wish it was. So, Shannon lets you measure richness while de-emphasizing the rare features you're picking up. (You can do richness too; your argument will just be weaker.) Have your curves plateaued at 6.3K?
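
Something like this will generate those curves for both Shannon and observed features (the exact metric names can differ a little between QIIME 2 versions):

```
qiime diversity alpha-rarefaction \
  --i-table table.qza \
  --p-min-depth 500 \
  --p-max-depth 10000 \
  --p-metrics shannon \
  --p-metrics observed_features \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization alpha-rarefaction.qzv
# Look for the Shannon curve to flatten out well below your chosen depth.
```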

Finally, there are a couple of sensitivity analyses you could run. Pick three depths, based on what the reviewer wants. I'd probably do one lower, one higher, and your current depth. Pick samples that are above those depths and use a tool like q2-boots to run a couple of alpha and beta diversity metrics at each depth. Look at the correlation between your alpha diversity metrics; I'm guessing it's going to be >90%. Check the relationship between your samples in PCoA space using a Mantel test (q2-diversity mantel), which gives you the correlation between two distance matrices, and/or a Procrustes analysis (q2-diversity procrustes-analysis), which shows you their relative relationship in PCoA space. I'd put money on those relationships being relatively stable over depths where your rarefaction curve has plateaued.
In general, my experience has been that the relative relationship between samples is stable enough across depths compared to other factors that can have an influence (rarefaction iteration, for example).
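
As a rough sketch, assuming you've run q2-boots (or core-metrics) at two depths and the output directories contain the usual Bray-Curtis artifacts, the comparison might look like this; the directory and artifact names are placeholders, so adjust them to your own outputs:

```
# Correlation between the two distance matrices; --p-intersect-ids keeps
# only the samples present at both depths
qiime diversity mantel \
  --i-dm1 run-depth-a/bray_curtis_distance_matrix.qza \
  --i-dm2 run-depth-b/bray_curtis_distance_matrix.qza \
  --p-method spearman \
  --p-intersect-ids \
  --o-visualization mantel-depth-a-vs-b.qzv

# Procrustes comparison of the corresponding PCoA results
# (run on the same set of samples at both depths)
qiime diversity procrustes-analysis \
  --i-reference run-depth-a/bray_curtis_pcoa_results.qza \
  --i-other run-depth-b/bray_curtis_pcoa_results.qza \
  --output-dir procrustes-depth-a-vs-b
```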

You will always get people who go "rule of thumb says...", and that rule of thumb is usually built on their personal experience with data and biases.
I'm currently having the same struggle in a few of my papers, so hopefully this works for both of us!

Best,
Justine

3 Likes

Thanks Justine,

We are both on the same page on this issue (now that I've been more clear). Thank you for sharing your “approach”. It's always good to discuss these things because you never know what new ideas you might pick up. The Shannon diversity rarefaction curve is flat for all samples at our chosen cutoff, and it would be at a somewhat lower cutoff and of course at a higher one too. That was definitely one of the truth serums we used as rationale for choosing our cutoff in the first place.

This was very helpful. q2-boots is new to me, so I will definitely check it out. I really like the idea of comparing the PCoAs and running q2-diversity mantel at different cutoffs. Gonna have to try that one too.

Cheers,

Brad

2 Likes