Sampling depth - Is my sampling depth ideal?

Hi,

I am trying to choose a sampling depth for my table-deblur and table-dada2. Based on the post "How to decide --p-sampling-depth value?", I need to choose a value that retains a high number of features as well as samples, so that I capture the true diversity.

I chose 20825 for table-deblur. Using this value I retained 2,353,225 (60.13%) features in 113 (81.88%) samples.

I chose 38093 for table-dada2. Using this value I retained 4,837,811 (63.93%) features in 127 (92.03%) samples.

1- I was wondering if someone could take a look at my files and let me know if I am on the right path.
Here are the files:
table-deblur.qzv (491.6 KB)
table-dada2.qzv (555.5 KB)

2- Looking at these files, I noticed that I obtained 3,913,403 features after running Deblur and 7,567,910 after running DADA2. I know from previous posts that this difference is due to the way these two methods handle errors, but I am not sure which one is more reliable. Which one should I choose for downstream analysis?

Thank you for your support.

Hi again @ptalebic!
These are good questions, and have been discussed at length here on the forum. I’ve done my best to answer them again here, but you’ll really want to dig into the resources at the end of this response.

  1. Your sampling depth is quite good, so you have more wiggle room than most studies. You’re mostly right that the goal here is to find a balance between “the most sequences” and “the most samples”. Considering how metadata categories are impacted is critical.
    For example:
    You can keep all of your samples if your depth is 9634. That’s still a fairly large frequency per sample, and if sample collection is expensive, you don’t have many samples, or if the samples you might lose are critical to your study, then this might be a great choice.

If these things don’t impact your study particularly, then you might be able to crank up the depth without sacrificing important samples. At 20825, you’ve lost quite a few samples. You happen to have many samples to lose, and most of the loss is evenly distributed across bins, or comes from heavily-sampled bins. This depth may be OK for you.

Of note, though: at this depth you’ve lost most samples from hs60 and hs61, and you may have disproportionately dropped samples from older patients, which could introduce bias if not handled carefully.

I’d recommend taking a long look at your data using these tables and alpha-rarefaction curves, and making your choices based on your study’s unique needs.
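As a side note, the retention numbers in the rarefaction preview follow directly from the fact that every kept sample is subsampled to exactly the chosen depth. A minimal sketch of that arithmetic (the sample totals here are hypothetical; your real per-sample totals come from table.qzv):

```python
# After rarefying, every kept sample contains exactly `depth` sequences,
# so retained sequences = depth * number_of_samples_kept.
# For example, 20825 * 113 = 2,353,225, matching the numbers above.

def retention(sample_totals, depth):
    """Return (samples kept, total sequences retained) at a given depth."""
    kept = [s for s, total in sample_totals.items() if total >= depth]
    return len(kept), depth * len(kept)

# Hypothetical per-sample sequence totals for 138 samples.
samples = {f"s{i}": 9634 + i * 500 for i in range(138)}
n_kept, n_seqs = retention(samples, 20825)
```

Trying a few candidate depths against your own per-sample totals this way makes the samples-versus-sequences trade-off very concrete.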

  2. Sequence-count differences may be related to how the two tools handle error, but could also be a byproduct of the parameters you chose. Some parameter choices will yield more sequences, and some will yield fewer. If you haven’t already, you can use some combination of trial-and-error and the data in your denoising stats (both tools produce them) to optimize your parameters.

As for which tool to choose, the answer is again, “what does your study need?” DADA2 provides some really nice features - quality filtering, read joining, and sequence repair. A look at the resources below and some searching should give you the background you need to start making the best choices for your study.

  1. Tutorials - the QIIME 2 team, with the help of some awesome community members, has built a solid collection of tutorials. They’re all worth working through, but you’ll probably get the most value right now out of this bit on alpha rarefaction, and the rest of the moving pictures tutorial. This section on denoising may also be useful. Let me know here if there’s anything unclear.

  2. The docs - In addition to useful details on how commands work, plugin documentation includes citation information, so you can look at the papers that describe DADA2 and Deblur directly, and see how they differ.

  3. Existing forum posts can be super useful in figuring out how things work, and why people prefer one tool over another. Plus, the answers are already there, so you won’t get stuck waiting for a response. The magnifying-glass icon :mag: at the top right of the screen will take you amazing places. :mage:

Good luck,
Chris :ant:


Dear Chris,

Thank you for your great explanation. I will start digging into the resources and I will let you know if anything is unclear.


Dear Chris,

I plotted an alpha rarefaction curve. Here is a screenshot of my plot:

The plot shows that diversity levels off at a sequencing depth of 2000. As I understand it, using this depth I can be sure that I will capture most of the diversity. Please correct me if I am wrong.

So should I decrease my sampling depth from 20825 to 2000? However, 9634 is the lowest sequence count among my samples, so I think I should use 9634 as my sampling depth.

I would again appreciate your help.

There are many different perspectives on this. What you choose is largely a matter of personal preference and study data/needs.

You’re interpreting the alpha rarefaction curves correctly - assuming they level off for any combination of metadata category and alpha diversity metric that matters to you, that leveling point can be used as a rough “minimum”. As you suggest, there’s no need to decrease depth to that minimum.
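For intuition, the leveling-off behavior can be reproduced in a few lines: repeatedly subsample a sample's sequences without replacement at increasing depths and count the observed features, which is conceptually what `qiime diversity alpha-rarefaction` does with the "observed features" metric. A rough sketch with made-up counts:

```python
import random

def observed_features(counts, depth, seed=0):
    """Subsample `depth` sequences without replacement; count distinct features."""
    # Expand the feature-table row into one label per sequence.
    pool = [feat for feat, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    return len(set(rng.sample(pool, depth)))

# Hypothetical sample: a few abundant ASVs and a couple of rare ones.
sample = {"ASV1": 5000, "ASV2": 3000, "ASV3": 1500, "ASV4": 120, "ASV5": 14}
curve = [(d, observed_features(sample, d)) for d in (10, 100, 1000, 5000)]
```

Once deeper subsampling stops revealing new features, the curve flattens, and that is the "rough minimum" depth.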

I often use the approach you’ve suggested, choosing the highest possible number of sequences I can without losing a specific sample. Better scientists than I have suggested that this might introduce a little bias (I suspect because your low-count sample will not be subject to random subsampling, while all of your other samples will be).

These folks may select an arbitrary reasonable rarefaction depth - say, 10k reads - and apply that without splitting hairs over a few reads. Even in this case, though, you want to consider the effect that sampling depth will have on your data, in terms of utility and bias, and select a threshold that won’t damage the meta-study if your next data set isn’t this robust.
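To make the subsampling (and the bias caveat above) concrete, here is a sketch of rarefying a single sample to a fixed depth by drawing sequences without replacement. The counts are hypothetical; note that a sample whose total exactly equals the depth passes through unchanged, while every other sample is randomly thinned:

```python
import random
from collections import Counter

def rarefy(counts, depth, seed=0):
    """Rarefy one sample's feature counts to `depth` sequences, or drop it."""
    if sum(counts.values()) < depth:
        return None  # sample would be discarded at this depth
    pool = [feat for feat, n in counts.items() for _ in range(n)]
    rng = random.Random(seed)
    return Counter(rng.sample(pool, depth))

deep = {"ASV1": 8000, "ASV2": 4000}           # thinned down to the depth
shallow = {"ASV1": 6000, "ASV2": 3634}        # total is exactly 9634
assert rarefy(shallow, 9634) == Counter(shallow)  # no actual subsampling
```

This is why setting the depth to the minimum sample's total means that one sample never experiences subsampling noise, while all the others do.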

You probably have enough samples to safely lose a few, but you may not have to lose any. Your decision comes down to balancing “how deep is deep enough for my study” against preserving as many samples as possible. Only you can make that call, but here are a couple of opinions that might help. (1, 2)


Thank you, Chris, for your reply and for providing those two links.
