Sampling Depth (sample specific question)

Hello all!

Firstly, I'd like to thank all the people who put so much effort into making this forum a really interactive and informative one. I started a few days ago without any knowledge of running Linux commands, let alone using qiime2, but I've now finally smoothed out my qiime2 pipeline and am able to perform sample processing and analysis.

For my main question: I've read that there's no universal consensus on how to select a sampling depth, and I've read a couple of posts here about similar questions. I (kind of) understand that the choice is a trade-off between retaining more features while dropping a few (or more) samples from the analysis, and vice versa.

I guess what everyone dreams about is a dataset where even the sample with the lowest frequency sits close to the maximum. Unfortunately, my case is almost exactly the opposite. My samples are bird gut samples collected as a time series (nominally 5 birds per time point, although a few birds yielded no sample or died before collection, so the number of bird samples varies between time points). We sequenced 16S V4 amplicons and profiled the microbiome.

So I generated my summary.qzv and looked at the frequency distribution across my samples. The highest total frequency I have is around 124k (the other samples range from roughly 100k to 120k) and the lowest is 60k (the three lowest are 60k, 89k, and 90k, which, ironically, all come from the same time point).
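For context, I generated that summary with the standard feature-table summarize command; the file names below are just the placeholders I use in my pipeline:

```
qiime feature-table summarize \
  --i-table table.qza \
  --m-sample-metadata-file sample-metadata.tsv \
  --o-visualization table.qzv
```

The per-sample frequencies above are the ones I read off the "Interactive Sample Detail" tab of that visualization.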

As much as possible I want to retain them all, but I've read about cases where people drop samples whose depth is way off from the rest.

So there are two possible scenarios:

A) If I keep the sample with 60k, I'd retain 55% of the features and cover 100% of the samples (the samples for this time point will be n=5)

or;

B) If I drop the sample with 60k and set the sampling depth to 89k instead, I'd retain 76% of the features, but one sample will be dropped from downstream analyses (so my samples for that time point will be n=4 instead of 5, which also makes the number of samples per time point uneven). Both options are sketched below.
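To make the two options concrete, here's roughly how I'd pass each depth to core-metrics-phylogenetic (assuming a rooted tree from the standard workflow; file names are placeholders, and any sample whose total frequency falls below --p-sampling-depth gets excluded automatically):

```
# Scenario A: rarefy at the smallest library (~60k) and keep every sample
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table table.qza \
  --p-sampling-depth 60000 \
  --m-metadata-file sample-metadata.tsv \
  --output-dir core-metrics-60k

# Scenario B: rarefy at ~89k; the 60k sample falls below the depth and is dropped
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table table.qza \
  --p-sampling-depth 89000 \
  --m-metadata-file sample-metadata.tsv \
  --output-dir core-metrics-89k
```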

I've never done this kind of analysis this intensively before, so this is actually my first time dealing with this dilemma. I'm kind of leaning towards Scenario B, but I'd like to hear your opinions based on your experience dealing with the same situation.

Thanks a lot and regards!

Hi @rmbn,
In case you haven't seen it, I shared my thoughts on this topic before here.
In your case, 60k reads as your minimum is very high. I rarely get to work with a study that has 60k as its lowest sampling depth. If I were you, I would just include all the samples and not think twice about it :stuck_out_tongue:

Thank you, this is really informative. I was weighing it because I thought that sample was way off from the rest. My rarefaction curve shows this specific sample has wayyy more observed OTUs than the rest, and a dendrogram I constructed places it as an outgroup. I've decided to keep them all, and I'm glad I can keep all my samples. Thanks a lot for your very informative input; it gives me an idea of how to weigh sampling depth in my future studies. :smiley:
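For reference, the curve came from alpha-rarefaction along these lines, with the max depth set near my largest library (~124k); file names are again the placeholders from my pipeline:

```
qiime diversity alpha-rarefaction \
  --i-table table.qza \
  --i-phylogeny rooted-tree.qza \
  --p-max-depth 124000 \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization alpha-rarefaction.qzv
```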

Hi @rmbn,

Glad you found it useful! It's a rather difficult topic, but it does get easier the more often you deal with it.

If you think there is an oddly inflated number of OTUs in that sample, it may be worth doing some sanity checks. My suggestion that 60k total reads for your lowest sample would be more than sufficient was purely about sequencing depth. Inflated diversity may point to other, unrelated issues, for example: failing to remove primers/barcodes or other non-biological reads, contamination of the sample, a low bacterial-to-host DNA ratio leading to over-representation of host reads, or some odd chimera formation.
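If it helps, the first checks I'd reach for look something like the sketch below: trim primers with q2-cutadapt and filter obvious non-target reads by taxonomy. The primer sequences shown are the EMP 515F/806R pair, so swap in whatever your protocol actually used; the file names are placeholders too.

```
# Strip amplicon primers; read pairs without a primer hit are discarded
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-front-f GTGYCAGCMGCCGCGGTAA \
  --p-front-r GGACTACNVGGGTWTCTAAT \
  --p-discard-untrimmed \
  --o-trimmed-sequences demux-trimmed.qza

# Drop organelle reads (host mitochondria, dietary chloroplast carry-over)
qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude mitochondria,chloroplast \
  --o-filtered-table table-no-organelles.qza
```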

tl;dr: your depth is still good, but you may want to look closer at those outlier samples.

Thank you! I tried cleaning the sequences further, and it turns out that sample had some Illumina Universal and Small RNA 3' adapters (strangely, it was the only sample with those contaminants :sweat_smile:). The OTU count is still a bit off compared with the rest, but the sample's depth increased drastically to 78k (I also used a better clustering method, which I think contributes to that!). I'll certainly keep checking what's going on with that sample as I analyze the results further. Thanks again for the help! :smiley:
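In case it's useful to anyone else reading this, the extra clean-up was essentially another q2-cutadapt pass targeting those 3' adapters. The sequences below are the ones FastQC reports for the Illumina Universal and Small RNA 3' adapters, so double-check them against your own library prep kit; the file names are placeholders:

```
# One --p-adapter-f / --p-adapter-r per 3' adapter sequence to remove
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences demux.qza \
  --p-adapter-f AGATCGGAAGAGC \
  --p-adapter-f TGGAATTCTCGG \
  --p-adapter-r AGATCGGAAGAGC \
  --p-adapter-r TGGAATTCTCGG \
  --o-trimmed-sequences demux-no-adapters.qza
```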
