High sample size heterogeneity and CoDa Analysis

I have come back to amplicon analysis recently and am trying to keep up with new developments.
Here is a fairly simple issue with the typical size variability of amplicon data sets when performing CoDa analyses. Imagine this extreme example of a dataset (displayed are frequencies of reads per sample):

The SD of reads per sample exceeds the mean (40,000 vs 35,000), and 28 samples (out of 150) are more than 1 SD from the mean, with the largest sample being 170 times larger than the smallest. Thousands of extra zeros are introduced because of the very large samples.
When doing a CLR transformation, one would replace all these zeros with a small constant, and all these replacement values may still accumulate to a very large amount of non-zero "information".
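
To make the concern concrete, here is a minimal sketch of a CLR transform with simple zero replacement, written in plain NumPy. The function name `clr_with_pseudocount`, the pseudocount of 0.5, and the toy table are all assumptions for illustration, not the behaviour of any particular plugin.

```python
import numpy as np

def clr_with_pseudocount(counts, pseudocount=0.5):
    """CLR-transform a samples x features count table.

    Zeros are replaced by a fixed pseudocount (an assumed value)
    before taking logs, which is the naive replacement strategy.
    """
    counts = np.asarray(counts, dtype=float)
    # Replace every zero with the pseudocount.
    replaced = np.where(counts == 0, pseudocount, counts)
    log_vals = np.log(replaced)
    # Subtract each sample's mean log value (log of its geometric mean).
    return log_vals - log_vals.mean(axis=1, keepdims=True)

# Toy table: a shallow sample and a deep sample, 4 features.
table = np.array([
    [10, 0, 5, 0],
    [1000, 20, 500, 3],
])
clr = clr_with_pseudocount(table)

# Fraction of values in each sample that come from replaced zeros.
zero_fraction = (table == 0).mean(axis=1)
```

In the toy table, half of the shallow sample's CLR values come from the pseudocount rather than from observed reads, which is the accumulation of artificial "information" described above.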

My question is whether there is a best-practice approach to handling these situations of high heterogeneity in sample sizes with CoDa, apart from resequencing.
Do you remove all the rare features? Do you remove the very large samples? What is an acceptable range of ratios between small and large samples? Or doesn't this matter at all in log-ratio transformations?

(I am aware that these messy datasets were also a problem in pre-CoDa times, and this was hotly debated a decade ago, but I rarely see this discussed in the CoDa literature.)

Hi @nouse1234,

The answer to this overall question is "It depends":tm:.

I don't think there is a best practice, but I can offer some of my approaches and reasoning.

I think the first step is to remove your very shallow samples. I tend to set this threshold at my rarefaction depth, but if you're only doing CoDa and not something like alpha diversity, you may not have one. I think a lot of tools (DEICODE, ANCOM-BC) go for 500 sequences/sample; my personal minimum tends to be 1000, and TBH, that's probably too low.
Throwing out the super low depth samples gets rid of zeros and gives more confidence in the remaining samples.
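
As a rough sketch of that depth filter in plain NumPy (not the QIIME 2 implementation): the function name `filter_shallow_samples` and the toy table are made up for illustration, and the 1000-read default just mirrors the minimum mentioned above.

```python
import numpy as np

def filter_shallow_samples(counts, min_depth=1000):
    """Drop samples (rows) whose total read count is below min_depth."""
    counts = np.asarray(counts)
    depths = counts.sum(axis=1)   # total reads per sample
    keep = depths >= min_depth    # boolean mask of retained samples
    return counts[keep], keep

# Toy table: 3 samples x 3 features with depths 500, 1250 and 25010.
table = np.array([
    [400, 100, 0],
    [900, 300, 50],
    [20000, 5000, 10],
])
filtered, kept = filter_shallow_samples(table, min_depth=1000)
```

Here only the first sample falls below the cutoff, so `filtered` keeps the last two rows.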

Before CoDa, I will also filter out low-prevalence features. My tendency is to require at least 10% prevalence, partially because I think that's a threshold I can use to make statistical inferences if I want to do prevalence testing. Whether or not you want to combine that with a limit of detection is up to you. (In q2-feature-table, there's a function called filter-features-conditionally that I tend to use. It's not perfect for dealing with depth, but it very much helps.)
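
The prevalence filter can be sketched the same way. This is plain NumPy, not the q2-feature-table code; the name `filter_low_prevalence` and the toy data are hypothetical, and "present" is assumed to mean a count greater than zero (no limit of detection applied).

```python
import numpy as np

def filter_low_prevalence(counts, min_prevalence=0.10):
    """Drop features (columns) present in fewer than min_prevalence of samples."""
    counts = np.asarray(counts)
    # Fraction of samples in which each feature has a non-zero count.
    prevalence = (counts > 0).mean(axis=0)
    keep = prevalence >= min_prevalence
    return counts[:, keep], keep

# Toy table: 20 samples x 3 features.
table = np.zeros((20, 3), dtype=int)
table[:, 0] = 50    # present in every sample (100% prevalence)
table[0, 1] = 7     # present in 1 of 20 samples (5%)
table[:4, 2] = 12   # present in 4 of 20 samples (20%)

filtered, kept = filter_low_prevalence(table, min_prevalence=0.10)
```

With a 10% cutoff, the feature seen in a single sample is dropped and the other two are kept.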

I'll also say that empirically, I find that DEICODE handles variation in library size really well. It's not perfect, but because of the magic :sparkles: in zero handling, it works better. (Any math that is sufficiently complex is indistinguishable from magic to me... linear algebra is about my threshold, but the docs for the plugin are great if you're so inclined.)