@fgara, here’s a sequencing depth distribution from ~1500 samples across five runs in the same study. All fecal, same equipment, same protocols, kits, and subjects in a controlled environment. Samples with <1000 features have been dropped (these were mostly controls). Depth ranges from ~1000 to 200,000+ reads per sample. Though it represents more samples, the curve isn’t too different from the distribution you have: a very-roughly-bell-shaped curve followed by some outliers.
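If it helps, here’s a minimal sketch of how a distribution like this can be computed. It assumes the feature table has been exported to TSV with samples as rows and features as columns; the file name and the 1000-feature cutoff are just placeholders matching what I described above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Feature table: rows = samples, columns = features (read counts).
# "feature-table.tsv" is a placeholder file name.
table = pd.read_csv("feature-table.tsv", sep="\t", index_col=0)

# Drop samples observing fewer than 1000 features (mostly controls for us).
table = table[(table > 0).sum(axis=1) >= 1000]

# Per-sample sequencing depth = total reads assigned to each sample.
depths = table.sum(axis=1)

plt.hist(depths, bins=60)
plt.xlabel("Reads per sample")
plt.ylabel("Number of samples")
plt.show()
```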
I would like to stress that my work is on the software side, and I’m not much of a bioinformatician. There’s probably extensive literature on this, but I’m not familiar with it. I also don’t have the lab background to tell you why this might be normal, but for our workflow, these results are not unexpected. I suspect that, as with any complex procedure (sample collection, sample prep, storage, extraction, sequencing, etc.), there are many opportunities for small things to impact the sequencing depth of a given sample. Or the diversity. Or even the sequences present.
As you suggested, normalization can help us reduce the impact of some of these issues and make our data more useful. Collecting comprehensive metadata is also critical for identifying and correcting for biases. We can quantify the degree to which, for instance, one extraction has different characteristics from the other extractions, or recognize when samples stored in one freezer failed to sequence the way we expected. Comprehensive metadata allows us to answer not only our experimental questions, but also questions about the validity of our data: where bias might have crept into our process, and whether, for example, all of our samples are actually useful/valid.
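For concreteness, here’s a hedged sketch of one way to ask that kind of question of the metadata. The metadata file name and the `extraction_batch` column are hypothetical; the idea is just to compare per-sample depth across a batch variable recorded in the metadata.

```python
import pandas as pd
from scipy import stats

# Same placeholder file names as the sketch above.
table = pd.read_csv("feature-table.tsv", sep="\t", index_col=0)
metadata = pd.read_csv("sample-metadata.tsv", sep="\t", index_col=0)

# Per-sample depth, joined to a hypothetical "extraction_batch" column.
depths = table.sum(axis=1).rename("depth")
merged = metadata.join(depths, how="inner")

# Omnibus check: do depth distributions differ across extraction batches?
groups = [g["depth"].to_numpy() for _, g in merged.groupby("extraction_batch")]
stat, p = stats.kruskal(*groups)
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p:.3g}")

# Per-batch summaries usually point at the offending batch more directly.
print(merged.groupby("extraction_batch")["depth"].describe())
```

A non-parametric test like Kruskal-Wallis is a reasonable default here since depth distributions tend to be skewed, but in my experience the per-group summaries are what actually point you at the problem batch.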