So, yes, on an absolute “are we seeing the true community” level, read numbers do not directly translate into “this is how many bacteria there are” for several reasons (the data are not truly quantitative, they are compositional, and there are the aforementioned primer bias and copy number bias). It’s also true that most people operate on the counts, or possibly filtered counts, for analysis. The problem is that we’re dealing with a model which is error-prone, and we don’t generally have a better way to correct for the errors we’re seeing.
Our results are seen through the lens of the analysis, and in general, as a community, most people have chosen to use the same lens - blurry and smudged as it is. As of 2019, my best recommendation is to pick a pipeline you think best reflects the lens you want to use to view your data - and describe that well. Use the same pipeline in your entire experiment. If you can’t run everything, randomize well. Consider the limitations of your pipeline, and be able to defend your choices. I don’t think there’s a single consensus pipeline, despite several attempts to arrive at one.
If you choose to do marker gene sequencing (16S rRNA/18S rRNA/ITS), you are dealing with primer bias and copy number. You can also only infer gene content, and only if you’re working in an environment which has been well characterised. But it’s cheap, you keep phylogenetic information, and there are a fair number of common pipelines (a common lens, if you will) that you can use.
(You can also potentially pull Jonathan Eisen’s old algorithm for correcting for copy number. I don’t have the citation on hand, nor do I know the name of the program, but it may be worth looking into if it’s seriously a concern.)
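To make the copy number idea concrete, here is a minimal sketch of the naive version of such a correction: divide each taxon's reads by an assumed 16S copy number, then renormalize. This is not the algorithm mentioned above (I don't know its details); the taxon names, counts, and copy numbers are made up for illustration, and real copy numbers are often unknown or only estimated from relatives.

```python
def correct_copy_number(counts, copy_numbers):
    """Divide each taxon's read count by its assumed marker gene copy number,
    then renormalize so the adjusted values sum to 1."""
    adjusted = {taxon: counts[taxon] / copy_numbers[taxon] for taxon in counts}
    total = sum(adjusted.values())
    return {taxon: value / total for taxon, value in adjusted.items()}


# Hypothetical two-taxon sample: taxon_a looks dominant by raw reads,
# but it is assumed to carry 7 copies of the gene versus 1 for taxon_b.
counts = {"taxon_a": 700, "taxon_b": 300}
copy_numbers = {"taxon_a": 7, "taxon_b": 1}

corrected = correct_copy_number(counts, copy_numbers)
print(corrected)
```

With these made-up numbers the apparent 70/30 split flips to 25/75, which is the kind of distortion copy number can introduce. The result is still compositional and says nothing about absolute abundance.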
If you choose to do metagenomics, you still have the ploidy problem, you to some degree lose out on phylogeny, and binning/annotation is challenging. You need to choose
In terms of statistics, there’s general agreement that your relative abundance data needs to be analysed as compositional, using a model like Gneiss, ANCOM, Phylofactor, or PhILR. There are a couple of discussions here about compositionality, and why it is a necessary lens for your data.
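As a rough illustration of what "compositional" means in practice, here is a small sketch of a centered log-ratio (CLR) transform. To be clear, this is a generic log-ratio example and not what any of the tools named above implement; the function name, the example counts, and the 0.5 pseudocount are my own assumptions.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log of each value minus the mean of the logs.
    The pseudocount is an assumed, ad hoc way to avoid log(0)."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# The same hypothetical community sequenced at two different depths.
shallow = [10, 20, 70]
deep = [100, 200, 700]

# With no zeros we can drop the pseudocount; the two results then match,
# because only the ratios between taxa carry information.
print(clr(shallow, pseudocount=0))
print(clr(deep, pseudocount=0))
```

The point of the example is that sequencing depth drops out: the transformed values depend only on ratios between taxa, which is all the data can really tell you.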
Also, keep in mind that there’s a lot of information to be gleaned from community-level metrics which may be less sensitive to your copy number bias. These statistics can give you a sense of behavior while dealing with the fact that in free-living organisms, microbial communities are highly individualized and it’s very difficult to completely change an adult’s microbial community and make them look like an entirely new person. Unless you give them an FMT. A big bolus of bacteria can shift the communities.
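For anyone unsure what I mean by community-level metrics, here is a minimal sketch of two common ones: Shannon diversity within a sample and Bray-Curtis dissimilarity between samples. The function names and the toy count vectors are made up for illustration, and a real analysis would rarefy or otherwise normalize first.

```python
import math


def shannon(counts):
    """Shannon diversity (natural log) of one sample's feature counts."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two samples over the same features:
    0 means identical counts, 1 means no shared features."""
    difference = sum(abs(a - b) for a, b in zip(sample_a, sample_b))
    return difference / (sum(sample_a) + sum(sample_b))


# Hypothetical samples over the same four features.
before = [25, 25, 25, 25]
after = [40, 30, 20, 10]

print(shannon(before), shannon(after))
print(bray_curtis(before, after))
```

These summarize the whole community at once rather than asking whether any single taxon's count is "right".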
Finally, if you hit a point where you need quantitative confirmation, you may want to look at a different technique. qPCR may give you answers you can’t get from 16S (and may function in a complementary manner).
But, at the end of the day, we are wrong. We know we’re wrong. Knowing our model is wrong or incomplete or something is part of the way we do science, triangulating until we get something better. And, hopefully, someday, we’ll have algorithms that address things like copy number, ploidy, and primer bias. But, right now, we’re doing our best?
Not sure if my very philosophical response helps at all, but I have many feelings about noise and “wrongness” in microbiome data.