Using read counts for abundance analysis

Hi All,

Our lab is relatively new to metagenome/bioinformatics analysis, but we are trying to do some comparisons between microbiomes across different specimens and over time. Abundance of taxa is going to be pretty important to our analysis, but I’ve seen a lot of conflicting information about how to use Illumina data to determine relative abundance. From what I understand (and from the results of our mock community analysis), read numbers aren’t really a good proxy for relative abundance because of things like primer bias and copy number. But it seems like a lot of publications, as well as QIIME, do just use read numbers to directly determine abundance. Is there a way around the inconsistencies here?

Hi @mbUPS,

So, yes, on an absolute “are we seeing the true community” level, read numbers do not directly translate into “this is how many bacteria there are” for several reasons (not truly quantitative, compositional, the aforementioned primer bias and copy number bias). It’s also true that most people operate on the counts, or possibly filtered counts, for analysis. The problem is that we’re dealing with a model which is error prone, and we don’t generally have a better way to correct for the errors we’re seeing.

Our results are seen through the lens of the analysis, and in general, as a community, most people have chosen to use the same lens, blurry and smudged as it is. As of 2019, my best recommendation is to pick a pipeline you think best reflects the lens you want to use to view your data, and describe that well. Use the same pipeline in your entire experiment. If you can’t run everything, randomize well. Consider the limitations of your pipeline, and be able to defend your choices. I don’t think there’s a single consensus pipeline, despite several attempts to arrive at one.

If you choose to do marker gene sequencing (16S rRNA/18S rRNA/ITS), you are dealing with primer bias and copy number. You can also only infer gene content, and only if you’re working in an environment which has been well characterised. But, it’s cheap. You keep phylogenetic information, and there are a fair number of common pipelines (common lens, if you will) that you can use.
(You can also potentially pull Jonathan Eisen’s old algorithm for correcting for copy number. I don’t have the citation on hand, nor do I know the name of the program, but it may be worth looking into if it’s seriously a concern.)
If you choose to do metagenomics, you still have the ploidy problem, you to some degree lose out on phylogeny, and binning/annotation is challenging. You need to choose

In terms of statistics, there’s general agreement that your relative abundance data needs to be analysed as compositional, using a model like Gneiss, ANCOM, Phylofactor, or PhILR. There are a couple discussions here about compositionality, and why it is a necessary lens for your data.
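To give a feel for what "compositional" means in practice, here is a minimal sketch of a centered log-ratio (CLR) transform, which is a common building block for this kind of analysis. It is not what Gneiss, ANCOM, Phylofactor, or PhILR do internally; the function name, the example counts, and the pseudocount of 0.5 are all assumptions made for illustration.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's read counts.

    The pseudocount is an assumption: it only exists to avoid log(0)
    for taxa with zero reads, and its value affects the result.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [x - mean_log for x in logs]


# Hypothetical read counts for four taxa in one sample.
sample = [120, 30, 0, 850]
print(clr(sample))
```

The transformed values sum to zero, and without a pseudocount they are unchanged if every count in the sample is multiplied by the same factor, which is why sequencing depth drops out of the comparison.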

Also, keep in mind that there’s a lot of information to be gleaned from community-level metrics which may be less sensitive to your copy number bias. These statistics can give you a sense of behavior while dealing with the fact that in free-living organisms, microbial communities are highly individualized and it’s very difficult to completely change an adult’s microbial community and make them look like an entirely new person. Unless you give them an FMT. A big bolus of bacteria can shift the communities.
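As a rough illustration of community-level metrics, the sketch below computes Shannon diversity for one sample and Bray-Curtis dissimilarity between two samples, both from relative abundances. The function names and the counts are made up for the example, and a real analysis would normally go through your pipeline's own diversity tools rather than hand-rolled code.

```python
import math


def relative_abundance(counts):
    """Convert raw read counts to proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def shannon(counts):
    """Shannon diversity (natural log) of one sample."""
    props = relative_abundance(counts)
    return -sum(p * math.log(p) for p in props if p > 0)


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity computed on relative abundances.

    With both samples scaled to sum to 1, this reduces to
    1 - sum of the per-taxon minimum proportions.
    """
    pa = relative_abundance(counts_a)
    pb = relative_abundance(counts_b)
    return 1 - sum(min(a, b) for a, b in zip(pa, pb))


# Hypothetical counts for the same four taxa in two samples.
before = [400, 300, 200, 100]
after = [100, 200, 300, 400]
print(shannon(before), bray_curtis(before, after))
```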

Finally, if you hit a point where you need quantitative confirmation, you may want to look at a different technique. qPCR may give you answers you can’t get from 16S (and may function in a complementary manner).

But, at the end of the day, we are wrong. We know we’re wrong. Knowing our model is wrong or incomplete or something is part of the way we do science, triangulating until we get something better. And, hopefully, someday, we’ll have algorithms that address things like copy number, ploidy, and primer bias. But, right now, we’re doing our best?

Not sure if my very philosophical response helps at all, but I have many feelings about noise and “wrongness” in microbiome data.


Hi Justine,

Your response was definitely helpful. I was concerned there was something obvious I was missing in the literature, but it’s nice to know there is an element of arbitrariness to consider. I don’t think that we are at the point of needing qPCR yet, but I’ll definitely look into the other models and techniques you suggest. Thanks a bunch.

Hi @mbUPS,
Marker gene and metagenome sequencing on their own definitely cannot be used to measure absolute abundance (though there are several reports of using spike-ins or other supplementary methods to achieve this).

For relative abundance, results are certainly mixed as you have seen. As @jwdebelius has described, the methodology is not perfect — no method is — and the field continues to move and improve. Relative abundances can also be surprisingly accurate at times when procedural care is taken. Copy number differences are usually minor (for 16S rRNA genes, different story for ITS and other marker genes that can vary by orders of magnitude). Primer bias and extraction bias are bigger issues, but by and large we see relative abundance imprecision that is quite manageable overall, not orders of magnitude different from expectations (e.g., when using mock communities).

So yes, we recommend using read counts to determine relative abundances, but recommend that care be taken to ensure that experiments are well designed and controlled to minimize the types of biases that we know we can control.
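To make the copy number point concrete, here is a small sketch of how read counts become relative abundances, and how dividing by a per-taxon gene copy number would shift them. The taxon names and copy numbers are hypothetical and chosen only to make the arithmetic easy to follow; in practice copy numbers for many taxa are uncertain, so treat this as an illustration rather than a recommended correction.

```python
def relative_abundance(counts):
    """Scale a {taxon: count} mapping so the values sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def correct_for_copy_number(counts, copy_numbers):
    """Divide each taxon's reads by its (assumed) marker gene copy number."""
    return {taxon: n / copy_numbers[taxon] for taxon, n in counts.items()}


# Hypothetical read counts and hypothetical 16S copy numbers.
reads = {"taxon_A": 700, "taxon_B": 300}
copies = {"taxon_A": 7, "taxon_B": 3}

uncorrected = relative_abundance(reads)
corrected = relative_abundance(correct_for_copy_number(reads, copies))
print(uncorrected)
print(corrected)
```

In this made-up case taxon_A drops from 70% of reads to 50% after correction, which is a shift of well under an order of magnitude even with a fairly large difference in copy number.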


Hi @Nicholas_Bokulich, part of our issue here is that we ran both a tiered and an even mock community along with our samples, and some preliminary analysis seems to show that our read counts don’t correlate to what we know the relative abundance should be. But it sounds like that would probably be some sort of prep/operator error and there isn’t a lot we can do about it after the fact?


I think this is a pretty common phenomenon in mock-community studies


Indeed that is common: some mock communities show surprisingly close correspondence, while others I have seen show no correlation with the expected composition. Much of this is related to how the mock community was assembled and other pre-sequencing issues, e.g., with PCR and library prep. It is probably impossible to figure out what went wrong with any given mock community unless that is explicitly built into the experimental design (e.g., using dual barcoding to reduce “index hopping”). I would recommend using qPCR to assess the expected composition of the mock community pre-sequencing.
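If it helps to put numbers on how far a mock community is from expectation, one simple option is a per-taxon log2 ratio of observed to expected relative abundance. The sketch below assumes an even four-member mock community with made-up taxon names and read counts; the small pseudocount is also an assumption, added so that a taxon with zero reads does not cause a log of zero.

```python
import math


def proportions(counts):
    """Scale a {taxon: count} mapping so the values sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def log2_fold_errors(observed_counts, expected_props, pseudo=1e-6):
    """log2(observed / expected) relative abundance for each expected taxon.

    0 means the taxon matches expectation, +1 means it is about 2x
    over-represented, -1 means it is about 2x under-represented.
    """
    obs = proportions(observed_counts)
    return {
        taxon: math.log2((obs.get(taxon, 0.0) + pseudo) / (expected + pseudo))
        for taxon, expected in expected_props.items()
    }


# Hypothetical even mock community and hypothetical observed reads.
expected = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
observed = {"A": 500, "B": 250, "C": 125, "D": 125}
errors = log2_fold_errors(observed, expected)
print(errors)
```

Values within roughly one log2 unit of zero correspond to the "not orders of magnitude different" situation described earlier in the thread, while much larger deviations would suggest a problem upstream of sequencing.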


@Nicholas_Bokulich and @jwdebelius, I think we will look into the qPCR. Thanks for all of your help!