Relative Abundance considering OTU

I wanted to ask: what is the best way to assess and express the number of OTUs of one phylum across 30 samples? I have an OTU table; for example, at the phylum level I have 30 phyla and 20 samples. I want to express the relative abundance of each phylum in general. In articles I see results that are not specific to any one sample, and I don't know how to calculate that. Here is the calculation I made: I take the sum of each taxonomic group's OTUs in each sample, divide it by the total number of OTUs, and express the result as a percentage. But I'm not sure whether this is a correct way to express a taxonomic group's abundance in general.

Hi @kubra,
Welcome to the QIIME 2 forum!

I think that what you're describing is correct, but here's an example to make sure.

If your sample1 has three phyla with the following frequencies (sequence counts):

phylum1: 100
phylum2: 200
phylum3: 300

The relative abundance of each would be:

phylum1: 100 / 600 = 0.167
phylum2: 200 / 600 = 0.333
phylum3: 300 / 600 = 0.5
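
The same arithmetic as a minimal Python sketch, in case it helps to check your own numbers. This is plain Python rather than a QIIME 2 call, and the function name `relative_abundance` is just an illustrative choice:

```python
# Sequence counts for one sample (values from the example above).
sample1 = {"phylum1": 100, "phylum2": 200, "phylum3": 300}

def relative_abundance(counts):
    """Divide each taxon's count by the total count of the sample."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {taxon: count / total for taxon, count in counts.items()}

rel = relative_abundance(sample1)
for taxon, fraction in rel.items():
    print(f"{taxon}: {fraction:.3f}")
```

The fractions within a sample always sum to 1, which is one way to sanity-check a table.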

In QIIME 2, you can compute relative abundances from your FeatureTable[Frequency] using qiime feature-table relative-frequency.

Be aware that these abundances are compositional, so compositionally aware methods are required to analyze them. Microbial marker gene data are also known to be semi-quantitative at best, so the results should be interpreted with that in mind.

Good luck!


Thank you so much. In some articles I see a relative abundance reported for all samples together (as far as I understand); they do not specify any sample, and it seems to be meant mostly to give an overall idea. I have each taxonomic group's frequency at each station, but I want to present the most abundant/prominent groups overall. So I did a calculation: I take the sum of a particular taxonomic group's OTUs across all samples, then divide that number by the sum of all OTUs in my table. I then express it as a percentage to give an idea of the abundance of that taxonomic group, namely to say which group is the most prominent overall. But I can't be sure whether this is appropriate. Thanks a lot.
Best Regards

@kubra, I see, thanks for the clarification.

I would recommend computing relative frequencies on a per-sample basis first, and then presenting the median and median absolute deviation relative frequency for each taxon across all samples. Or, better yet, present the distribution of relative frequencies for each taxon with a box plot per taxon.

The approach you're describing will be subject to outlier effects (e.g., where one abnormal value throws off the mean), so it's better to present some estimate of the variance or to show the distributions as a whole.
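
Here is a rough Python sketch of the difference, using invented counts (the table below is hypothetical, not real data). It computes the pooled percentage from summed counts, then the per-sample relative frequencies with their median and median absolute deviation (MAD):

```python
from statistics import median

# Hypothetical phylum-level counts per sample; values are invented for illustration.
table = {
    "sample1": {"phylumX": 100, "phylumY": 900},
    "sample2": {"phylumX": 120, "phylumY": 880},
    "sample3": {"phylumX": 9000, "phylumY": 1000},  # one deeply sequenced, atypical sample
}

def pooled_fraction(table, taxon):
    """Sum a taxon's counts over all samples, then divide by the table total."""
    taxon_total = sum(counts[taxon] for counts in table.values())
    grand_total = sum(sum(counts.values()) for counts in table.values())
    return taxon_total / grand_total

def per_sample_fractions(table, taxon):
    """Relative frequency of a taxon computed within each sample separately."""
    return [counts[taxon] / sum(counts.values()) for counts in table.values()]

def median_and_mad(values):
    """Median and median absolute deviation from that median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return med, mad

pooled = pooled_fraction(table, "phylumX")
med, mad = median_and_mad(per_sample_fractions(table, "phylumX"))
print(f"pooled: {pooled:.3f}, median: {med:.3f}, MAD: {mad:.3f}")
```

In this made-up table the pooled value is dominated by the single atypical sample, while the median stays close to what most samples look like.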

Thank you so much for helping me. That was exactly the answer I was looking for.
Best regards🙏

Hello again. I have a similar question, though some parts are different (I think). I have 30 samples that were analyzed with a 16S-based approach. I have a phylum-level OTU table (30 × 15) that shows the number of reads for each of the 15 phyla in each sample. Let's say I take the sum of all OTUs in my table (A) and the sum of the OTUs of phylum X across the 30 samples (B). I then calculate B as a percentage of A and get 45%. I do the same for phylum Y and get, say, 35%. Can I present this result by saying that 80% of the sequences are clustered in 2 phyla, to show the most prominent phyla in my study and give an overall idea? I see similar statements in some articles, but I can't work out the approach they are following, which is why I'm asking this question :) Thanks for your help.

Hi @kubra, let's refer back to my earlier example to make sure that I'm understanding.

In this case, you can say that (for example) phylum1 and phylum2 together make up 0.5, or 50%, of your sample1.

If you would like to present this information across all samples, I would recommend computing this for each sample individually, and presenting a visual representation of the distribution as a raincloud plot and a numeric representation as a seven-number summary of the full distribution. The reason is that if you compute this across samples after summing abundances, your final estimate can be non-representative if outliers skew the distribution.
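
A minimal sketch of the numeric part in Python, again with invented counts. I'm assuming one common variant of the seven-number summary here (minimum, 10th, 25th, 50th, 75th and 90th percentiles, maximum); other variants use different percentiles, and the helper names are only illustrative:

```python
from statistics import quantiles

# Hypothetical read counts per sample; values are invented for illustration.
table = {
    "sample1": {"phylumX": 450, "phylumY": 350, "other": 200},
    "sample2": {"phylumX": 300, "phylumY": 300, "other": 400},
    "sample3": {"phylumX": 500, "phylumY": 400, "other": 100},
    "sample4": {"phylumX": 100, "phylumY": 150, "other": 750},
    "sample5": {"phylumX": 400, "phylumY": 300, "other": 300},
}

def combined_fraction(counts, taxa):
    """Fraction of one sample's reads assigned to any of the given taxa."""
    total = sum(counts.values())
    return sum(counts[t] for t in taxa) / total

def seven_number_summary(values):
    """Min, 10th, 25th, 50th, 75th, 90th percentiles, and max (assumed variant)."""
    data = sorted(values)
    # n=20 gives cut points every 5%; index k holds the (k + 1) * 5th percentile.
    cuts = quantiles(data, n=20, method="inclusive")
    return [data[0], cuts[1], cuts[4], cuts[9], cuts[14], cuts[17], data[-1]]

# Combined share of the two phyla, computed within each sample separately.
fractions = [combined_fraction(c, ("phylumX", "phylumY")) for c in table.values()]
summary = seven_number_summary(fractions)
print([round(v, 3) for v in summary])
```

The same per-sample `fractions` list is what you would pass to a plotting library for the raincloud or box plot.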

I hope this helps!


Thank you so much again, @gregcaporaso. I'm going to try everything you've told me. I was confused when I saw this in some articles, but now I get it. Thank you again for your detailed explanation.
I wish you the best.
