Collapsing feature table to a specific taxonomic rank before diversity analysis?

Hi there!

I have a feature table from shotgun metagenomic sequencing, which I produced with MetaPhlAn3. From there on, I use QIIME to calculate alpha diversity metrics.

I am wondering whether I should collapse my feature table to the species rank before calculating alpha diversity, or whether I should use the complete table. Because of how MetaPhlAn works, I do not always have information for every microbe down to the species level (e.g. sometimes only down to genus level, sometimes family level, etc.).

This gives me some uncertainty:

The full table has significantly more "entries" than the reduced table (because a taxonomic tree is provided which shows every taxonomic rank down to the assigned rank of the microbe, see screenshot).

My concern with using the full table is this: if I have a sample with 1 species, then I have 7 entries (kingdom, phylum, class, order, family, genus, species). When there are 2 species, there can be up to 14 entries. But when a microbe was not classified down to the species level, there will be fewer than 7 entries for that microbe. Also, if 2 species belong to the same genus, the taxonomic tree down to that genus is only given once. As a result, the number of entries per microbe is not constant across samples.
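To make the counting concrete, here is a minimal Python sketch. It assumes the pipe-delimited clade names (k__...|p__...|...) that MetaPhlAn writes; the helper name expand_lineages and the example taxa are made up for illustration.

```python
def expand_lineages(lineages):
    """Return every clade row implied by a list of deepest assignments.

    Each lineage contributes one row per rank, but rows shared between
    lineages (same kingdom, same genus, ...) are only counted once.
    """
    clades = set()
    for lineage in lineages:
        ranks = lineage.split("|")
        for depth in range(1, len(ranks) + 1):
            clades.add("|".join(ranks[:depth]))
    return clades


# Hypothetical example lineages, for illustration only.
BACTEROIDES = (
    "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales"
    "|f__Bacteroidaceae|g__Bacteroides"
)
FAECALIBACTERIUM = (
    "k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales"
    "|f__Ruminococcaceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii"
)

one_species = [BACTEROIDES + "|s__Bacteroides_fragilis"]
same_genus = one_species + [BACTEROIDES + "|s__Bacteroides_ovatus"]
different_phyla = one_species + [FAECALIBACTERIUM]
genus_only = [BACTEROIDES]

for label, lineages in [
    ("one species", one_species),
    ("two species, same genus", same_genus),
    ("two species, different phyla", different_phyla),
    ("one genus-level assignment", genus_only),
]:
    print(label, "->", len(expand_lineages(lineages)), "rows")
```

With these toy lineages, one species gives 7 rows, two species in the same genus give 8, two species in different bacterial phyla give 13 (they still share the kingdom row), and a genus-level assignment gives 6. So the number of rows in the full table depends on how the taxa are related, not only on how many there are.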

On the other hand, I lose information when collapsing to one taxonomic rank. The developers of MetaPhlAn also have a tool for differential abundance testing, for which they recommend collapsing the feature table.

Do you have any advice about that?

Thanks a lot!!

Does anyone have an idea?

Hi Philipp,
As far as I know, the MetaPhlAn3 output lists microbes with their relative abundances (unless you specified otherwise), and this table is already collapsed to the species level. So the closest equivalent to an ASV or OTU table would be to filter the table so that it contains only species, since (not sure!!!) the genus and other annotation levels represent the relative abundances of all species included in that genus or taxonomic level.

Hey Timur,

Thanks for your answer!
You are right, MetaPhlAn only provides relative abundances. From there you can compute some alpha diversity metrics like observed_features or shannon_entropy.
What I am wondering is whether I should calculate diversity metrics on the full feature table (containing all taxonomic ranks from kingdom to species) or whether I should collapse the table to one rank (e.g. species) beforehand.
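Here is a small Python sketch of why the choice matters for these two metrics. It is only an illustration under assumptions: the metrics are re-implemented by hand rather than called from QIIME, log base 2 is my choice for the Shannon entropy, and the profile is a made-up sample with two species of the same genus at 50% each.

```python
import math


def observed_features(abundances):
    """Count features with a non-zero abundance."""
    return sum(1 for a in abundances if a > 0)


def shannon_entropy(abundances, base=2.0):
    """Shannon entropy of the abundances after normalising them to sum to 1."""
    values = [a for a in abundances if a > 0]
    total = sum(values)
    return -sum((a / total) * math.log(a / total, base) for a in values)


# Hypothetical full profile: every parent rank is listed at 100%,
# followed by two species at 50% each.
GENUS = (
    "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales"
    "|f__Bacteroidaceae|g__Bacteroides"
)
ranks = GENUS.split("|")
full_profile = {"|".join(ranks[:i]): 100.0 for i in range(1, len(ranks) + 1)}
full_profile[GENUS + "|s__Bacteroides_fragilis"] = 50.0
full_profile[GENUS + "|s__Bacteroides_ovatus"] = 50.0

# Keep only rows whose deepest rank is a species.
species_only = {
    clade: abundance
    for clade, abundance in full_profile.items()
    if clade.split("|")[-1].startswith("s__")
}

for label, table in [("full table", full_profile), ("species only", species_only)]:
    print(
        label,
        "observed_features =", observed_features(table.values()),
        "shannon =", round(shannon_entropy(table.values()), 3),
    )
```

For this toy sample the species-only table gives 2 observed features and a Shannon entropy of 1 bit, while the full table gives 8 observed features and a clearly higher entropy, although the community is the same. The parent rows are counted as if they were additional organisms.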

I would guess that in 16S surveys there are also sequences that can only be assigned to the phylum level, whereas other sequences can be assigned to the genus or species level. Would you collapse that table to a defined taxonomic rank before calculating statistics, or would you keep all taxonomic ranks to avoid losing information?

You can also use the option -t rel_ab_w_read_stats to obtain "absolute" abundances instead of relative ones for each sample, and then merge them into one table using the modified script that I provided here.

Here I would recommend filtering your table to include only species-level annotations, and probably unclassified sequences, since the genus-level annotations in that table include all sequences assigned to that genus (including sequences assigned to species).

In the case of MetaPhlAn3, I would filter my table to contain only annotations at the desired level (collapse), and probably keep unassigned queries as well.
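A minimal Python sketch of that filtering step, in case it helps. It assumes the profile has already been read into a dict of clade name to abundance, that clade names are pipe-delimited with rank prefixes such as g__ and s__, and that unassigned reads sit in a row called UNKNOWN; the row label may differ in your files, and the function name filter_to_rank and the numbers are made up.

```python
def filter_to_rank(profile, rank_prefix="s__", keep=("UNKNOWN",)):
    """Keep rows whose deepest rank matches rank_prefix, plus the rows in keep.

    profile: dict mapping clade name to abundance.
    rank_prefix: prefix of the desired rank, e.g. "s__" or "g__".
    keep: extra row labels to retain (assumed label for unassigned reads).
    """
    return {
        clade: abundance
        for clade, abundance in profile.items()
        if clade in keep or clade.split("|")[-1].startswith(rank_prefix)
    }


# Hypothetical profile for one sample, for illustration only.
profile = {
    "UNKNOWN": 10.0,
    "k__Bacteria": 90.0,
    "k__Bacteria|p__Bacteroidetes": 90.0,
    "k__Bacteria|p__Bacteroidetes|c__Bacteroidia": 90.0,
    "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales": 90.0,
    "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales"
    "|f__Bacteroidaceae": 90.0,
    "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales"
    "|f__Bacteroidaceae|g__Bacteroides": 90.0,
    "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales"
    "|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_fragilis": 60.0,
    "k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales"
    "|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_ovatus": 30.0,
}

species_table = filter_to_rank(profile)
genus_table = filter_to_rank(profile, rank_prefix="g__")

for clade, abundance in species_table.items():
    print(clade.split("|")[-1], abundance)
```

Checking only the last segment of the clade name matters here: a species row also contains g__ further up its lineage, so a plain substring search for the rank prefix would keep rows from deeper ranks too.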