Compute alpha/beta diversity at different taxonomic levels

hulfred · January 17, 2018, 11:44pm

Hi All,

I was trying to compute alpha and beta diversity at the family level. I first retained only the features annotated to at least the family level, and then collapsed the table to the family level. This is what I ran:

qiime taxa filter-table \
--i-table table-case-visit12.qza \
--i-taxonomy taxonomy-case-visit12.qza \
--p-include f__ \
--o-filtered-table Intermediates/table-case-visit12-with-phyla.qza

qiime taxa collapse \
--i-table Intermediates/table-case-visit12-with-phyla.qza \
--i-taxonomy taxonomy-case-visit12.qza \
--p-level 5 \
--o-collapsed-table collapse-table-case-visit12.qza

The first problem is that some of the resulting features have a label like "f_", and I am not sure whether these should count as annotated to the family level. If not, could you please tell me how I can filter these features out?

And next, I tried to compute the alpha and beta diversities.

qiime diversity core-metrics-phylogenetic \
--i-phylogeny rooted-tree-case-visit12.qza \
--i-table ../1.\ FeatureTableConstruct/collapse-table-case-visit12-filtered.qza \
--p-sampling-depth 5901 \
--m-metadata-file ../1.\ FeatureTableConstruct/meta-case-visit12.tsv \
--output-dir collapse-core-metrics-results

And it showed me an error message like:

Note that the tree I used here was generated from the original sequence data (I don't know how to filter and collapse the sequence.qza data to the family level as I did for the feature table).

I was wondering if there is any way to deal with this problem?

Thank you very much,

Huang

Nicholas_Bokulich · January 18, 2018, 2:13pm

Hi @hulfred,

This sounds like an interesting approach! So you are trying to see how your samples partition based on family-level taxonomic classifications? I recommend that you use qiime taxa collapse as you have done, then use the non-phylogenetic diversity metrics (e.g., with qiime diversity core-metrics). This will compute Bray-Curtis and Jaccard distances instead of UniFrac, which will effectively answer your question.
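For reference, the non-phylogenetic run could look roughly like the sketch below. It assumes the collapsed table from the original post and the metadata file sit in the working directory. The output directory name is made up for illustration, and the sampling depth of 5901 is simply carried over from the original command, so it may need to be re-checked against the collapsed table.

```shell
# Inputs taken from the original post; adjust paths to your own layout.
TABLE=collapse-table-case-visit12.qza
METADATA=meta-case-visit12.tsv
DEPTH=5901                               # carried over from the original command
OUT_DIR=collapse-core-metrics-nonphylo   # hypothetical output directory name

if command -v qiime >/dev/null 2>&1; then
  # core-metrics needs no tree: it computes Bray-Curtis and Jaccard
  # (plus non-phylogenetic alpha metrics) directly from the table.
  qiime diversity core-metrics \
    --i-table "$TABLE" \
    --p-sampling-depth "$DEPTH" \
    --m-metadata-file "$METADATA" \
    --output-dir "$OUT_DIR"
else
  echo "qiime not found; activate your QIIME 2 environment first" >&2
fi
```

Because no --i-phylogeny input is involved, the mismatch between collapsed feature IDs and tree tips never comes into play.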

I'm sure you have good reasons for this, but I personally would not go this route. There are many reasons why a sequence might not classify to family level, and failing to classify does not necessarily mean that it is a contaminant or something that should be excluded. Hence, looking at family-level differences does not mean that you should exclude this information.

These are technically annotations. See here and here for more details.
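If you do decide to drop features whose family rank is empty, one possible approach is sketched below. It assumes Greengenes-style taxonomy strings in which an unnamed family appears as a bare "f__"; the toy labels and the output file name are made up for illustration. The --p-exclude "f__;" pattern only catches an empty family that is followed by a deeper rank, so treat it as a starting point rather than a complete filter.

```shell
# Assumption: Greengenes-style taxonomy strings, where an unnamed family
# is a bare "f__" followed by ";" or by the end of the string.
has_named_family() {
  # Succeeds only when "f__" is followed by at least one name character.
  printf '%s\n' "$1" | grep -Eq 'f__[^;[:space:]]'
}

# Toy labels (made up for illustration) to show the distinction.
has_named_family "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae" \
  && echo "named family"
has_named_family "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__" \
  || echo "empty family"

if command -v qiime >/dev/null 2>&1; then
  # Keep features that reach the family rank, then drop those matching "f__;".
  # A label that ends in a bare "f__" (no trailing ";") would slip through.
  qiime taxa filter-table \
    --i-table table-case-visit12.qza \
    --i-taxonomy taxonomy-case-visit12.qza \
    --p-include f__ \
    --p-exclude "f__;" \
    --o-filtered-table table-case-visit12-named-family.qza   # hypothetical name
fi
```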

It sounds like you are aware of the problem — when you collapse your features, the family-level taxonomies become the feature IDs, which are not present in the tree. There is no way to relabel the tree with collapsed taxonomic labels in QIIME2.
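The mismatch can be illustrated with a toy example. The IDs below are invented: real tree tips would be the original feature IDs from the denoising step, and the collapsed IDs are taxonomy strings, so the two sets have nothing in common.

```shell
# Toy illustration with made-up IDs: tree tips keep the original feature IDs,
# while the collapsed table is indexed by taxonomy strings.
tmp=$(mktemp -d)
printf '%s\n' 4b5eeb30 a1c9f2e7 0f3d8c21 > "$tmp/tree_tips.txt"
printf '%s\n' \
  'k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae' \
  'k__Bacteria;p__Bacteroidetes;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae' \
  > "$tmp/collapsed_ids.txt"

# Count collapsed IDs that also appear, as whole lines, among the tree tips.
# grep exits non-zero when nothing matches, hence the "|| true".
shared=$(grep -Fxc -f "$tmp/tree_tips.txt" "$tmp/collapsed_ids.txt" || true)
echo "collapsed feature IDs found in the tree: $shared"
rm -rf "$tmp"
```

With zero shared IDs, a phylogenetic metric has no tips to map the table's features onto.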

When you collapse at family level, you lose a good deal of phylogenetic information anyway, so Bray-Curtis and UniFrac would likely yield pretty similar results... so my advice is just to go with Bray-Curtis instead of spending time trying to hack together a collapsed phylogeny (years ago I tried something similar, only to find that Bray-Curtis provided very similar results!)

An alternative, instead of collapsing into explicit family-level taxonomic labels, would be to re-cluster your sequence variants at some defined level of similarity (whatever threshold would be considered family-level), align/build a phylogeny on those sequences, and use that for UniFrac. I have also done this recently — effectively to show how clustering changes as phylogenetic information is lost.
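As a rough sketch of the align/build/UniFrac part of that alternative, the commands below assume a prior re-clustering step has already produced a clustered table and clustered sequences. All file and directory names here are hypothetical, and the sampling depth is carried over from the original post, so it should be re-checked for the new table.

```shell
# Hypothetical file names: outputs of a prior re-clustering step.
CLUSTERED_TABLE=table-clustered.qza
CLUSTERED_SEQS=rep-seqs-clustered.qza
METADATA=meta-case-visit12.tsv
DEPTH=5901   # carried over from the original post; re-check for the new table

if command -v qiime >/dev/null 2>&1; then
  # Align, mask, build an unrooted tree, root it, then compute
  # the phylogenetic core metrics (including UniFrac) on the clustered table.
  qiime alignment mafft \
      --i-sequences "$CLUSTERED_SEQS" \
      --o-alignment aligned-clustered.qza &&
  qiime alignment mask \
      --i-alignment aligned-clustered.qza \
      --o-masked-alignment masked-aligned-clustered.qza &&
  qiime phylogeny fasttree \
      --i-alignment masked-aligned-clustered.qza \
      --o-tree unrooted-tree-clustered.qza &&
  qiime phylogeny midpoint-root \
      --i-tree unrooted-tree-clustered.qza \
      --o-rooted-tree rooted-tree-clustered.qza &&
  qiime diversity core-metrics-phylogenetic \
      --i-phylogeny rooted-tree-clustered.qza \
      --i-table "$CLUSTERED_TABLE" \
      --p-sampling-depth "$DEPTH" \
      --m-metadata-file "$METADATA" \
      --output-dir clustered-core-metrics-results
else
  echo "qiime not found; activate your QIIME 2 environment first" >&2
fi
```

Here the tree is built from the same clustered sequences that index the table, so the feature IDs and tree tips match by construction.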

I hope that helps!

hulfred · January 18, 2018, 7:08pm

Hi @Nicholas_Bokulich,

Thank you so much for your reply! It helps me a lot! One thing I would like to know more about is the alternative method you mentioned.

Could you tell me more about applying this method? Is it possible to apply it with qiime vsearch cluster-features-de-novo, setting a specific --p-perc-identity?

Thank you,

Huang

Nicholas_Bokulich · January 18, 2018, 7:22pm

Yes! Exactly. That method takes a feature table and reference sequences (e.g., the output of dada2 or deblur) and outputs a new table/sequences clustered at the defined threshold.
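A minimal sketch of that clustering call follows. The table name comes from the original post; the sequences artifact name and the output names are hypothetical, and the identity value is only a placeholder to be replaced with whatever threshold you settle on.

```shell
TABLE=table-case-visit12.qza      # feature table from the original post
SEQS=rep-seqs-case-visit12.qza    # hypothetical name for the representative sequences
PERC_IDENTITY=0.94                # placeholder value only

if command -v qiime >/dev/null 2>&1; then
  # De novo clustering: merges features whose sequences meet the identity
  # threshold and writes a new table plus the cluster centroid sequences.
  qiime vsearch cluster-features-de-novo \
    --i-table "$TABLE" \
    --i-sequences "$SEQS" \
    --p-perc-identity "$PERC_IDENTITY" \
    --o-clustered-table table-clustered.qza \
    --o-clustered-sequences rep-seqs-clustered.qza
else
  echo "qiime not found; activate your QIIME 2 environment first" >&2
fi
```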

Good luck! Let me know if you run into any problems.

hulfred · January 18, 2018, 7:33pm

Thank you! And could you tell me where I can find a specific similarity threshold for each taxonomic level?

Nicholas_Bokulich · January 19, 2018, 2:26pm

Hi @hulfred,

These percentage thresholds are fairly arbitrary and are really based on full 16S rRNA genes, not subregions. For example, 97% is often considered genus or species level (big difference, right? We're off to a good start!), but some species have more than 99% similarity, and others have ~95% similarity! Some subregions are more or less similar for different clades, etc. So I'm not sure there is a good answer to this last question.

Googling dredged up this discussion, which might give you a very rough rule of thumb. The thresholds they quote are:
97% ~ Genus
94% ~ Family
88% ~ Order
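
Note that --p-perc-identity takes a fraction between 0 and 1 rather than a percentage. A small helper, purely illustrative and named rank_to_identity here for convenience, could map the rule-of-thumb values above to that form:

```shell
# Hypothetical helper: maps a rank name to the rough thresholds quoted above,
# expressed as a fraction. These are rules of thumb, not established cutoffs.
rank_to_identity() {
  case "$1" in
    genus)  echo 0.97 ;;
    family) echo 0.94 ;;
    order)  echo 0.88 ;;
    *) echo "no rule-of-thumb threshold for rank: $1" >&2; return 1 ;;
  esac
}

echo "family-level clustering: --p-perc-identity $(rank_to_identity family)"
```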

I hope that helps!

hulfred · January 19, 2018, 8:47pm

Hi @Nicholas_Bokulich, that is super helpful! Thank you so much!
