Understand diversity metrics

Hi all,

I am new to using and interpreting QIIME 2 results. Perhaps the answer is obvious, but after reading the answers in the forum it is still not clear to me.

I ran the beta diversity metrics and got the following results:

unweighted_unifrac_distance_matrix.qza
unweighted_unifrac_pcoa_results.qza
unweighted_unifrac_emperor.qzv

weighted_unifrac_distance_matrix.qza
weighted_unifrac_pcoa_results.qza
weighted_unifrac_emperor.qzv

jaccard_distance_matrix.qza
jaccard_pcoa_results.qza
jaccard_emperor.qzv

bray_curtis_distance_matrix.qza
bray_curtis_pcoa_results.qza

I am trying to replicate this in R, but the difference between the three files produced for each metric is not clear to me. For example, I have three results for unweighted UniFrac:

unweighted_unifrac_distance_matrix.qza
unweighted_unifrac_pcoa_results.qza
unweighted_unifrac_emperor.qzv

Should I use unweighted_unifrac_distance_matrix.qza to make the box-and-whisker plots? Is unweighted_unifrac_pcoa_results.qza the file I must export to R to obtain the same results that I see in unweighted_unifrac_emperor.qzv, or is the Emperor visualization enough, so that I can use that PCoA plot for my analysis?

I will take the opportunity to ask another question. I am working with the lung microbiome in three study groups (HIV with pneumonia, pneumonia without HIV, and HIV without pneumonia). I obtained these results for the Shannon index and I do not know whether the values are correct. Can this index take such high values?

Sorry for so many questions, I am new and I am trying to learn from this forum. Sorry also for my English, it is not my first language.

I'd appreciate your help




Welcome to the community! Thanks for looking through forum posts before asking, that is generally a great approach. In this case some of the docs might be more helpful to you :slightly_smiling_face:

This video from a workshop last year might be a great place to start. The link takes you to the point in the video where the discussion turns from generating the distance matrix to the PCoA. This video discusses diversity visualizations. Here is a link to the Diversity section of the Parkinson's Mouse tutorial.

I think these resources will do a better job of explaining how to go about interpreting your results than could be given here, though I would be more than happy to try to answer any other questions you may have.

The short answer to the differences in the three files:

  • distance matrix: The distance matrix calculated from your data by the respective method.
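  • PCoA results: The output of principal coordinates analysis, a dimensionality reduction step that turns the distance matrix into per-sample coordinates, so that samples separated in many dimensions can be shown in a 3-dimensional visualization.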
  • emperor.qzv: An actual visualization generated from your PCoA results.
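To make the first two items concrete, here is a minimal Python sketch. It is not QIIME 2's own implementation: it assumes a hypothetical three-sample count table, uses Bray-Curtis as the metric, and recovers only the first PCoA axis by power iteration. All function names and counts are made up for illustration.

```python
import math

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0

def distance_matrix(table, metric):
    """Return sorted sample IDs and the square matrix of pairwise distances."""
    ids = sorted(table)
    return ids, [[metric(table[i], table[j]) for j in ids] for i in ids]

def gower_center(dm):
    """Double-center -0.5 * D^2, the first step of classical PCoA."""
    n = len(dm)
    a = [[-0.5 * dm[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in a]  # row means equal column means (symmetric matrix)
    grand = sum(row) / n
    return [[a[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]

def first_axis(dm, iters=200):
    """Coordinates of each sample on the leading PCoA axis (power iteration)."""
    b = gower_center(dm)
    n = len(b)
    v = [float(i + 1) for i in range(n)]  # arbitrary starting vector
    for _ in range(iters):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            return [0.0] * n
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue; scale the unit vector by its square root.
    eig = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    return [x * math.sqrt(max(eig, 0.0)) for x in v]

# Hypothetical feature counts per sample (illustration only).
table = {"S1": [10, 0, 5], "S2": [8, 2, 5], "S3": [0, 12, 1]}
ids, dm = distance_matrix(table, bray_curtis)
axis = first_axis(dm)
for sample_id, coord in zip(ids, axis):
    print(sample_id, round(coord, 3))
```

In this toy table S1 and S2 have similar counts, so they land close together on the axis while S3 sits far away. That is the same kind of separation the Emperor plot shows, using the first three axes instead of one.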

After exploring your data in the Emperor visualization, you can test any interesting findings and generate a box-and-whisker plot using diversity beta-group-significance (DOCS) and the relevant distance_matrix.qza file.
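As a rough illustration of what goes into such a box-and-whisker plot, the sketch below splits pairwise distances into within-group and between-group sets. It is a simplification under stated assumptions: the distance values, sample IDs, and group labels are hypothetical, and the function name is made up. The actual command also runs a permutation test (PERMANOVA by default), which this sketch does not attempt.

```python
from itertools import combinations
from statistics import median

def split_distances(ids, dm, groups):
    """Split pairwise distances into within-group and between-group lists."""
    within, between = [], []
    for i, j in combinations(range(len(ids)), 2):
        target = within if groups[ids[i]] == groups[ids[j]] else between
        target.append(dm[i][j])
    return within, between

# Hypothetical distance matrix and metadata column (illustration only).
ids = ["S1", "S2", "S3", "S4"]
dm = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
groups = {"S1": "A", "S2": "A", "S3": "B", "S4": "B"}

within, between = split_distances(ids, dm, groups)
print("median within-group distance:", median(within))
print("median between-group distance:", median(between))
```

If the groups really differ, the between-group distances tend to be larger than the within-group ones, which is what the boxes in the plot let you compare by eye.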

As a word of warning, there seem to be greatly different results between QIIME 2 and phyloseq for weighted UniFrac values (relevant discussion), with the takeaway being that there seem to be some issues with phyloseq's implementation.

I am not sure what values you obtained. Generally, the more species there are and the more evenly they are distributed, the higher the index. The maximum value (reached with a perfectly even distribution) is equal to the logarithm of the number of species in your sample.
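A small Python sketch of the Shannon index may help when judging whether a value is plausible. The function name and counts are hypothetical, and the logarithm base of 2 is an assumption here, so check which base your QIIME 2 version reports. For S equally abundant species the index evaluates to log(S), which means values in the low single digits are typical.

```python
import math

def shannon(counts, base=2):
    """Shannon index: -sum(p * log(p)) over the non-zero proportions."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in props)

even = shannon([25, 25, 25, 25])    # four species, perfectly even
uneven = shannon([97, 1, 1, 1])     # four species, one dominant
print(even, uneven)
```

With four species the even community reaches the upper bound of log2(4), and the uneven one falls well below it.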

Hope this helps you move forward!


Hi @Keegan-Evans, thank you so much for your reply. This has been wonderful and it has helped me very much.

I forgot to attach the file with the Shannon index values. These are the values.


Thanks. I appreciate your help

@KatherinePena, those do look like some really high values. How did you obtain these? It looks like you have them in an Excel (or other spreadsheet) table.

Excel will collapse numbers with more precision than the cell can hold. For the first row, if the true Shannon value was 4,26045832866258 (a very reasonable value from what I can see of your data), then Excel would collapse it to 4,26045E+15 (just way too high :joy:). I think this is probably what is going on.
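A minimal Python sketch of one way this kind of corruption can arise. The mechanism is an assumption and is not confirmed in this thread: the file writes decimals with a period, while the spreadsheet is set to a locale where the comma is the decimal mark and the period groups thousands. The helper name is hypothetical.

```python
def parse_with_comma_decimal(text):
    """Parse text as a comma-decimal locale might: '.' groups thousands, ',' marks decimals."""
    return float(text.replace(".", "").replace(",", "."))

raw = "4.26045832866258"                  # value as written in a period-decimal file
correct = float(raw)                      # what the file actually says
misread = parse_with_comma_decimal(raw)   # the period is dropped as a thousands separator

print(correct)
print(misread)
```

Under that assumption the digits are unchanged, but the decimal point is lost, so a reasonable Shannon value turns into an enormous integer.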

Did you get these values from your shannon_vector.qza?


Hi @Keegan-Evans Thanks again for your reply.

Yes, I got the values from my shannon_vector.qza. I attach a photo.

Could I use those values from my Excel sheet to make box-and-whisker plots in R, or should I modify them because of those very large values?




My recommendation at this point would be to use the diversity alpha-group-significance tool (docs) to generate your box-and-whisker plots inside the QIIME 2 ecosystem.

This is probably the safest way to make sure that you don't end up with a numerical formatting issue affecting your analysis. Also, you won't have to manually break anything open, and it keeps all of your provenance information for you, which makes repeating your analysis much easier later on.

Your numbers are only so big in Excel because Excel itself has compromised them. The numbers that I see in your most recent screenshot look great!

If you really want to use R, I would be sure to copy the data straight out of the qza and not from your Excel sheet.
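One way to get the values out without a spreadsheet in between is sketched below in Python. A .qza file is a zip archive, so the standard library can read it. The function name is hypothetical, and the member name `data/alpha-diversity.tsv` is an assumption about how alpha diversity vectors are stored, so check the archive listing if nothing is found.

```python
import csv
import io
import zipfile

def read_alpha_vector(qza):
    """Return {sample_id: value} from the alpha diversity table inside a .qza archive.

    `qza` may be a file path or a file-like object.
    """
    with zipfile.ZipFile(qza) as archive:
        # Assumed layout: <uuid>/data/alpha-diversity.tsv
        name = next(
            n for n in archive.namelist() if n.endswith("/data/alpha-diversity.tsv")
        )
        with archive.open(name) as handle:
            rows = csv.reader(io.TextIOWrapper(handle, encoding="utf-8"), delimiter="\t")
            next(rows)  # skip the header line
            return {row[0]: float(row[1]) for row in rows if row}
```

For example, `read_alpha_vector("shannon_vector.qza")` would return the per-sample values as ordinary floats, which can then be written out for R. The `qiime tools export` command extracts the same data directory to disk.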


@Keegan-Evans thank you. I got it.

I am very grateful for your help.

