Shannon Index Varies with Phyloseq

Muskox · August 23, 2018, 2:23am

*resending topic to moderators as I accidentally sent first topic w/o finishing the post (I'm so sorry!!)
RESEND
Please delete if not allowed, as I reference the statistical procedures of two programs

Hi everyone,

I am working with a 16S dataset of ~80 individuals. I have calculated alpha diversity in QIIME2 using the function

qiime diversity alpha \
  --i-table filtered-table-merged.qza \
  --p-metric shannon \
  --o-alpha-diversity shannon_vector.qza

and in the phyloseq package in R using the function

plot_richness(df)

The box-and-whisker plot that phyloseq produces with this function shows Shannon index values between 4 and 6, while QIIME2 calculates per-individual Shannon values between 6 and 8. I am concerned about the discrepancy between the two programs, since both use the same dataset as input. I am attaching the dataset I used with QIIME2, with the singletons removed (because I thought singletons might have caused the lower Shannon values calculated by phyloseq): SUBSET_otu_table_w_tax_qiime_merged_nosingletons.txt (3.9 KB)

From what I understand, both programs use the natural logarithm and should calculate the same value for the same metric. I have tried comparing rarefied vs. non-rarefied datasets, as well as unfiltered datasets vs. datasets filtered to remove singletons, but the phyloseq results remain inconsistent with QIIME2's (which range from 6 to 8). I was hoping somebody would be able to shed some light on this matter!

colinbrislawn · August 23, 2018, 5:44am

Wow! Me too!

Let's see what the qiime devs say. The Phyloseq code just calls vegan::diversity(), and vegan should be rock solid, but phyloseq does some preprocessing stuff like dropping OTUs that are not shared by both the OTU table and the tree, so upstream stuff could change the results even if the Shannon calculation is the same.
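To illustrate the point about upstream preprocessing, here is a minimal sketch in plain Python. It does not reproduce what phyloseq or vegan actually do; the feature IDs, the counts, and the `kept` set are all hypothetical. It only shows that the same Shannon formula gives a different value once some features are dropped from the table.

```python
import math

def shannon(counts):
    """Natural-log Shannon index over the nonzero counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts)

# Hypothetical counts for one sample, keyed by feature ID.
sample = {"otu1": 50, "otu2": 30, "otu3": 15, "otu4": 4, "otu5": 1}

# Hypothetical set of features that survive a preprocessing step
# (for example, features that are also present in a tree).
kept = {"otu1", "otu2", "otu3"}

h_full = shannon(sample.values())
h_pruned = shannon(c for f, c in sample.items() if f in kept)

print(f"all features:    {h_full:.3f}")
print(f"after filtering: {h_pruned:.3f}")
```

In this toy case the filtered table has the lower index, because two rare features no longer contribute.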

Let's see what the Qiime devs say...

Colin

Nicholas_Bokulich · August 23, 2018, 4:34pm

Hi @Muskox,
I recently ran into the exact same issue with Shannon diversity estimates with 2 different programs (except it was with another program, not phyloseq)

QIIME 2 uses scikit-bio for calculating these diversity metrics, and it looks like scikit-bio's Shannon function uses log base 2, not log base e.

Other than that, @colinbrislawn's suggestions re: how phyloseq processes feature tables upstream could add to the dissimilarity in results.
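To make the effect of the logarithm base concrete, here is a small sketch in plain Python. It calls neither scikit-bio nor vegan, and the `shannon` helper and the counts are made up for illustration. It computes the index for one sample with both bases, then rescales the natural-log range reported above (4 to 6) into base 2.

```python
import math

def shannon(counts, base=math.e):
    """Shannon index H = -sum(p_i * log_base(p_i)) over nonzero counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base) for c in counts)

# Hypothetical feature counts for one sample (not from the attached table).
counts = [120, 80, 40, 30, 20, 10]

h_nats = shannon(counts)          # natural log
h_bits = shannon(counts, base=2)  # log base 2

print(f"ln:    {h_nats:.3f}")
print(f"log2:  {h_bits:.3f}")
print(f"ratio: {h_bits / h_nats:.4f}")  # equals 1 / ln(2), about 1.4427

# Rescale the natural-log range of 4 to 6 into base 2.
print([round(h / math.log(2), 2) for h in (4, 6)])  # [5.77, 8.66]
```

A natural-log range of 4 to 6 maps to roughly 5.8 to 8.7 in base 2, which is in the same neighborhood as the 6 to 8 reported by QIIME2, so the base alone could plausibly account for most of the gap.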

I hope that helps!

colinbrislawn · August 23, 2018, 6:32pm

Interesting!

So now I'm wondering, what is the canonical way to calculate Shannon's Diversity Index? Are we supposed to use log2() or ln()?

According to the original paper written by Shannon (long PDF), you can choose whatever base makes sense for your data:

The choice of a logarithmic base corresponds to the choice of a unit for measuring information. If the base 2 is used the resulting units may be called binary digits, or more briefly bits...
If the base 10 is used the units may be called decimal digits.
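In formula terms, changing the base only rescales the index by a constant. The notation here is an assumption rather than something taken from the thread: $p_i$ is the proportion of feature $i$ in a sample, $S$ is the number of features, and $H_b$ is the index computed with log base $b$.

```latex
H_b = -\sum_{i=1}^{S} p_i \log_b p_i = \frac{H_e}{\ln b},
\qquad
H_2 = \frac{H_e}{\ln 2} \approx 1.4427\, H_e,
\qquad
H_{10} = \frac{H_e}{\ln 10} \approx 0.4343\, H_e
```

So values in bits, natural-log units, and decimal digits rank samples identically; only the scale differs.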

Why is scikit-bio measuring bits? Does anyone want to start a heated argument about logarithmic bases with me?

Colin

Nicholas_Bokulich · August 23, 2018, 7:05pm

I think Shannon has answered that for us: there is no canonical way, and the vegan R docs seem to say as much.

As for why scikit-bio measures bits: probably because that's the most common way to measure entropy (a.k.a. Shannon's H) in information theory. So this is probably rooted in the different traditions of computer science vs. ecology.

But that's just a bit of speculation

colinbrislawn · August 23, 2018, 7:51pm

Good find in the Vegan docs! Looks like we have reached a consensus: there is no consensus about what base to use when calculating Shannon's Diversity Index.

I can see why the defaults would be different between computer science vs. ecology.

And hopefully, this also answers @Muskox's question!

Colin

Muskox · August 29, 2018, 3:44am

Wow @colinbrislawn and @Nicholas_Bokulich!!!! Thank you for helping me sort this out. It does seem like we agree to disagree on just one method. Interesting link to the phyloseq code - I thought you had to script any preprocessing steps and wasn't aware the program did some of that itself. Very useful insight from each of you, much appreciated!