*resending topic to moderators as I accidentally sent first topic w/o finishing the post (I'm so sorry!!)
Please delete if not allowed, as I reference the statistical procedures of two programs
I am working with a 16S dataset of ~80 individuals. I have calculated alpha diversity in QIIME2 using the function
qiime diversity alpha \
  --i-table filtered-table-merged.qza \
  --p-metric shannon \
  --o-alpha-diversity shannon_vector.qza
and in the phyloseq package in R using the function
The phyloseq box-and-whisker plot produced by this function shows Shannon index values between 4 and 6, while QIIME2 calculates a Shannon index per individual between 6 and 8. I am concerned about the discrepancy between the two programs, since both are using the same dataset as input. I am attaching the dataset I used with QIIME2, with the singletons removed (because I thought singletons might have caused the lower Shannon values calculated by phyloseq): SUBSET_otu_table_w_tax_qiime_merged_nosingletons.txt (3.9 KB)
From what I understand, both of the programs use the natural logarithm and should calculate the same value for the same metrics. I have tried comparing rarefied vs nonrarefied datasets, as well as unfiltered datasets vs. datasets filtered to remove singletons but the results are inconsistent with QIIME2 (range from 6-8). I was hoping somebody would be able to shed some light on this matter!
Wow! Me too!
The phyloseq code just calls vegan::diversity(), and vegan should be rock solid. However, phyloseq does some preprocessing, such as dropping OTUs that are not shared by both the OTU table and the tree, so upstream steps could change the results even if the Shannon calculation itself is the same.
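To illustrate why upstream filtering matters, here is a minimal Python sketch. It is not phyloseq's or vegan's code; the counts, feature names, and the `in_tree` set are all made up, and whether any pruning actually happened in this particular analysis is an assumption. It only shows that the same natural-log Shannon formula gives a different value once some features are dropped before the calculation.

```python
import math

def shannon(counts):
    """Natural-log Shannon index, H = -sum(p * ln p), over nonzero counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts)

# Hypothetical per-feature counts for a single sample.
counts = {"otu1": 50, "otu2": 30, "otu3": 15, "otu4": 4, "otu5": 1}

# Hypothetical set of features that survive pruning (e.g. those also in a tree).
in_tree = {"otu1", "otu2", "otu3"}

full = shannon(counts.values())
pruned = shannon(c for f, c in counts.items() if f in in_tree)

print(f"all features:    {full:.3f}")
print(f"pruned features: {pruned:.3f}")
```

Comparing the number of features in the table each program actually sees is a quick way to check whether this kind of pruning is in play.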
Let’s see what the Qiime devs say…
I recently ran into the exact same issue with Shannon diversity estimates from two different programs (except it was with another program, not phyloseq).
QIIME 2 uses scikit-bio for calculating these diversity metrics, and it looks like scikit-bio's Shannon uses log base 2, not log base e.
Other than that, @colinbrislawn's suggestions re: how phyloseq processes feature tables upstream could add to the dissimilarity in results.
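As a rough check on the base difference, here is a small Python sketch using only the standard library. It is not the scikit-bio or vegan implementation; the `shannon` helper and the sample counts are made up for illustration. It computes the same index in base e and base 2 and shows that the two differ by a constant factor of ln 2.

```python
import math

def shannon(counts, base=math.e):
    """Shannon index, H = -sum(p * log_base(p)), over nonzero counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base) for c in counts)

# Made-up feature counts for one sample.
sample = [120, 80, 40, 30, 20, 5, 3, 1, 1]

h_nats = shannon(sample)          # natural log
h_bits = shannon(sample, base=2)  # log base 2

print(f"base e: {h_nats:.3f}")
print(f"base 2: {h_bits:.3f}")
print(f"base 2 * ln(2): {h_bits * math.log(2):.3f}")  # matches the base e value
```

If the base were the only difference, multiplying the QIIME 2 range of 6 to 8 by ln 2 (about 0.693) would give roughly 4.2 to 5.5, which is in line with the 4 to 6 range reported from phyloseq.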
I hope that helps!
So now I’m wondering, what is the canonical way to calculate Shannon’s Diversity Index? Are we supposed to use
According to the original paper written by Shannon (long PDF), you can choose whatever base makes sense for your data:
The choice of a logarithmic base corresponds to the choice of a unit for measuring information. If the base 2 is used the resulting units may be called binary digits, or more briefly bits…
If the base 10 is used the units may be called decimal digits.
Why is scikit-bio measuring bits? Does anyone want to start a heated argument about logarithmic bases with me?
I think Shannon has answered that for us — there is no canonical way, and the vegan R docs seem to say as much.
Probably because that’s the most common way to measure entropy (a.k.a. Shannon’s H) in information theory. So this is probably rooted in the different traditions of computer science vs. ecology.
But that’s just a bit of speculation
Good find in the vegan docs! Looks like we have reached a consensus: there is no consensus about what base to use when calculating Shannon’s Diversity Index.
I can see why the defaults would be different between computer science vs. ecology.
And hopefully, this also answers @Muskox’s question!
Wow @colinbrislawn and @Nicholas_Bokulich!!! Thank you for helping me sort this out. It does seem like we agree to disagree on just one method. Interesting link to the phyloseq code - I thought you had to script any preprocessing steps and wasn’t aware the program did some of that itself. Very useful insight from each of you, much appreciated!