*resending topic to moderators as I accidentally sent first topic w/o finishing the post (I'm so sorry!!)
Please delete if not allowed, as I reference the statistical procedures of two programs
Hi everyone,
I am working with a 16S dataset of ~80 individuals. I have calculated alpha diversity in QIIME2 using the function
and in the phyloseq package in R using the function
plot_richness(df)
The box-and-whisker plot that phyloseq produces with this function shows Shannon index values between 4 and 6, while QIIME2 calculates a Shannon index between 6 and 8 per individual. I am concerned about the discrepancy between the two programs. Both programs are using the same dataset as input. I am attaching the dataset I used with QIIME2, with the singletons removed (because I thought singletons might have caused the lower Shannon values calculated by phyloseq): SUBSET_otu_table_w_tax_qiime_merged_nosingletons.txt (3.9 KB)
From what I understand, both of the programs use the natural logarithm and should calculate the same value for the same metric. I have tried comparing rarefied vs. non-rarefied datasets, as well as unfiltered datasets vs. datasets filtered to remove singletons, but the phyloseq results are still inconsistent with the QIIME2 values (which range from 6 to 8). I was hoping somebody would be able to shed some light on this matter!
Let's see what the qiime devs say. The Phyloseq code just calls vegan::diversity(), and vegan should be rock solid, but phyloseq does some preprocessing stuff like dropping OTUs that are not shared by both the OTU table and the tree, so upstream stuff could change the results even if the Shannon calculation is the same.
Hi @Muskox,
I recently ran into the exact same issue with Shannon diversity estimates from two different programs (except it was with another program, not phyloseq).
The choice of a logarithmic base corresponds to the choice of a unit for measuring information. If the base 2 is used the resulting units may be called binary digits, or more briefly bits...
If the base 10 is used the units may be called decimal digits.
Why is scikit-bio measuring bits? Does anyone want to start a heated argument about logarithmic bases with me?
I think Shannon has answered that for us — there is no canonical way, and the vegan R docs seem to say as much.
Probably because that's the most common way to measure entropy (a.k.a. Shannon's H) in information theory. So this is probably rooted in the different traditions of computer science vs. ecology.
Good find in the vegan docs! Looks like we have reached a consensus: there is no consensus about what base to use when calculating Shannon’s Diversity Index.
I can see why the defaults would be different between computer science vs. ecology.
And hopefully, this also answers @Muskox’s question!
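For anyone who wants to check the base effect by hand, here is a minimal sketch in plain Python (standard library only). The `shannon` helper and the counts are hypothetical and written just for this illustration; they are not the QIIME2, scikit-bio, or vegan code, and the counts are not taken from the attached table. It computes the same index once with the natural log and once with log base 2, and the two values always differ by a factor of 1/ln(2), roughly 1.44. That would turn a 4 to 6 range in nats into roughly 5.8 to 8.7 in bits, which is in the same neighborhood as the 6 to 8 reported above.

```python
import math


def shannon(counts, base=math.e):
    """Shannon index H = -sum(p_i * log_base(p_i)) over the nonzero counts."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) / math.log(base) for p in props)


# Hypothetical feature counts for one sample (illustration only).
counts = [120, 80, 40, 30, 20, 10]

h_nat = shannon(counts)            # natural log, units are nats
h_bits = shannon(counts, base=2)   # log base 2, units are bits

print(f"ln: {h_nat:.3f}  log2: {h_bits:.3f}  ratio: {h_bits / h_nat:.4f}")
# The ratio equals 1 / ln(2) for any sample, so one value can be
# converted to the other without recomputing from the table.
```

So if one tool reports bits and the other reports nats, multiplying the bits value by ln(2) (or recomputing with a matching base, where the tool exposes one) should bring the numbers into line, assuming both tools really see the same table after any preprocessing.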
Wow @colinbrislawn and @Nicholas_Bokulich!!! Thank you for helping me sort this out. It does seem like we agree to disagree on just one method. Interesting link to the phyloseq code - I thought you had to script any preprocessing steps and wasn’t aware the program did some of that itself. Very useful insight from each of you, much appreciated!