rarefaction vs. relative abundance

nathaniel_hubert · October 7, 2020, 3:56pm

Hello!
Really enjoying this awesome workshop. Thank you!
I have always taken issue with rarefaction. I understand it is a "necessary evil" in some cases. I have compared analyses performed on rarefied data, on non-rarefied absolute abundance, and on relative abundance that I "tricked" QIIME into accepting by multiplying by 10^6 and rounding up to eliminate decimals while keeping rounding errors negligible. Given enough sequencing depth across samples (omitting poor quality samples with very shallow sequencing), these outputs are very similar. I feel that in most cases, this rounded (relative abundance * 10^6) method is superior to rarefaction as it eliminates the need to impose limits and omit data. Is it really "wrong" to perform analyses in this manner?
Thank you! Nate
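
The two normalizations compared in the post above can be sketched in plain Python. This is only an illustration under stated assumptions, not how QIIME 2 implements either step: the function names, the sample counts, the depth of 1,000 reads, and the use of round-to-nearest (rather than strictly rounding up) are all made up for the example.

```python
import random

def relative_abundance(counts):
    """Convert raw counts to proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

def scale_to_pseudo_counts(counts, scale=10**6):
    """Relative abundance times `scale`, rounded to the nearest integer.

    Assumption: round-to-nearest stands in for the rounding described
    in the post.
    """
    return [round(p * scale) for p in relative_abundance(counts)]

def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads, which is
    how rarefaction ends up omitting shallow samples.
    """
    if sum(counts) < depth:
        return None
    # Expand counts into one entry per read, labelled by feature index.
    reads = [i for i, c in enumerate(counts) for _ in range(c)]
    picked = random.Random(seed).sample(reads, depth)
    return [picked.count(i) for i in range(len(counts))]

deep = [5000, 3000, 1500, 500]  # hypothetical sample, 10,000 reads
shallow = [40, 30, 20, 10]      # hypothetical sample, 100 reads

scaled_deep = scale_to_pseudo_counts(deep)
scaled_shallow = scale_to_pseudo_counts(shallow)
rarefied_deep = rarefy(deep, depth=1000)
rarefied_shallow = rarefy(shallow, depth=1000)
```

Here the scaled table keeps both samples, while rarefying to 1,000 reads drops the shallow one. Scaling does not add information to the shallow sample: its proportions are still estimated from only 100 reads.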

ChrisKeefe · October 7, 2020, 9:46pm

@nathaniel_hubert, I'm going to leave the main part of your question to others.

I think you'll be happy to know, however, that some common diversity metrics can now accept FeatureTable[RelativeFrequency] data without any manipulation required. Specifically, any measures used in core-metrics-phylogenetic except for bray_curtis allow this. qiime diversity alpha, beta, alpha-phylogenetic, and beta-phylogenetic should let you run those calculations.

nathaniel_hubert · October 8, 2020, 7:04pm

Thank you, Chris!

That is very helpful. Is there any non-phylogenetic distance that can be used in combination with relative frequencies? I prefer non-phylogenetic distances, as I've read that a single nucleotide difference in 16S can be associated with a significant difference in biological function.

Thank you for your help.

Nate

colinbrislawn · October 11, 2020, 6:39pm

Normalization methods are totally independent from distance metrics, so no matter how you normalize, you can use any of the distances/dissimilarities listed here:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html#scipy.spatial.distance.pdist

I like Jaccard distances as they are true distances (not dissimilarities) and easy to understand. (Jaccard distance is the percent of features not shared between samples, so 70% shared features == 0.3 Jaccard distance.) There is an example of how to generate this distance in the Moving Pictures tutorial.
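
The "70% shared == 0.3" arithmetic can be checked with a short presence/absence sketch. This is a hand-rolled illustration, not the QIIME 2 or SciPy implementation; the function names and the two samples are hypothetical. Because only presence or absence matters, raw counts and relative frequencies give the same answer.

```python
def jaccard_distance(u, v):
    """Fraction of observed features that are NOT shared by both samples."""
    present_u = {i for i, x in enumerate(u) if x > 0}
    present_v = {i for i, x in enumerate(v) if x > 0}
    union = present_u | present_v
    if not union:
        return 0.0  # two empty samples: treat as identical
    shared = present_u & present_v
    return 1 - len(shared) / len(union)

def to_relative(counts):
    """Convert raw counts to proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical samples: 10 features observed overall, 7 present in both.
a_counts = [12, 8, 5, 5, 4, 3, 2, 1, 1, 0]
b_counts = [20, 1, 1, 6, 2, 9, 4, 0, 0, 7]

d_counts = jaccard_distance(a_counts, b_counts)
d_relative = jaccard_distance(to_relative(a_counts), to_relative(b_counts))
```

Both `d_counts` and `d_relative` come out at 0.3 (up to floating-point error), since 7 of the 10 observed features are shared.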

Given enough sequencing depth across samples (omitting poor quality samples with very shallow sequencing), these outputs are very similar. I feel that in most cases, this rounded (relative abundance * 10^6) method is superior to rarefaction as it eliminates the need to impose limits and omit data. (emphasis mine)

Do you mean that results are similar in practice, but relative abundance is superior in concept?

Colin

nathaniel_hubert · October 12, 2020, 5:53am

Thank you Colin,

Appreciate your help. My mention of distance metric was in reference to Chris' mention of core-metrics-phylogenetic accepting relative frequency as opposed to rarefied data. I had asked what distance metrics can be used in combination with relative frequency that are non-phylogenetic.

Does it violate assumptions to use Bray Curtis distance in combination with relative frequency data?

Thank you,
Nate

ChrisKeefe · October 12, 2020, 5:33pm

core-metrics-phylo

core-metrics-phylogenetic produces all of the non-phylogenetic metrics from core-metrics with some phylo metrics thrown in for spice (Faith's PD, weighted UniFrac, and unweighted UniFrac). Non-phylo metrics included:

Alpha Diversity:

Shannon’s diversity index
Observed Features
Pielou’s evenness

Beta Diversity:

Jaccard distance
Bray-Curtis distance
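
As a rough illustration of why the alpha metrics listed above can work on relative frequencies, here is a plain-Python sketch of their textbook formulas. It is not the q2-diversity code; the function names and example values are hypothetical, and log base 2 for Shannon is an assumption made for this example. Shannon and Pielou depend only on proportions, and Observed Features only on which entries are nonzero.

```python
import math

def proportions(values):
    """Proportions of the nonzero entries (works for counts or frequencies)."""
    total = sum(values)
    return [v / total for v in values if v > 0]

def observed_features(values):
    """Number of features with a nonzero value."""
    return sum(1 for v in values if v > 0)

def shannon(values, base=2):
    """Shannon's diversity index: -sum(p * log(p))."""
    return -sum(p * math.log(p, base) for p in proportions(values))

def pielou_evenness(values, base=2):
    """Shannon divided by its maximum, log(number of observed features)."""
    s = observed_features(values)
    if s < 2:
        return float("nan")  # evenness is undefined for fewer than 2 features
    return shannon(values, base) / math.log(s, base)

counts = [50, 25, 25, 0]           # hypothetical raw counts
relative = [0.5, 0.25, 0.25, 0.0]  # the same sample as relative frequencies

h_counts, h_relative = shannon(counts), shannon(relative)
j_counts, j_relative = pielou_evenness(counts), pielou_evenness(relative)
```

For this sample, Shannon is 1.5 bits and three features are observed, whichever form of the table is used.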

Bray-Curtis

Though bray_curtis doesn't currently accept relative frequency (RF) data, I don't think passing relative frequency data breaks any assumptions of Bray Curtis, @nathaniel_hubert. McMurdie & Holmes used RF with Bray Curtis with no issues, and I didn't see any reason not to implement it when I was tinkering with the formula.
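
For reference, the standard Bray-Curtis formula is sum(|u_i - v_i|) / sum(u_i + v_i). The sketch below applies it to raw counts and to relative frequencies. It is a plain-Python illustration with hypothetical function names and sample values, not the q2-diversity-lib implementation. With relative frequencies each sample sums to 1, so the denominator is 2 and the result reduces to half the summed absolute differences.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator

def to_relative(counts):
    """Convert raw counts to proportions that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

a = [60, 30, 10, 0]      # hypothetical sample, 100 reads
b = [100, 100, 0, 200]   # hypothetical sample, 400 reads

bc_raw = bray_curtis(a, b)
bc_relative = bray_curtis(to_relative(a), to_relative(b))

# With relative frequencies the denominator is 2, so this is equivalent.
half_l1 = 0.5 * sum(abs(x - y) for x, y in zip(to_relative(a), to_relative(b)))
```

On these made-up samples the raw-count value is 0.64 and the relative-frequency value is 0.5. The gap comes from the fourfold difference in read depth, which is why counts are usually normalized in some way before computing Bray-Curtis.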

If I remember correctly, expediency was the main reason Bray Curtis doesn't accept RF data yet - there were some unexpected test failures with RF data and that method, and we opted to ship what we had rather than delaying release of other methods in favor of that feature.

I'd love to see this enhancement made available in q2-diversity-lib, and have opened an issue to track progress. This isn't currently a high priority for me, but I would be super happy to lend a hand if it's something you're interested in working on!

nathaniel_hubert · October 12, 2020, 10:48pm

Thank you again for your incredibly helpful and informative reply, Chris!

Very glad to hear it does not violate assumptions to use RF and BC. I would really love to help make this available, but unfortunately do not have the necessary coding skillset.

It is one of my prime goals to learn more about writing scripts, the most fundamental part of the analyses I use daily. I've gotten my feet wet a few times in piecing together the workflow I use for analyses. A formal education would be ideal.

It is my understanding that most of these programs are modular, and that assembling the right blocks of code in an appropriate arrangement may suffice. It seems that even using the existing BC pipeline with my method to "trick" QIIME, while introducing negligible error, would work (though using a standard relative frequency table would be more desirable).

Maybe with the proper guidance I could figure it out, but don't know where to start. Please let me know if and how I can help - perhaps you could just push me in the right direction. Thank you again very much! Nate

ChrisKeefe · October 12, 2020, 11:47pm

@nathaniel_hubert, if you plan to continue creating and running complex bioinformatics pipelines, getting comfortable with Python may be worth your while. It will give you access to some powerful tools (the QIIME 2 Artifact API, among others), and is a relatively "easy" language in which to practice with iteration and control flow. I'm not the guy to work with you on scripting stuff, but there are lots of good data science resources online.

If you do decide you're interested in trying to add this functionality to QIIME 2, just let me know by @ mentioning me here. I'd be happy to share resources, answer questions about the codebase, and provide guidance as time allows. This probably isn't the fastest way to produce the study results you need, or the most direct path to rad scripting skills, but I'm sure we'd both learn something along the way.

Good luck out there!
Chris