Taxonomy barplot .csv file for statistical analysis

arvindkannan · October 25, 2019, 1:07am

I am new to bioinformatics and not very familiar with QIIME2 or its terminology. I used QIIME2 to analyze MiSeq data from wastewater samples and obtained a taxonomy barplot.

I have a .csv file of the taxonomy barplot. I would like to know whether I could use it for statistical analysis such as Pearson correlation or canonical correspondence analysis, using external statistical software like SAS or SPSS.

Also, what is the difference between an OTU table and the .csv file obtained from the taxonomy barplot? I see that most published studies have used OTUs for their statistical analysis to calculate beta and alpha diversity, or have exported their OTU table to the R vegan package for further analysis. I am really not sure which data to use for my statistical analysis.
Thanks

jwdebelius · October 25, 2019, 7:49am

Hi @arvindkannan,

Okay, first, this is a complex question and I'm writing this over my first drink of the morning, just as fair warning. (Both for spelling and as a suggestion that you may also want a drink of your choosing.)

Some more details about your pipeline would be helpful to figure out what you have and what you need. In general, I suspect that your "barplot" file is collapsed at some taxonomic level (genus, maybe) and you have genus labels. Perhaps it has those labels as the row labels and the samples in the columns?

An OTU table is one of the rawer forms of data. It takes the sequences and clusters them into Operational Taxonomic Units (OTUs), which are an older species proxy. Most people are using denoisers now (DADA2, Deblur), which give better resolution than an OTU; their output is usually called an amplicon sequence variant (ASV). But, at the end of the day, an OTU/ASV table is a table of 16S rRNA sequences and how often each one was counted in each sample.

We can use those sequences directly to build a tree or infer taxonomy. The tree lets us look at the relationship between organisms based on their sequences, which can be a better way to measure relationships than names. Most of the common names were based on physically observable characteristics, and sometimes closely related organisms don't look alike. (In the example labels further down, the whale and the cow share an order that the horse does not.) (As a side note, have you seen all the recent press about how chickens are dinosaurs? Because that's in the same vein.)

We can also use the OTUs/ASVs to predict a name based on what we already know about other bacteria and their names. Those taxonomy labels often look like this:

k__Animalia; p__Chordata; c__Mammalia; o__Carnivora; f__Mustelidae; g__Enhydra; s__lutris
k__Animalia; p__Chordata; c__Mammalia; o__Carnivora; f__Ursidae; g__; s__
k__Animalia; p__Chordata; c__Mammalia; o__Artiodactyla; f__Balaenopteridae; g__Balaenoptera; s__
k__Animalia; p__Chordata; c__Mammalia; o__Artiodactyla; f__Bovidae; g__Bos; s__taurus
k__Animalia; p__Chordata; c__Mammalia; o__Perissodactyla; f__Equidae; g__Equus; s__ferus
k__Animalia; p__Chordata; c__Aves; o__Galliformes; f__Phasianidae; g__Gallus
k__Animalia; p__Chordata; c__Dinosauria; o__Saurischia; f__Tyrannosauridae; g__Tyrannosaurus

(Remember what I said about relatedness versus appearance? Check out those taxonomic labels!)

We could take those labels and collapse them at a given level. For example, we might re-label each group of OTUs by class, which would give us

k__Animalia; p__Chordata; c__Mammalia
k__Animalia; p__Chordata; c__Aves
k__Animalia; p__Chordata; c__Dinosauria
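
To make the collapsing step concrete, here is a minimal sketch in plain Python. The feature IDs (`asv1`, ...), sample names, counts, and the `collapse` helper are all made up for illustration; this is not how QIIME2 implements it, just the same idea of summing every feature that shares the first few ranks of its label.

```python
from collections import defaultdict

# Hypothetical toy data: feature ID -> taxonomy label, feature ID -> counts per sample.
taxonomy = {
    "asv1": "k__Animalia; p__Chordata; c__Mammalia; o__Carnivora; f__Mustelidae; g__Enhydra; s__lutris",
    "asv2": "k__Animalia; p__Chordata; c__Mammalia; o__Artiodactyla; f__Bovidae; g__Bos; s__taurus",
    "asv3": "k__Animalia; p__Chordata; c__Aves; o__Galliformes; f__Phasianidae; g__Gallus",
}
counts = {
    "asv1": {"sampleA": 10, "sampleB": 0},
    "asv2": {"sampleA": 5, "sampleB": 7},
    "asv3": {"sampleA": 1, "sampleB": 20},
}


def collapse(taxonomy, counts, level):
    """Sum the counts of all features that share the first `level` ranks."""
    collapsed = defaultdict(lambda: defaultdict(int))
    for feature_id, label in taxonomy.items():
        ranks = [rank.strip() for rank in label.split(";")]
        key = "; ".join(ranks[:level])  # e.g. level=3 keeps kingdom, phylum, class
        for sample, n in counts[feature_id].items():
            collapsed[key][sample] += n
    return {key: dict(row) for key, row in collapsed.items()}


by_class = collapse(taxonomy, counts, level=3)
for label, row in by_class.items():
    print(label, row)
```

Note that the two mammal features end up in a single row, so after collapsing you can no longer tell which of them drove a change.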

Collapsing taxonomies makes it easier to make plots, for example, but it also means you're giving up valuable information! I find it rare that I see behavior at a family level that didn't come from just 1 or 2 OTUs/ASVs that I could find just as easily on their own.

My recommendation would be not to work with SAS, Stata, or SPSS for microbiome data. In general, it's bad to assume your microbiome data is independent and normal (except under very specific circumstances), and my (somewhat limited) experience with these programs is that they don't do a good job with the complexities of microbiome data. I think "Microbiome Datasets Are Compositional: And This Is Not Optional" should be required reading for anyone doing feature-based analysis in the field, and hopefully it explains more deeply than I can why this is important. There are also a lot of forum posts on the topic that I'd recommend searching.
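
As a small illustration of what "compositional" means in practice, here is a sketch of the centered log-ratio (CLR) transform, one transform commonly discussed in compositional data analysis. The `clr` helper, the example counts, and especially the pseudocount of 0.5 used to handle zeros are assumptions for illustration, not a recommendation from this thread.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's counts.

    The pseudocount (an assumed value) is added so that zeros can be logged.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [x - mean_log for x in logs]


# Hypothetical counts for four features in a single sample.
sample = [120, 30, 0, 850]
transformed = clr(sample)
print([round(v, 3) for v in transformed])
```

The transformed values always sum to zero within a sample, and without a pseudocount they do not change if every count is multiplied by the same factor, which is why sequencing depth alone does not shift them.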

I will admit that I try to avoid SAS whenever possible (much to the dismay of certain bosses), and I don't think it has the capability to handle this kind of data. For these analyses, you need to work in QIIME, R, Python, or something similar. It's not an issue of taking the data out of QIIME2; it's an issue of using the right tools for the job and, unfortunately, that's not SAS. ANCOM, on the other hand...

I'm less familiar with CCA and its assumptions, but NMDS or PCoA are more common here, and are primarily used for beta diversity. You can run those in QIIME or R, but they typically expect a distance matrix that you precalculate rather than a Euclidean assumption. (Again, I'm not a CCA person, so if it's non-Euclidean, I apologize, and it may be appropriate.)
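
For a sense of what "precalculating a distance matrix" involves, here is a minimal sketch that computes pairwise Bray-Curtis dissimilarity between samples in plain Python. Bray-Curtis is my choice of example metric, and the sample names and counts are invented; in a real analysis the matrix would normally come from QIIME or an R/Python package, and would then be passed to PCoA or NMDS.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


# Hypothetical feature counts per sample (same feature order in every list).
table = {
    "sampleA": [10, 5, 1],
    "sampleB": [0, 7, 20],
    "sampleC": [10, 5, 1],
}

samples = sorted(table)
# Square, symmetric matrix with zeros on the diagonal.
distance_matrix = [
    [bray_curtis(table[a], table[b]) for b in samples] for a in samples
]

for name, row in zip(samples, distance_matrix):
    print(name, [round(d, 3) for d in row])
```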

Hope this helps make things slightly clearer.

Best,
Justine