I’m very new to metabarcoding type analysis and have tons of questions that arise while I’m learning. One of them is, I wonder if exporting qiime2 outputs back to R and using packages like phyloseq is necessary.
For example, I have a dataset where my intention is to
- Characterise the bacterial communities of the host strain (alga)
- describe the core-bacterial communities of host strains
- find out if there’s a significant difference in bacterial community composition between host strains
To answer the above questions; 1 and 2, I found the following plugins in Qiime2:
- Taxa barplots
- feature-table core-features
I would like to know if my approach to answer question 1 and 2 is correct and also what type of plugin I should use to answer question 3 (in qiime2). Moreover, it’ll be great to know what difference it makes if I do this in R, because I really prefer doing my work in Qiime2.
Those are excellent questions but there is no definite answer to any of those inquiries. It all depends on how much flexibility and customization you want. Qiime 2 does have a very impressive number of plugins that can help analyse a variety of simple and complex question about your experiment, but it is not as customizeable, especially with regards to graphics.
So, for example, while there is a plugin to identify a core microbiome, there’s no easy way to visualize this unless you export this into another graphic environment (like R).
Characterize is a very vague term to me so I’m not sure what it is you are trying to do but bar plots is certainly one easy way to look at the composition of each sample. Check out the heatmap plugin as well. You can also create some nice ordination plots with beta diverstity and visualize with the emperor tool. Have a look through this overview tutorial diagram for a variety of options.
Yup, the core-features is the only plugin that deals with this question.
If you are looking at doing statistical testing on beta diversity matrices, you can use adonis or beta group-significance, or if you are looking at doing differential abundance testing (univariate test of each taxa/ratios across groups) you can use q2-ancom, q2-aldex2, q2-songbird, q2-corncob.
There’s also loads of other third party q2-plugins you can check out over at the QIIME 2 library.
Thank you so much for your reply.
Makes a lot of sense now.
Qiime visualizations are great, as you now know.
If you are an apt R programmer, you can indeed import the outputs of qiime2 (that is: the table.qza, the taxonomy,qza, the rooted-tree.qza and the sample-metadata.tsv) as a Phyloseq object (metadata needs second line as: #q2:types \t numeric \t catecorical … ).
My necessary packages install was:
Then, it can be a good idea to collapse the ASV and the names (taxa_glom function) because ASVs are not the best to work with. (command is — phyloseq::tax_glom(Phyloseq_Object, “Family”) — )
And there are a couple of great things you can do from there! Bunch of tutorials available online, like PCA, network analysis, DESeq2, etc.
Hope that helps, -JA
Thanks for the addition tips!
Could you please expand on this comment you made please? I do both phyloseq and non-phyloseq analysis in R and I use ASVs. I’ve never had any problems with them, nor would right off the bat suggest collapsing these at higher orders. Just curious what concerns you have.
I am always looking to get interpretable plots. And let’s say that you want to put loadings in a biplot-PCA, then ASV names are quite annoying (and it is impossible to fix that in qiime-emperor to my knowledge). Also, for classification algorithms (random forest or other), I find more interesting to work with actual taxa names rather than ASV’s UUID. Then the important features identified are usable as is (not by looking through taxonomy to change UUID to taxa names and finding out that most top10 are referring to the same taxonomic group).
I really should have written “because ASVs are *sometimes* not the best to work with.” (and thought I did). Sorry for that.
Cheers, -Jérémie Auger
You’re right, those ASV names are not publication friendly, and emperor doesn’t have a friendly way of dealing with them either yet, but this post
might offer a somewhat work around, so you can avoid those names, while keeping the vectors and their taxonomic designation which as you mentioned might be more interpretable. Not super elegant…
I think that is an interesting observation, but I would be hesitant to generalize that across all studies because it really will be dependent on the data. In fact my personal opinion is to operate at the highest resolution possible (ASVs) by default, unless you have a reason to do otherwise. For instance, your biological signal may exist at the genus level due some to shared trait. But in another example, I recently did some analysis on a project where I compared disease vs healthy mice and when I did differential abundance at the species level I found no differences in a species that we full expected to find a difference. But when I did the analysis again at the ASV level, turns out there were about 6-7 different ASVs belonging to that species and most of them did change as we expected, but 1 of them changed in the other direction and essentially masked the expected difference across the other members. So at the end of day I guess it really depends on the biological question, but I still think the default should be ASV then collapsing if you have some specific reason to do so. Others may have different opinions on this too but to me it seems a bit less risky of losing potential signals.
Btw, in phyloseq it is even more burdensome to maintain ASV names because they are the full DNA sequences, I always do rename them to ASV1, ASV2…and store the full DNA into the
refseq slot just in case. Looks much better!
No problem! I actually really appreciate reading about other’s take on these topics as they don’t have any right or wrong answers, ‘just depends’.
Thank you so much @Jeremie_Auger for taking your time to reply my queries. This is so helpful!