I'm eager to open a discussion and hear from those with more experience about something I've been weighing. When I began my MSc degree, I had no prior coding experience and opted to learn QIIME2 and Python, drawn by QIIME2's excellent documentation, community, and comprehensive coverage of the entire process from sample handling to statistical analysis.
After two years, I've acquired considerable Python proficiency, and in my early PhD stage I primarily engage with QIIME2 as an API, coupled with other widely used Python packages. While this has been rewarding, it's evident that R offers significant advantages, particularly its dedicated packages for microbiome and multi-omics work, which streamline statistical analyses and visualizations. One example of the inconvenience in QIIME2: the adonis action returns its PERMANOVA results as a qzv file, which requires extra steps like saving and re-loading before any further analysis and plotting.
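To make that concrete: the pseudo-F permutation logic behind PERMANOVA is small enough to sketch in plain numpy, so the result stays a Python object instead of a qzv. This is a toy one-factor version for illustration only, not a replacement for adonis or QIIME2's diversity plugin (the function name and toy data are mine):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def permanova_1way(dm, groups, permutations=999, seed=42):
    """Toy one-factor PERMANOVA: pseudo-F plus a permutation p-value.

    dm     : square symmetric distance matrix (numpy array)
    groups : sequence of group labels, one per sample
    """
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    n = dm.shape[0]
    a = len(np.unique(groups))
    sq = dm ** 2

    def pseudo_f(g):
        # total sum of squares from all pairwise distances
        ss_total = sq.sum() / (2 * n)
        # within-group sum of squares
        ss_within = 0.0
        for label in np.unique(g):
            idx = np.where(g == label)[0]
            ss_within += sq[np.ix_(idx, idx)].sum() / (2 * len(idx))
        ss_between = ss_total - ss_within
        return (ss_between / (a - 1)) / (ss_within / (n - a))

    f_obs = pseudo_f(groups)
    # permutation test: how often does shuffling labels beat the observed F?
    hits = sum(pseudo_f(rng.permutation(groups)) >= f_obs
               for _ in range(permutations))
    p_value = (hits + 1) / (permutations + 1)
    return f_obs, p_value

# usage: two well-separated groups of 1-D "samples"
x = np.concatenate([np.arange(5) * 0.1, 10 + np.arange(5) * 0.1])
dm = squareform(pdist(x.reshape(-1, 1)))
f, p = permanova_1way(dm, ["a"] * 5 + ["b"] * 5)
```

The outputs are plain floats you can feed straight into pandas or matplotlib, with no save-and-reload round trip.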
This brings me to the pivotal question: is it worthwhile to invest time in learning R? What kind of learning curve should I anticipate? It's important to note that my primary focus is post-processing data, starting from a feature table and metadata after running DADA2 on amplicon data.
I'd greatly appreciate hearing your thoughts on this matter!
As someone who uses both, I would say it is worth learning (assuming you have time), as some of the R packages for data analysis are fantastic – although personally I do not find them as intuitive as Python.
Here is a tutorial that is five years old but should mostly still hold up; it uses qiime2R and then walks through commonly used analyses within R. Some of the analyses are available within QIIME2 itself, and if you get stuck, I've found Bard (yes, the AI) is quite good for general R code, although it does struggle with some of the packages.
Tutorial: Integrating QIIME2 and R for data visualization and analysis using qiime2R - Community Contributions / Tutorials - QIIME 2 Forum
I've been in the field for about a decade and have been fighting this fight for nearly as long. The short answer is that you can get away with knowing basic R and grumbling whenever you have to use it, as long as you're prepared to be relatively lonely and to fight most of the time. I'm kind of a curmudgeon, and if I didn't have this to be frustrated about, I'd be mad about something else.
My experience has been that you can make the jump relatively quickly: the base structure of the code is similar, and the logic is almost all the same. It's the little things that will get to you, like indexing from 1 instead of 0, the inherent messiness of the tidyverse trying to be a dataframe class and failing, and the way functions are structured. Don't expect to be fast or fluid when you start out. My personal list of reasons I dislike R is mostly irrelevant, beyond the basic lack of testing in the language. There are packages you can trust (vegan) and packages where some basic functionality is misprogrammed (e.g., weighted UniFrac in phyloseq). My approach is to do as much as I can in QIIME 2 or Python and only use R when necessary. (I'll run adonis through a standalone script, or make a modified ANCOM-BC if I need repeated measures.) I don't know that I have any mixed repos public right now, but I can see if I can find examples.
Thanks for your input, and kudos for the responses!
@jwdebelius, I guess the fighting and FOMO are constant companions in this realm. I'm pretty much vibing with your mindset. When I tackled MaAsLin2 because I needed repeated measures, I opted to run it through rpy2 – not flawless, but it got the job done.
The whole "use R when necessary" thing is a bit of a puzzle, though. It's never easy to decide when it's absolutely necessary, is it? Another classic example, like the adonis scenario, is plotting a PCoA with CI ellipses. It's doable in Python, but comparing the Python script with the phyloseq syntax makes you think.
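For the ellipse case specifically, the geometry underneath those CI ellipses is just an eigendecomposition of the group's covariance matrix. Here is a minimal numpy/scipy sketch (my own illustration, not what phyloseq actually does internally) that computes normal-theory ellipse parameters you could hand to matplotlib.patches.Ellipse:

```python
import numpy as np
from scipy.stats import chi2

def confidence_ellipse(points, level=0.95):
    """Center, width, height, and angle (degrees) of a normal-theory
    confidence ellipse for 2-D points (e.g. one group's PCoA scores).

    The return values map onto matplotlib.patches.Ellipse(center, width,
    height, angle=angle).
    """
    pts = np.asarray(points, dtype=float)
    center = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)
    # eigendecomposition of the covariance; largest eigenvalue first
    vals, vecs = np.linalg.eigh(cov)
    order = vals.argsort()[::-1]
    vals, vecs = vals[order], vecs[:, order]
    # scale the axes by the chi-square quantile for the requested coverage
    scale = chi2.ppf(level, df=2)
    width, height = 2 * np.sqrt(scale * vals)
    # orientation of the major axis
    angle = np.degrees(np.arctan2(vecs[1, 0], vecs[0, 0]))
    return center, width, height, angle

# usage: an elongated cloud along the x axis
pts = np.array([[-2.0, 0.0], [2.0, 0.0], [0.0, 0.5], [0.0, -0.5]])
center, width, height, angle = confidence_ellipse(pts)
```

One ellipse per metadata group, drawn over the ordination scatter, gets you most of the way to the phyloseq figure.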
I mean, there's always FOMO. There's also 80 million different ways to analyze microbiome data. If you get 5 analysts in a room, they'll propose at least 15 analyses, more if you provide them some kind of beverage. Picking your best path through that field is why analysis is fun. And, whether people acknowledge it or not, coding language is part of that "best" path. I can tell you half a dozen stories about analyses I thought would have made a paper but I either couldn't run because of time, funding, compute, installation, or the fact that the implementation wasn't as flexible as the devs claimed.
It's always a judgement call of what you want to do vs. what you ideally could do. You cannot possibly run every possible analysis on your data set; you have to pick what the "right" answer is. I don't bother with centroids in my work, but I've found a marginal plot (jointplot in seaborn if you're not feeling fancy) gets me to a similar place.
Legitimately, the only things I haven't found a good substitute for are adonis and some of the more fun differential abundance methods (ANCOM-BC2 with LME; phylofactor).
This is a great question! Thank you for bringing this to the forums.
My FOMO goes the other way; I see the Python libraries for ML, natural language, and web services and wish I was better at Python!
I learned R first, which is why I find your question fascinating!
What level of learning curve should I anticipate?
I find modern Python and modern R remarkably similar.
- Data structures: lists, data frames
- Program flow: functions, implicit loops, pipes
- Graphs: plotly.R, plotly.py (bonus: plotly.js lol)
- File formats: flat text (csv, tsv), HDF5, Arrow
- Package management: conda-forge
If your goal is to be a good programmer, you are not missing anything. It's all learning.
I used to struggle with this too.
Now I just use whatever language everyone else is using and put it on the resume.
And now, it's way easier to learn a new language:
Same, Jono. I'm glad I'm not the only one.