I am looking for some additional resources or suggestions to help me get started using PICRUSt2. I am using the qiime2 plugin for PICRUSt2 (installed as directed here: q2 picrust2 Tutorial · picrust/picrust2 Wiki · GitHub) and have been through the brief mammalian stool tutorial successfully. I’ve also read through the picrust wiki and tutorials.
I have ~300 human gut microbiome samples and have previously followed a filtering/processing/diversity analysis workflow using phyloseq and MaAsLin2. (An outside lab ran the 16S sequencing and taxonomic classification using qiime2/dada2 and provided me with the files.)
I have a few questions for getting started with running the q2-PICRUSt2 pipeline:
1. Should I start with the original data (the .qza files) provided to me or the filtered data? I ask because I did some taxonomic/prevalence filtering in phyloseq; thus the number of taxa in my samples is different in my post-phyloseq workflow. Would this impact my PICRUSt results? Based on the information in the PICRUSt2 tutorial, it seems like I would be OK to use the filtered data. (PICRUSt2 Tutorial (v2.3.0 beta) · picrust/picrust2 Wiki · GitHub)
2. If the answer to #1 is that it is OK to use my post-phyloseq data, what is the best way to export the phyloseq object into PICRUSt?
I’m pretty new to qiime2/picrust/etc, so thanks for any advice.
Thanks for taking the time to read through all the tutorials/docs first!
The way I see it, you have a couple of options here:
1. Use your phyloseq table directly in the standalone PICRUSt2 tool; that way you don’t have to worry about importing into QIIME 2. This tutorial should help you export your phyloseq data out of R, though note that I haven’t tested the code myself: Importing dada2 and Phyloseq objects to QIIME 2. An added bonus of the standalone route is that, currently, q2-picrust2 only works with QIIME 2 versions <2019.10, so the plugin route would mean installing an additional, older version of QIIME 2.
2. Use the original .qza files you have in QIIME 2 with q2-picrust2. You could run them without filtering, but I don’t think some filtering is going to have a significant effect on your PICRUSt2 analysis. In fact, to save computational time and resources, I would even recommend doing some filtering on your QIIME 2 tables first, so matching your phyloseq filtering seems like a good idea. Take a look at this tutorial for the various ways you can filter your QIIME 2 table: Filtering data — QIIME 2 2021.2.0 documentation
I was able to follow the tutorial and export my phyloseq objects from R. However, I am thinking it makes more sense to use the original .qza files. q2-picrust2 requires a FeatureData[Sequence] artifact; my phyloseq object doesn’t use this piece (just the Frequency and Taxonomy artifacts), and I’m not sure if I can use a filtered Frequency artifact with an unfiltered Sequence artifact. Are there issues with matching them up?
Additionally, I am currently running the q2-picrust2 pipeline and it seems to be taking forever. Is there a recommended maximum number of samples or taxa to use for the qiime2 plugin? It’s supposed to take ~15 min and 3.8 GB of RAM but it’s been more than 30 minutes now.
Note: It does seem like the q2-picrust2 is now working for qiime2-2021.2. This is the version I used when following the installation and didn’t run into any problems (perhaps because I was also installing a brand new Qiime2 Virtual Box setup with the 2021.2 Core). I didn’t have any problems with 2021.2 when running the mammals tutorial.
Yes, phyloseq doesn't need a representative sequence file because it has access to that info on the fly within a phyloseq object's table. However, in the post I linked earlier, it shows how to create one so you can import into qiime2, see this section:
Bonus: Export and Representative Sequences from dada2:
So ultimately, you can export what you need from phyloseq or use the original files you had already in QIIME 2. Totally up to your preference.
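On the question of matching a filtered frequency table with an unfiltered sequence set, here is a minimal Python sketch of the idea: keep only the representative sequences whose feature IDs are still in the filtered table, and flag any table features that have no sequence at all. The function name, feature IDs, and sequences are made up for illustration. Within QIIME 2 itself, I believe `qiime feature-table filter-seqs` with its `--i-table` option does the same job on artifacts, but check its `--help` output for your version.

```python
def filter_sequences_to_table(rep_seqs, feature_ids):
    """Keep only sequences whose IDs are present in the filtered table.

    rep_seqs: dict mapping feature ID -> sequence (the unfiltered set).
    feature_ids: iterable of feature IDs remaining in the filtered table.
    Returns (kept, missing), where `missing` lists table features that
    have no matching sequence and would need attention before prediction.
    """
    wanted = set(feature_ids)
    kept = {fid: seq for fid, seq in rep_seqs.items() if fid in wanted}
    missing = sorted(wanted - set(rep_seqs))
    return kept, missing


# Toy data (hypothetical IDs and sequences, for illustration only).
rep_seqs = {"asv1": "ACGTACGT", "asv2": "TTGACCGA", "asv3": "GGCATTAC"}
filtered_table_ids = {"asv1", "asv3"}

kept, missing = filter_sequences_to_table(rep_seqs, filtered_table_ids)
print("kept:", sorted(kept))
print("missing from sequences:", missing)
```

If `missing` is non-empty, the table and the sequences come from different runs or naming schemes, and that is the case I would sort out before running anything.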
I'm wondering where you are getting this estimate? From my experience the full pipeline usually takes much longer than this, and of course the more representative sequences you have the longer it takes.
Estimating the run time of bioinformatic processes is extremely difficult because there are so many variables at play. As long as you are not getting an error, though, I recommend just letting it run. Allocating plenty of RAM to the environment/task is a good idea to avoid running out of memory in the middle of the task. Increasing the number of threads with the --p-threads parameter can also significantly reduce run time (when a task allows it), though often at the expense of requiring more RAM.
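As a rough sketch of the thread setting, the Python snippet below assembles a `qiime picrust2 full-pipeline` command with `--p-threads` set from the number of available cores. The flag names `--i-table`, `--i-seq`, and `--output-dir` are taken from my reading of the q2-picrust2 tutorial, and the file names are placeholders, so please confirm both against `qiime picrust2 full-pipeline --help` before relying on it. The snippet only builds and prints the command; it does not run anything.

```python
import os


def build_full_pipeline_command(table, seqs, output_dir, threads=None):
    """Build (but do not run) a q2-picrust2 full-pipeline command.

    If `threads` is not given, leave one core free for the rest of the
    system. This default is an arbitrary choice, not a recommendation
    from the plugin.
    """
    if threads is None:
        threads = max(1, (os.cpu_count() or 1) - 1)
    return [
        "qiime", "picrust2", "full-pipeline",
        "--i-table", table,          # FeatureTable[Frequency] artifact
        "--i-seq", seqs,             # FeatureData[Sequence] artifact
        "--output-dir", output_dir,
        "--p-threads", str(threads),
    ]


# Placeholder file names, for illustration only.
cmd = build_full_pipeline_command(
    "table.qza", "rep-seqs.qza", "q2-picrust2_output", threads=4
)
print(" ".join(cmd))
```

Keep in mind that more threads usually means more memory in use at once, so on a virtual machine with limited RAM a smaller thread count may be the safer choice.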
Great! Glad to hear the issue has now been resolved.
Thanks for your responses! I ultimately decided to use the original files from QIIME 2 and then re-process them in the same way that I had in phyloseq. (I also managed to export the phyloseq OTU table to a .biom format using this information: Importing dada2 and Phyloseq objects to QIIME 2 - #4 by ChristianEdwardson).
I’m wondering where you are getting this estimate?
The tutorial says that the full PICRUSt2 pipeline takes about "~15 min and 3.8 GB of RAM". Perhaps this is just for the mammal data. Mine took about an hour, even though I allocated 10 GB to the VirtualBox machine. It could just be that my computer is slow, or that I have more samples.
Any suggestions for additional downstream tools? I imported my pathways & ko back into R to do some ordination and PERMANOVA testing. I also read that some people like to use the STAMP software, but I've also seen some instances of differential abundance testing for the metagenomes and pathways. However, I haven't been able to figure out if this is kosher or not.
Hi @basil0125,
Sorry just noticed this reply now, must have missed its notification.
The ~15 minute estimate in that tutorial applies to the tutorial dataset, so every dataset is going to take a different amount of time depending on its size, structure, and complexity. CPU power, memory, etc. are going to have an important effect as well. But glad you were successful and it completed!
This is a pretty open-ended question with no real right answer, but I think you're on the right track. You can treat the new function feature table you get as if it were your regular bacteria table. So the same diversity analyses, like alpha and beta diversity, might be useful, and even the same differential abundance tools can be used. Whether you use STAMP or not is up to you. There'll be people who will tell you it's fine, but I'm thinking most people these days will tell you there are better options. STAMP was made some time ago, and it was convenient at a time when we hadn't developed a lot of the other current tools. It does not address the compositionality of your data (at least not in the original versions I used), which I understand not everyone agrees with, but even if you don't think microbiome data is compositional, there are still better and more sophisticated tools that I would recommend. Here is a list of some of those tools I put together that you can browse through. Ultimately, which of those you use will depend on your question and the level of complexity you're willing to work with. Here is a recent paper that highlights the importance of which DA tool you use: https://www.biorxiv.org/content/10.1101/2021.05.10.443486v1.abstract
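To make the compositionality point a bit more concrete, here is a minimal Python sketch of a centered log-ratio (CLR) transform, one common way compositional data are handled before ordination or testing. It is not what any particular DA tool does internally. The pathway abundances are made-up toy numbers, and the pseudocount of 0.5 used to handle zeros is an assumption you would want to think about for your own data.

```python
import math


def clr_transform(abundances, pseudocount=0.5):
    """Centered log-ratio transform of one sample's abundances.

    A pseudocount is added first so that zeros can be log-transformed;
    its value (0.5 here) is an illustrative assumption.
    """
    logs = [math.log(value + pseudocount) for value in abundances]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log is the same as dividing by the geometric mean.
    return [value - mean_log for value in logs]


# Toy predicted pathway abundances per sample (hypothetical values).
pathway_table = {
    "sampleA": [120, 0, 30, 50],
    "sampleB": [10, 5, 80, 5],
}

clr_table = {
    sample: clr_transform(values) for sample, values in pathway_table.items()
}
for sample, values in clr_table.items():
    print(sample, [round(v, 3) for v in values])
```

Each transformed sample sums to zero by construction, which is a quick sanity check that the transform was applied per sample rather than per pathway.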