Dear QIIME Team
Does qiime2 support using PICRUST? I may need generating files for Picrust analysis. But don't know if qiime2 can exporting pertinent files for this.
thanks!
Dear QIIME Team
Does qiime2 support using PICRUST? I may need generating files for Picrust analysis. But don't know if qiime2 can exporting pertinent files for this.
thanks!
Hi @hongwei2017! PICRUSt requires an OTU table generated from closed-reference OTU picking using a supported reference database (e.g. Greengenes). PICRUSt can also be used with other reference databases if the pre-trained databases aren't what you want to use. I can't get into those details here but the PICRUSt docs cover these topics.
As of the latest 2017.9 release that went live last week, QIIME 2 now supports closed-reference and de novo OTU picking. You can install 2017.9 and perform closed-reference OTU picking with qiime vsearch cluster-features-closed-reference
, then export the OTU table using qiime tools export
and use the resulting .biom
file with PICRUSt.
Note: I've heard rumors of a PICRUSt plugin being written for QIIME 2 but don't know any more details. Perhaps a plugin will be available someday to make this process easier!
@jairideout
thanks a lot for your replying! good to know all these.
Hi @jairideout,
Is there a way to plug Deblured/DADA2ed data (i.e., ref seq and otu table) into the closed-reference pipeline to get GG OTU names on sOTU taxon tags?
Alternatively, could one (as in - is this legitimate) blast the deblured sequences against GG 99% (or 97%) rep set maps, and propagate the data to the deblured OTU table?
It's weird to report diversity using debluered data and predicted metagenomes using close-refed data.
Thanks!
Gil
Hi @Gil_Sharon!
Yes, I think that would work: denoise your data with DADA2 or Deblur, then feed the feature table and representative sequences into qiime vsearch cluster-features-closed-reference
, along with Greengenes reference sequences matching the exact version of Greengenes that PICRUSt was trained on (or re-training PICRUSt on the reference database/version of your choice).
@wasade, any thoughts here?
Thanks, @jairideout!
Gave it a try -
!qiime tools import \
--input-path /gg_13_5_otus/rep_set/99_otus.fasta \
--output-path gg_13_5_otu_99.qza \
--type 'FeatureData[Sequence]'
!qiime vsearch cluster-features-closed-reference \
--i-sequences rep-seqs-deblur.qza \
--i-table table_deblur.qza \
--i-reference-sequences gg_13_5_otu_99.qza \
--p-perc-identity 1 \
--p-threads 0 \
--output-dir closedRef_forPICRUSt
EDIT - seems to have worked well. I got an OTU table with GG OTU IDs.
There's a set of sequences that were not matched. I suppose this is because I used 99% and per-iden 1 (most likely due to some taxon filtering). Going to feed these to PICRUSt and see what I get.
Thanks again!
Gil
Went through PICRUSt with no issues, thanks @jairideout.
a suggestion - it would be helpful to maintain the taxon id and the closed ref OTU ID (form closed red clustering against GG 99%), and maybe integrate to the deblur/dada2 pipeline (at least as a workaround until there's a proper picrust plugin) so that the feature table has both IDs thorughout the analysis.
Excellent! I didn't realize that qiime vsearch
supported a mode to expand an existing feature table. I've previously recommended performing an expansion using the otu_map.txt
from QIIME1's pick_closed_reference_otus.py
step.
Yea, I couldn't get it to work back then, not sure what I was missing. Either way, that suggestion definitely inspired this question, so thanks!
That's great news! I think you're one of the first QIIME 2 users to try out this workflow (ASVs --> closed-reference --> PICRUSt)
That's an interesting idea! To be honest, this workaround would be pretty low priority for us to implement (at least within the next few months) -- it'd probably be worth having someone implement a PICRUSt plugin in the meantime. If you or anyone else is interested in implementing this plugin we have a Plugin Developer's Guide to get you started!
Not at the level of writing plugins just yet. Still trying how to use the Artifact API instead of the command line interface.
Thanks for the help!
Hi @Gil_Sharon, have you tried adjusting the --p-perc-identity and got more matched sequences? I followed your example (99 otu, --p-perc-identity 1) on a real dataset, and got less than 16% of features matched to the closed OTUs. I am adjusting the --p-perc-identity now, but would like to hear your updates about this.
Hi @jjmmii, no. In a sense, it might beat the purpose. I think using the 97% otus would be a better compromise.
This might also be an issue of choosing the right reference database. In my (now dated) experience, GG isn’t always the best choice. Silva might work better for you.
Good luck!
@jjmmii, out of curiosity, what environment were the samples sourced from? That's quite a low read recruitment.
Hi @wasade, the environment is human feces. Yes, I wonder if this is considered normal to have 16% features matching exactly the sequences in 99 otu file.
I agree with @Gil_Sharon that adjusting --p-perc-identity especially to anything lower than the clustering threshold of the otu file might "in a sense beat the purpose", that is, using 99otu in this case means --p-perc-identity shouldn't be less than 0.99, right?
That is quite surprising. Normally, I'd expect in and around 85% of the reads to recruit to Greengenes if they're from human fecal, similarly for SILVA. Would it be possible to share a subset of the sequences which fail to recruit? Related, and if you have bandwidth, could you rerun 97% against the 97% OTUs and report the percentage which recruit?
Best,
Daniel
Other folks have also had trouble getting the DADA2 to closed-ref OTU conversion to work well. One solution is described in this GitHub repo. They even repeat the initial validation to show that it works.
https://github.com/vmaffei/dada2_to_picrust
Let me know if that helps,
Colin
Thank you @wasade and @colinbrislawn for your comments. Will definitely try DADA2 -> ASR but still like to know what happened to my vsearch run.
@wasade, I made a mistake in counting the % recruitment. I reran several parameters and my updated stats are described below:
Firstly, my data is here: (link removed) (730.3 KB) And here are my clustered_table.qzv (343.8 KB) and unmatched_sequences.qzv (1.5 MB) from my Greengenes 99otu --p-perc-identity 1 run. My vsearch command is
qiime vsearch cluster-features-closed-reference --i-sequences rep-seqss/rep-seqs_subbatch-1.trunc430_ee6.qza --i-table tables/table_subbatch-1.trunc430_ee6.qza --i-reference-sequences 99_otus.qza --p-perc-identity 1 --p-threads 0 --output-dir closedRef_forPICRUSt_subbatch-1
In clustered_table.qzv, it says the Number of features is 1383, I count this as the number of recruited features. For the unrecruited features, I don't know how to directly count them. I used a theoretical maximum number of features by running 99_otus using a very low --p-perc-identity 0.8, so that all reads are clustered (i.e. unmatched_sequences.qza is empty), and subtracted using this number. Please see this table (and a lazy graph) for the results of my benchmarking:
For example, the maximum number of OTUs for 99_otus is 3825 (trial 4), when running with --p-perc-identity 1, the % recruitment is 1383/(1383+2442) = 36%.
The 16% I previously reported was because I counted the number of unrecruited features by exporting unmatched_sequences.qzv to get a dna-sequences.fasta file (please get this by viewing unmatched_sequences.qzv and downloading FASTA from there), then made sure it doesn't contain duplicate sequences, then simply counted the number of sequences, which equals 7319. So the old % recruitment was 1383/(1383+7319) = 16% but that is now obviously wrong (I found out because the sum of number of unrecruited and recruited features is inconsistent across different --p-perc-identity).
You may have noticed in the name of my rep-seqs file that I allowed --p-max-ee 6 in the reads, would this have affected your estimation?
I am using Greengenes 13_8.
In the vsearch command, the --i-reference-sequences is from gg_13_8_otus/rep_set/99_otus.fasta .
This makes sense to me. Also, the fact that reducing the threshold for recruitment increases total recruited reads also makes sense.
Yes, but I'm not sure how much. By allowing reads with more expected errors, you may have 'noisy' OTUs, which would inflate OTU count and also decrease the number of ASVs that recruit to greengenes. I really like stringent max-ee values.
I still don't trust this 'cluster then recruit' method. The largest limitation in PiCRUST is the mismatch between actual ASVs and known microbes in greengenes, but the original pipeline let's you measure this mismatch by calculating NTI (Nearest Taxon Index). The additional recruitment further separates your data from the reference.
Hi @jjmmii, would it be possible to express percentage recruitment in terms of the percentage of sequence recruited? In other words, if my input sequence file has 100 sequences, 75 of which are represented by the database by 10 OTUs, I would consider this a 75% read recruitment. Does that make sense?
Best,
Daniel