Picrust support?

Thanks, @jairideout!

Gave it a try -

!qiime tools import \
--input-path /gg_13_5_otus/rep_set/99_otus.fasta \
--output-path gg_13_5_otu_99.qza \
--type 'FeatureData[Sequence]'
!qiime vsearch cluster-features-closed-reference \
--i-sequences rep-seqs-deblur.qza \
--i-table table_deblur.qza \
--i-reference-sequences gg_13_5_otu_99.qza \
--p-perc-identity 1 \
--p-threads 0 \
--output-dir closedRef_forPICRUSt

EDIT - seems to have worked well. I got an OTU table with GG OTU IDs.
There’s a set of sequences that were not matched. I suppose this is because I used 99% and per-iden 1 (most likely due to some taxon filtering). Going to feed these to PICRUSt and see what I get.

Thanks again!


Went through PICRUSt with no issues, thanks @jairideout.

a suggestion - it would be helpful to maintain the taxon id and the closed ref OTU ID (form closed red clustering against GG 99%), and maybe integrate to the deblur/dada2 pipeline (at least as a workaround until there’s a proper picrust plugin) so that the feature table has both IDs thorughout the analysis.

1 Like

Excellent! I didn’t realize that qiime vsearch supported a mode to expand an existing feature table. I’ve previously recommended performing an expansion using the otu_map.txt from QIIME1’s pick_closed_reference_otus.py step.

1 Like

Yea, I couldn’t get it to work back then, not sure what I was missing. Either way, that suggestion definitely inspired this question, so thanks!

1 Like

That’s great news! I think you’re one of the first QIIME 2 users to try out this workflow (ASVs --> closed-reference --> PICRUSt) :tada:

That’s an interesting idea! To be honest, this workaround would be pretty low priority for us to implement (at least within the next few months) – it’d probably be worth having someone implement a PICRUSt plugin in the meantime. If you or anyone else is interested in implementing this plugin we have a Plugin Developer’s Guide to get you started!

1 Like

Not at the level of writing plugins just yet. Still trying how to use the Artifact API instead of the command line interface.

Thanks for the help!


Hi @Gil_Sharon, have you tried adjusting the --p-perc-identity and got more matched sequences? I followed your example (99 otu, --p-perc-identity 1) on a real dataset, and got less than 16% of features matched to the closed OTUs. I am adjusting the --p-perc-identity now, but would like to hear your updates about this.

Hi @jjmmii, no. In a sense, it might beat the purpose. I think using the 97% otus would be a better compromise.

This might also be an issue of choosing the right reference database. In my (now dated) experience, GG isn’t always the best choice. Silva might work better for you.

Good luck!

1 Like

@jjmmii, out of curiosity, what environment were the samples sourced from? That’s quite a low read recruitment.

Hi @wasade, the environment is human feces. Yes, I wonder if this is considered normal to have 16% features matching exactly the sequences in 99 otu file.

I agree with @Gil_Sharon that adjusting --p-perc-identity especially to anything lower than the clustering threshold of the otu file might “in a sense beat the purpose”, that is, using 99otu in this case means --p-perc-identity shouldn’t be less than 0.99, right?

That is quite surprising. Normally, I’d expect in and around 85% of the reads to recruit to Greengenes if they’re from human fecal, similarly for SILVA. Would it be possible to share a subset of the sequences which fail to recruit? Related, and if you have bandwidth, could you rerun 97% against the 97% OTUs and report the percentage which recruit?



Other folks have also had trouble getting the DADA2 to closed-ref OTU conversion to work well. One solution is described in this GitHub repo. They even repeat the initial validation to show that it works.

Let me know if that helps,


Thank you @wasade and @colinbrislawn for your comments. Will definitely try DADA2 -> ASR but still like to know what happened to my vsearch run.

@wasade, I made a mistake in counting the % recruitment. I reran several parameters and my updated stats are described below:

Firstly, my data is here: (link removed) (730.3 KB) And here are my clustered_table.qzv (343.8 KB) and unmatched_sequences.qzv (1.5 MB) from my Greengenes 99otu --p-perc-identity 1 run. My vsearch command is
qiime vsearch cluster-features-closed-reference --i-sequences rep-seqss/rep-seqs_subbatch-1.trunc430_ee6.qza --i-table tables/table_subbatch-1.trunc430_ee6.qza --i-reference-sequences 99_otus.qza --p-perc-identity 1 --p-threads 0 --output-dir closedRef_forPICRUSt_subbatch-1

In clustered_table.qzv, it says the Number of features is 1383, I count this as the number of recruited features. For the unrecruited features, I don't know how to directly count them. I used a theoretical maximum number of features by running 99_otus using a very low --p-perc-identity 0.8, so that all reads are clustered (i.e. unmatched_sequences.qza is empty), and subtracted using this number. Please see this table (and a lazy graph) for the results of my benchmarking:

For example, the maximum number of OTUs for 99_otus is 3825 (trial 4), when running with --p-perc-identity 1, the % recruitment is 1383/(1383+2442) = 36%.

The 16% I previously reported was because I counted the number of unrecruited features by exporting unmatched_sequences.qzv to get a dna-sequences.fasta file (please get this by viewing unmatched_sequences.qzv and downloading FASTA from there), then made sure it doesn't contain duplicate sequences, then simply counted the number of sequences, which equals 7319. So the old % recruitment was 1383/(1383+7319) = 16% but that is now obviously wrong (I found out because the sum of number of unrecruited and recruited features is inconsistent across different --p-perc-identity).

You may have noticed in the name of my rep-seqs file that I allowed --p-max-ee 6 in the reads, would this have affected your estimation?
I am using Greengenes 13_8.
In the vsearch command, the --i-reference-sequences is from gg_13_8_otus/rep_set/99_otus.fasta .

1 Like

This makes sense to me. Also, the fact that reducing the threshold for recruitment increases total recruited reads also makes sense.

Yes, but I’m not sure how much. By allowing reads with more expected errors, you may have ‘noisy’ OTUs, which would inflate OTU count and also decrease the number of ASVs that recruit to greengenes. I really like stringent max-ee values.

I still don’t trust this ‘cluster then recruit’ method. The largest limitation in PiCRUST is the mismatch between actual ASVs and known microbes in greengenes, but the original pipeline let’s you measure this mismatch by calculating NTI (Nearest Taxon Index). The additional recruitment further separates your data from the reference.

Hi @jjmmii, would it be possible to express percentage recruitment in terms of the percentage of sequence recruited? In other words, if my input sequence file has 100 sequences, 75 of which are represented by the database by 10 OTUs, I would consider this a 75% read recruitment. Does that make sense?


2 replies have been split into a new topic: Importing data for closed-reference

Sure @wasade! Sorry for late reply. It looks like this now:

I'll do the same for SILVA, please wait.

1 Like

68% for the 97% similarity to the 97% OTUs seems a bit low but is plausible. Are there other host characteristics that you may be able to share (e.g., age range, country of origin, disease, etc)?


but is plausible

Happy to hear that :slight_smile:

The data is from the iHMP prediabetes project https://www.hmpdacc.org/ihmp/ . The subjects are from various countries and have symptoms of onset of type 2 diabetes. Can’t provide more details from restricted metadata which requires permission to access.

Thank you.

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.