need some help with vsearch open reference

Hi, Qiimers,

While I normally use ASVs from Deblur, I would like to have a look at OTUs clustered at 98 and 95% similarity. Following tutorials, I’ve put together some scripts, but I need some help to fix them. I am not sure right now how to link the clustering step to the GreenGenes database that I am using. Here is an example:

qiime vsearch cluster-features-open-reference
–i-sequences …/step.02.a.deblur.repseqs.qza
–i-table …/step.02.a.deblur.table.qza
–i-reference-sequences path2files/ref.otus.99.qza
–p-perc-identity 0.95
–p-strand both
–o-clustered-table otu.openClust.95.table.qza
–o-clustered-sequences otu.openClust.95.seqs.qza
–o-new-reference-sequences newRefSeqs.openClust.95.qza

My issue is that the GreenGenes database only has otus clustered at 94, 97 and 99. I was wondering how proper it would be to associate --p-perc-identity 0.95 with the closest reference OTUs (e.g. --i-reference-sequences path2files/ref.otus.94.qza), or should I rather go down the de novo clustering route. I would then still assign taxonomies using the most similar reference dataset (i.e. reference 97% OTUs for 98% clustering, and 94% OTUs for 95% clustering).

Thank you for your kind attention, and I look forward to your feedback,

Hi @mstagliamonte ,
I guess this kind of depends on what the purpose of your analysis/clustering is.
When you say you want to look at 98/95% clustering do you mean you want your query reads to be clustered at 98/95% identity or you want to compare your reads against 98/95% clustered GG database?

My guess is you are interested in how clustering affects your data, in which case I would recommend using the 99% GG reference database as reference and simply set perc identity to 95 (as you have already). This will give you the best resolution of clustering your reads at 95% identity against the maximum number of representative sequences within the GG database.

The taxonomy file you use downstream should always match your reference sequences, so if you use 99% GG, then use 99% taxonomy.

I hope I understood your question correctly, please correct me if I missed it.

1 Like

Hi, @Mehrbod_Estaki ,

Thank you for your kind answer. You are correct, I am interested in the clustering of my data, and then assigning taxonomy to the OTUs I get. What prompt me to write this post is that from the two iterations (95% and 98%), I got almost the same number of OTUs: 1536 from the 95% and 1538 from the 98%. Of these, by name, 1517 are present in both datasets. This seemed odd to me. Here is the full pipeline:

qiime vsearch cluster-features-open-reference \

--i-sequences ../step.02.a.deblur.repseqs.qza \
--i-table ../step.02.a.deblur.table.qza \
--i-reference-sequences path2Files/GreenGenes13_8_v4/ref.otus.99.qza \
--p-perc-identity 0.95 \
--p-strand both \
--o-clustered-table otu.openClust.95.table.qza \
--o-clustered-sequences otu.openClust.95.seqs.qza \
--o-new-reference-sequences newRefSeqs.openClust.95.qza

qiime feature-classifier classify-sklearn \

--i-classifier path2Files/GreenGenes13_8_v4/ \
--i-reads otu.openClust.95.seqs.qza \
--o-classification step.07.95.taxonomy.qza

mkdir biomresults95

qiime tools export \

--input-path otu.openClust.95.table.qza \
--output-path biomresults95

mv biomresults95/feature-table.biom biomresults95/feature-table95.biom

biom convert -i biomresults95/feature-table95.biom -o biomresults95/feature-table95.txt --to-tsv

The same script was run with the 98% counterpart. What do you think about the overlapping OTUs?

Looking froward to your feedback,

Edit: I am not sure why not all backslashes are appearing, but they are present in the original code where necessary.

1 Like

Hi @mstagliamonte ,
Thanks for the updates. I’m not sure I see anything wrong here to be honest. Since you are using open clustering and not de novo I imagine what is happening is that a) your sequences are well covered by the GG database and b) it just so happens that at the region you have sequenced, most of your clusters are mapping back to the same reference clusters in the 99% GG, regardless of setting 95 or 98% identity threshold.
I’m going to guess that the handful of OTUs that are not shared mostly have de novo feature IDs rather than inheriting GG IDs?
I made this little diagram in case it helps with putting the pieces together and show an example of how you can end up with mainly the same OTU table with lots of overlap. Hope it helps

update: a little error in the diagram that I fixed


Hi, @Mehrbod_Estaki ,

Thank you again for your feedback, I appreciate it. My doubts came from the fact that in the past I was using a different pipeline, and I was getting clearly different number of OTUs between 95 and 98% clustering. Therefore, seeing such similarity when using Qiime 2 is quite surprising. I need to do some more homework on the topic I guess, and give a more in-depth look at what is happening under the hood.


1 Like

Hi @mstagliamonte ,

Always a good idea! Do let us know if you find anything interesting, I agree that if you saw very drastic differences between two pipelines it certainly warrants a closer look. Keep us posted.