Clustering taxonomy sequences


I wonder whether it is possible to cluster taxonomy-labeled sequences into OTUs at 99% identity within QIIME 2.

I found the vsearch tool for a similar purpose, but it requires a frequency table. I don't have such a table and don't need one for this analysis. I only have the original sequences with taxonomy labels.

I would also like the clustering to work as a filter (so that some sequences are merged into a cluster carrying the most representative label, while the others remain unchanged).

The closest thing I found is the "dereplicate" function in RESCRIPt, but it works like 100% clustering, and I need the same thing at 99%.

Is this possible within QIIME 2? If not, could you suggest software that can do it?

Thank you for your attention.

Hi @biojack,

Everything you need is available within the QIIME 2 documentation. If you search for OTU clustering you'll find:

You can perform OTU clustering on your raw sequences or on your denoised (Deblur, DADA2) sequences.

Hi @SoilRotifer

I read the tutorial, and it seems to cover the two cases I already described:

  1. using dereplicate (which is in effect 100% clustering, not 99%)
  2. using vsearch cluster, which takes a feature frequency table as input, which I don't have and shouldn't need for my task

Maybe I am missing something?

In fact, I need a simple command like

qiime vsearch cluster --i-sequences seqs.qza  --p-perc-identity 0.99 --o-clustered-sequences seqs_NR99.qza

The page I linked you to shows this command:

qiime vsearch dereplicate-sequences \
  --i-sequences seqs.qza \
  --o-dereplicated-table table.qza \
  --o-dereplicated-sequences rep-seqs.qza

The table will be made for you. Then you proceed to cluster in the following step:

qiime vsearch cluster-features-de-novo \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table table-dn-99.qza \
  --o-clustered-sequences rep-seqs-dn-99.qza



Ohh.. Thanks! Looks like a workaround.

Interesting, why is there no way to do this in a single command? :thinking:

Anyway, thanks for your support.


This is how vsearch works. Also, it is more efficient to dereplicate the data first and then cluster the dereplicated set. This matters especially if you’d like to cluster at several different similarity levels, because there is then no need to dereplicate each time.

But I suppose a pipeline for this would be something to consider. :thinking:
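
In the meantime, a rough sketch of such a pipeline as a shell script could look like the one below. It dereplicates once and then reuses that output for each identity level. The file names are the ones used earlier in this thread; the list of identity levels (0.99 and 0.97), the `run` and `cluster_at` helpers, and the `DRY_RUN` switch are my own assumptions for illustration, not part of QIIME 2. By default the script only prints the commands, so set `DRY_RUN=0` to actually execute them.

```shell
#!/usr/bin/env bash
# Sketch: dereplicate once, then cluster at several identity levels.
# Helper names and the DRY_RUN switch are hypothetical, not part of QIIME 2.
set -eu

# DRY_RUN=1 (default) prints each command; DRY_RUN=0 executes it
# (execution requires a QIIME 2 environment with the vsearch plugin).
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Cluster the dereplicated features at one identity level, e.g. 0.99.
cluster_at() {
  local identity="$1"
  local tag="${identity#0.}"   # strip the leading "0." so 0.99 becomes 99
  run qiime vsearch cluster-features-de-novo \
    --i-table table.qza \
    --i-sequences rep-seqs.qza \
    --p-perc-identity "$identity" \
    --o-clustered-table "table-dn-${tag}.qza" \
    --o-clustered-sequences "rep-seqs-dn-${tag}.qza"
}

# Step 1: dereplicate once. This also creates the feature table.
run qiime vsearch dereplicate-sequences \
  --i-sequences seqs.qza \
  --o-dereplicated-table table.qza \
  --o-dereplicated-sequences rep-seqs.qza

# Step 2: reuse the dereplicated output for every identity level.
for identity in 0.99 0.97; do
  cluster_at "$identity"
done
```

Each level writes its own pair of outputs (for example `table-dn-97.qza` and `rep-seqs-dn-97.qza`), so adding another level later only repeats the clustering step.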

Good luck! :smile:



Just wondering:

I looked in the RESCRIPt dereplicate manual and there is already a --p-perc-identity parameter. So would it be better to use RESCRIPt for this purpose instead of vsearch, if it does the same thing?

If your intent is to make a reference database based on clustered sequences, then yes you can use this approach. But if you are simply clustering your reads to generate OTUs for analyses, then you should use vsearch.

In general, we do not recommend clustering reads for the purposes of making a reference database as your ability to correctly assign taxonomy to your reads declines. This is covered in our RESCRIPt manuscript. Usually, clustering reference sequences is performed to reduce the file and memory size of the reference database when computational resources are limited.
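
For the reference-database case, a minimal sketch of the RESCRIPt call might look like the following. The input and output file names are placeholders, and choosing the `majority` mode is only my guess at what "most representative label" would map to; please check `qiime rescript dereplicate --help` for the options available in your installed version. The `run` and `derep_refs` helpers and the `DRY_RUN` switch are hypothetical and only there so the command can be previewed before running it.

```shell
#!/usr/bin/env bash
# Sketch: cluster taxonomy-labeled reference sequences with RESCRIPt.
# File names, helper names, and the DRY_RUN switch are assumptions.
set -eu

# DRY_RUN=1 (default) prints the command; DRY_RUN=0 executes it
# (execution requires a QIIME 2 environment with RESCRIPt installed).
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Dereplicate reference sequences and their taxonomy at a given identity.
derep_refs() {
  local identity="$1"
  run qiime rescript dereplicate \
    --i-sequences ref-seqs.qza \
    --i-taxa ref-taxa.qza \
    --p-mode majority \
    --p-perc-identity "$identity" \
    --o-dereplicated-sequences ref-seqs-derep.qza \
    --o-dereplicated-taxa ref-taxa-derep.qza
}

derep_refs 0.99
```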

