Clustering taxonomy sequences

biojack · August 3, 2022, 10:53am

Hi

I wonder whether it is possible to perform 99% OTU clustering of taxonomy-labeled sequences within QIIME.

I found the vsearch tool for a similar purpose, but it requires a frequency table. I don't have such a table and don't need one for this analysis. I have only the original sequences with taxonomy labels.

I would also like the clustering to work as a filter (so that some sequences are merged into a cluster carrying the most representative label, while the others remain unchanged).

The closest thing I found is the "dereplicate" function in RESCRIPt, but it works like 100% clustering; I need the same thing at 99%.

Is this possible within QIIME? If not, could you suggest other software for this?

Thank you for your attention.

SoilRotifer · August 3, 2022, 3:46pm

Hi @biojack,

Everything you need is available within the QIIME 2 documentation. If you search for OTU clustering you'll find:

https://docs.qiime2.org/2022.2/tutorials/otu-clustering/

You can perform OTU clustering on your raw sequences or on your denoised (deblur, DADA2) sequences.

biojack · August 3, 2022, 5:05pm

Hi @SoilRotifer

I read the tutorial, and it seems to cover the two cases I already described:

using dereplicate (which is in fact 100% clustering, not 99%)
using vsearch cluster, which takes a feature frequency table as input; I don't have one and shouldn't need one for my task

Maybe I'm missing something?

biojack · August 3, 2022, 5:08pm

In fact, I need a simple command like

qiime vsearch cluster --i-sequences seqs.qza  --p-perc-identity 0.99 --o-clustered-sequences seqs_NR99.qza

SoilRotifer · August 3, 2022, 5:10pm

The page I linked you to shows this command:

qiime vsearch dereplicate-sequences \
  --i-sequences seqs.qza \
  --o-dereplicated-table table.qza \
  --o-dereplicated-sequences rep-seqs.qza

The table will be made for you. Then you proceed to cluster in the following step:

qiime vsearch cluster-features-de-novo \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table table-dn-99.qza \
  --o-clustered-sequences rep-seqs-dn-99.qza

-Mike

biojack · August 3, 2022, 5:15pm

Ohh.. Thanks! Looks like a workaround.

Interesting, why is there no way to do this in a single command?

Anyway, thanks for your support.

SoilRotifer · August 3, 2022, 5:21pm

This is how vsearch works. Also, it is more efficient to dereplicate the data first and then cluster from that dereplicated set. This matters especially if you’d like to cluster at several different similarity levels, because there is no need to dereplicate each time.
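As a rough sketch of that reuse, the loop below clusters the same dereplicated table and sequences at several identity levels. The input names follow the commands earlier in this thread; the threshold list, the output naming scheme, and the `run`/`cluster_at` helpers are assumptions made for illustration. It prints the commands by default so they can be checked before anything is executed.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 in an environment where QIIME 2 is installed to run them.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Cluster the already-dereplicated data at one identity threshold.
# table.qza and rep-seqs.qza are the outputs of dereplicate-sequences above.
cluster_at() {
  perc="$1"
  label="${perc#0.}"   # 0.99 -> 99, used only to name the output files
  run qiime vsearch cluster-features-de-novo \
    --i-table table.qza \
    --i-sequences rep-seqs.qza \
    --p-perc-identity "$perc" \
    --o-clustered-table "table-dn-${label}.qza" \
    --o-clustered-sequences "rep-seqs-dn-${label}.qza"
}

# Illustrative thresholds only, not a recommendation.
for perc in 0.99 0.97 0.95; do
  cluster_at "$perc"
done
```

Dereplication runs once, and each pass of the loop only repeats the clustering step.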

But I suppose a pipeline for this would be something to consider.

Good luck!

biojack · August 4, 2022, 4:10pm

@SoilRotifer

Just wondering:

I looked at the RESCRIPt dereplicate manual, and it already has a --p-perc-identity parameter. So would it be better to use RESCRIPt for the same purpose instead of vsearch, if it does the same thing?

SoilRotifer · August 4, 2022, 4:31pm

If your intent is to make a reference database based on clustered sequences, then yes, you can use this approach. But if you are simply clustering your reads to generate OTUs for analyses, then you should use vsearch.

In general, we do not recommend clustering reads for the purposes of making a reference database as your ability to correctly assign taxonomy to your reads declines. This is covered in our RESCRIPt manuscript. Usually, clustering reference sequences is performed to reduce the file and memory size of the reference database when computational resources are limited.
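For the reference-database case, a minimal sketch of the RESCRIPt call with `--p-perc-identity` might look like the following. The file names (`seqs.qza`, `taxonomy.qza`, and the outputs) are placeholders, the `run`/`derep_99` helpers are hypothetical, and `--p-mode majority` is an assumption about how the labels of merged sequences should be resolved; check `qiime rescript dereplicate --help` for the modes available in your installed version. The command is printed by default rather than executed.

```shell
#!/bin/sh
# Dry run by default: the command is printed, not executed.
# Set DRY_RUN=0 where QIIME 2 and RESCRIPt are installed to run it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# $1: sequences artifact, $2: taxonomy artifact (placeholder names below).
derep_99() {
  run qiime rescript dereplicate \
    --i-sequences "$1" \
    --i-taxa "$2" \
    --p-mode majority \
    --p-perc-identity 0.99 \
    --o-dereplicated-sequences seqs-derep-99.qza \
    --o-dereplicated-taxa taxa-derep-99.qza
}

derep_99 seqs.qza taxonomy.qza
```

Unlike the vsearch route, this takes the taxonomy as an input and returns a matching dereplicated taxonomy, so no frequency table is involved.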
