I wonder if there is a way to perform OTU clustering at 99% on sequences with taxonomy labels within QIIME.
I found the vsearch tool for a similar purpose, but it requires a frequency table. I don't have such a table and don't need one for this analysis. I have only the original sequences with taxonomy labels.
I would also like the clustering to work as a filter (so that some sequences are merged into a cluster carrying the most representative label, while the others remain unchanged).
The closest thing I found is the "dereplicate" function in RESCRIPt, but it works like 100% clustering, and I need the same thing at 99%.
Is this possible within QIIME? If not, could you suggest other software for this?
Thank you for your attention.
Everything you need is available within the QIIME 2 documentation. If you search for OTU clustering you'll find:
You can perform OTU clustering on your raw sequences or on your denoised (Deblur, DADA2) sequences.
I read the tutorial, and it seems to cover the two cases I already described:
- using dereplicate (which is in fact 100% clustering, not 99%)
- using vsearch cluster, which takes a feature frequency table as input; I don't have one and shouldn't need one for my task
Maybe I'm missing something?
What I need is a simple command like
qiime vsearch cluster --i-sequences seqs.qza --p-perc-identity 0.99 --o-clustered-sequences seqs_NR99.qza
The page I linked you to shows this command:
qiime vsearch dereplicate-sequences
The table will be made for you. Then you proceed to cluster in the following step:
qiime vsearch cluster-features-de-novo
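To make "the table will be made for you" concrete, here is a minimal Python sketch of what dereplication produces conceptually: a set of unique sequences plus a per-sample count table. It is a toy illustration only, not how vsearch or QIIME 2 implement it; the function name `dereplicate` and the sample IDs and reads are made up for the example.

```python
from collections import Counter, defaultdict

def dereplicate(reads):
    """Collapse identical reads.

    reads: iterable of (sample_id, sequence) pairs (hypothetical input format).
    Returns (unique_seqs, table), where table[seq][sample_id] is a count.
    """
    table = defaultdict(Counter)
    for sample_id, seq in reads:
        table[seq.upper()][sample_id] += 1
    # Order unique sequences by total abundance (descending), then alphabetically.
    unique_seqs = sorted(table, key=lambda s: (-sum(table[s].values()), s))
    return unique_seqs, {s: dict(table[s]) for s in unique_seqs}

# Made-up reads: three copies of one sequence, one copy of another.
reads = [
    ("sampleA", "ACGTACGT"),
    ("sampleA", "ACGTACGT"),
    ("sampleB", "ACGTACGT"),
    ("sampleB", "ACGTACGA"),
]

unique_seqs, table = dereplicate(reads)
for seq in unique_seqs:
    print(seq, table[seq])
```

The unique sequences and the count table are the two inputs the clustering step then works from.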
Oh, thanks! That looks like a workaround.
It is interesting that there is no way to do this in a single command.
Anyway, thanks for your support.
This is how vsearch works. Also, it is more efficient to dereplicate the data first, then cluster from that dereplicated set. This is especially true if you'd like to cluster at several different similarity levels, since there is then no need to dereplicate each time.
But I suppose a pipeline for this would be something to consider.
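As a rough illustration of why one dereplicated set can be reused at several similarity levels, here is a toy greedy centroid clustering in Python. It is only a sketch under simplifying assumptions: identity is the fraction of matching positions between equal-length sequences (vsearch uses pairwise alignment), each sequence joins the first centroid that passes the threshold, and the sequences and abundances are invented.

```python
def identity(a, b):
    """Fraction of matching positions; assumes equal-length, unaligned sequences."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cluster(counts, threshold):
    """Greedy clustering of a dereplicated set.

    counts: dict mapping unique sequence -> abundance.
    Returns dict mapping centroid sequence -> summed abundance of its cluster.
    """
    clusters = {}
    # Visit sequences from most to least abundant, so abundant ones become centroids.
    for seq in sorted(counts, key=lambda s: (-counts[s], s)):
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid] += counts[seq]
                break
        else:
            clusters[seq] = counts[seq]  # no centroid is close enough: start a new cluster
    return clusters

# Invented 100-nt sequences: one mismatch (99% identity) and three mismatches (97%).
base = "ACGT" * 25
one_off = base[:-1] + "A"
three_off = "TTT" + base[3:]
counts = {base: 10, one_off: 3, three_off: 2}

# The same dereplicated input is reused for every threshold.
for threshold in (1.0, 0.99, 0.97):
    print(threshold, len(cluster(counts, threshold)), "clusters")
```

Dereplication happens once, and only the comparatively cheap threshold step is repeated.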
I looked at the RESCRIPt dereplicate manual, and there is already a --p-perc-identity parameter. So would it be better to use RESCRIPt for this purpose instead of vsearch, if it does the same thing?
If your intent is to make a reference database based on clustered sequences, then yes, you can use this approach. But if you are simply clustering your reads to generate OTUs for analyses, then you should use vsearch.
In general, we do not recommend clustering reads for the purposes of making a reference database as your ability to correctly assign taxonomy to your reads declines. This is covered in our RESCRIPt manuscript. Usually, clustering reference sequences is performed to reduce the file and memory size of the reference database when computational resources are limited.
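One way to see why taxonomic resolution can drop: when reference sequences with different labels fall into the same cluster, the cluster needs a single label, and one common choice is the lowest common ancestor (LCA) of the members. The Python sketch below is only an illustration of that idea, not RESCRIPt's implementation; the function name `lca` and the taxonomy strings are illustrative assumptions.

```python
def lca(taxonomies):
    """Return the rank labels shared, from the root down, by all taxonomy strings.

    taxonomies: list of semicolon-delimited taxonomy strings.
    """
    split = [[rank.strip() for rank in t.split(";")] for t in taxonomies]
    shared = []
    for ranks in zip(*split):
        if len(set(ranks)) != 1:
            break  # members disagree at this rank, so stop here
        shared.append(ranks[0])
    return "; ".join(shared)

# Illustrative labels for two reference sequences that ended up in one cluster.
cluster_members = [
    "k__Bacteria; p__Firmicutes; g__Bacillus; s__cereus",
    "k__Bacteria; p__Firmicutes; g__Bacillus; s__anthracis",
]

print(lca(cluster_members))  # species-level information is lost
```

In this toy case the cluster can only be labeled to genus, so any read matching it can no longer be assigned to species.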