I am working with the V3-4 region of human gut microbiota. It is a single paired-end sample that I am practicing on. I have denoised the sample and now I want to cluster it.
The QIIME 2 documentation mentions that clustering at a given % identity requires the denoised data, which I have, and a reference database clustered at the same % identity. I want to use the SILVA database.
To process the database, I will follow the RESCRIPt tutorial up to this step in the first part, which prepares the SILVA database:
qiime rescript dereplicate \
--i-sequences silva-138.1-ssu-nr99-seqs-515f-806r.qza \
--i-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza \
--p-mode 'uniq' \
--o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-515f-806r-uniq.qza \
This would give me the dereplicated database, is this correct?
Can I use a dereplicated database instead of a clustered one for clustering?
If I want to cluster my database at an 85% identity threshold, what command would I use?
Can I use the 99% SILVA database to cluster at 85%?
There is no need to cluster your reference sequences to match the clustering of your data, for reasons previously mentioned within these threads:
Anyway, the old way of pre-clustering reference reads to 97% or 94% sequence similarity (as done with Greengenes and SILVA) was simply a logistical way to reduce the size of the reference database even further, so that it would run on computers with limited memory and CPU power.
If possible, it is best to use reference reads that have been minimally clustered or simply dereplicated, i.e. at 99% or 100% similarity. You will be able to classify your reads more accurately, because you have more reference data to use.
So, if you are using reference reads that are clustered at 99% to 100%, you can use them to classify reads that are clustered at other similarities, e.g. 97% or 94%.
In fact, some consider classifying 97% clustered OTUs against a 97% clustered reference database not a great idea, because your nearest reference sequence can be up to 6% away (up to 3% within your OTU plus up to 3% within the reference cluster), which further reduces classification accuracy.
Does this help?
In general, we do not recommend clustering reads for the purpose of making a reference database, as your ability to correctly assign taxonomy to your reads declines. This is covered in our RESCRIPt manuscript. Usually, clustering reference sequences is performed to reduce the file and memory size of the reference database when computational resources are limited.
Simply add the parameter --p-perc-identity 0.85 to that command.
Note: I would not recommend using 85%. It is only provided as an example of a setting that drastically reduces the database size.
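For illustration only, here is a minimal sketch of what the command from the question might look like with that parameter added. The input file names are copied from the question; both output file names are hypothetical, chosen here so the 85% result does not overwrite the dereplicated files. The sketch only prints the assembled command so it can be inspected first.

```shell
#!/usr/bin/env bash
# Sketch only: the two output names below are hypothetical placeholders.
# 0.85 is the value asked about in the question, not a recommended setting.
cmd=(
  qiime rescript dereplicate
  --i-sequences silva-138.1-ssu-nr99-seqs-515f-806r.qza
  --i-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza
  --p-mode uniq
  --p-perc-identity 0.85
  --o-dereplicated-sequences silva-seqs-515f-806r-85.qza
  --o-dereplicated-taxa silva-tax-515f-806r-85.qza
)

# Dry run: print the assembled command for review.
# To actually run it inside a QIIME 2 environment, use: "${cmd[@]}"
printf '%s\n' "${cmd[*]}"
```

Leaving --p-perc-identity out keeps plain dereplication, which is the option recommended in this thread.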
Yes, you can classify your 85% clustered reads against the 99% SILVA database.
Again, I'd avoid 85%. Also, the content I linked for RE 2 applies here too.
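For the closed-reference clustering step asked about in the original post, the following is a rough sketch of how the dereplicated SILVA sequences could be passed as the reference without clustering them first. It assumes the q2-vsearch plugin is available; rep-seqs.qza, table.qza, and the three output names are hypothetical placeholders for your own denoised data. The 0.85 threshold is taken from the question and is not a recommendation.

```shell
#!/usr/bin/env bash
# Sketch only: rep-seqs.qza and table.qza stand in for your denoised outputs,
# and the three output names are hypothetical.
cmd=(
  qiime vsearch cluster-features-closed-reference
  --i-sequences rep-seqs.qza
  --i-table table.qza
  --i-reference-sequences silva-138.1-ssu-nr99-seqs-515f-806r-uniq.qza
  --p-perc-identity 0.85
  --o-clustered-table table-cr-85.qza
  --o-clustered-sequences rep-seqs-cr-85.qza
  --o-unmatched-sequences unmatched-cr-85.qza
)

# Dry run: print the assembled command for review.
# To actually run it inside a QIIME 2 environment, use: "${cmd[@]}"
printf '%s\n' "${cmd[*]}"
```

The reference here is the dereplicated file from the question, so only the query threshold changes if you decide on a higher identity such as 0.97 or 0.99.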