Reference database for clustering


I am working with the human gut microbiota V3-4 region. It is a single paired-end sample that I am practicing on. I have denoised the sample and now I want to cluster it.

The QIIME 2 documentation mentions that clustering at a certain % identity requires the denoised data, which I have, and a reference database clustered at the same % identity. I want to use the SILVA database.

  1. To process the database, I will follow the RESCRIPt tutorial up to this step in the first part, which prepares the SILVA database:
qiime rescript dereplicate \
    --i-sequences silva-138.1-ssu-nr99-seqs-515f-806r.qza \
    --i-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza \
    --p-mode 'uniq' \
    --o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-515f-806r-uniq.qza \
    --o-dereplicated-taxa  silva-138.1-ssu-nr99-tax-515f-806r-derep-uniq.qza

After this step I would have the dereplicated database. Is this correct?

  2. Can I use a dereplicated database instead of a clustered one for clustering?
  3. If I want to cluster my database at an 85% identity threshold, what command should I use?
  4. Can I use the 99% SILVA database to cluster at 85%?

Thank you!

Hi @Rakaya,

RE 1:


RE 2:

There is no need to cluster your reference sequences to match the clustering of your data, for reasons previously mentioned within these threads:


RE 3:

Simply add the parameter --p-perc-identity 0.85 to that command. :warning: I would not recommend using 85%. It is given here only as an example of a setting that drastically reduces the database size.
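
As a rough sketch, the command from your post with that parameter added might look like the following. The input file names are copied from your command. The "-85" suffix on the output names is my own placeholder, and the `command -v` guard is only there so the snippet exits gracefully outside a QIIME 2 environment.

```shell
# Example value only; the reply above advises against going as low as 85%.
PERC_IDENTITY=0.85

# Hypothetical output names: the "-85" suffix is a placeholder, not from the tutorial.
OUT_SEQS=silva-138.1-ssu-nr99-seqs-515f-806r-uniq-85.qza
OUT_TAXA=silva-138.1-ssu-nr99-tax-515f-806r-derep-uniq-85.qza

if command -v qiime >/dev/null 2>&1; then
  # Same dereplicate call as in the question, plus --p-perc-identity.
  qiime rescript dereplicate \
    --i-sequences silva-138.1-ssu-nr99-seqs-515f-806r.qza \
    --i-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza \
    --p-mode 'uniq' \
    --p-perc-identity "$PERC_IDENTITY" \
    --o-dereplicated-sequences "$OUT_SEQS" \
    --o-dereplicated-taxa "$OUT_TAXA"
else
  echo "qiime not found: activate your QIIME 2 environment first" >&2
fi
```

Keeping the threshold in one variable makes it easy to rerun with a less aggressive value such as 0.97 or 0.99.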

RE 4:

Yes, you can classify your 85% clustered reads against the 99% SILVA database. Again, I'd avoid 85%. Also, the content I linked for RE 2 applies here too.
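
If the goal is closed-reference clustering of your denoised features, a minimal sketch that uses the 99% SILVA reference as it is, without re-clustering it first, could look like this. Note the assumptions: `table.qza` and `rep-seqs.qza` are placeholder names for your denoising outputs, the reference file name is simply reused from your dereplicate command, and the output names are made up for illustration.

```shell
# Threshold from the question; as noted above, 85% is not recommended.
PERC_IDENTITY=0.85

# Placeholder names for the denoised feature table and representative sequences.
TABLE=table.qza
REP_SEQS=rep-seqs.qza

# Reference reused from the question's dereplicate output (assumption).
REF_SEQS=silva-138.1-ssu-nr99-seqs-515f-806r-uniq.qza

if command -v qiime >/dev/null 2>&1; then
  # Cluster denoised features against the reference at the chosen identity.
  qiime vsearch cluster-features-closed-reference \
    --i-table "$TABLE" \
    --i-sequences "$REP_SEQS" \
    --i-reference-sequences "$REF_SEQS" \
    --p-perc-identity "$PERC_IDENTITY" \
    --o-clustered-table table-cr.qza \
    --o-clustered-sequences rep-seqs-cr.qza \
    --o-unmatched-sequences unmatched-cr.qza
else
  echo "qiime not found: activate your QIIME 2 environment first" >&2
fi
```

Features that match no reference sequence at the chosen identity are written to the unmatched-sequences output rather than to the clustered table.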