Reference database for clustering

Hello,

I am working with the V3-V4 region of human gut microbiota. I am practicing on a single paired-end sample. I have denoised the sample and now want to cluster it.

The QIIME 2 documentation mentions that clustering at a given % identity requires the denoised data, which I have, and a reference database clustered at the same % identity. I want to use the SILVA database.

  1. To process the database, I will follow the RESCRIPt tutorial up to this step of the first part, which prepares the SILVA database:
qiime rescript dereplicate \
    --i-sequences silva-138.1-ssu-nr99-seqs-515f-806r.qza \
    --i-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza \
    --p-mode 'uniq' \
    --o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-515f-806r-uniq.qza \
    --o-dereplicated-taxa  silva-138.1-ssu-nr99-tax-515f-806r-derep-uniq.qza

This would give me the dereplicated database. Is this correct?

  2. Can I use a dereplicated database instead of a clustered one for clustering?
  3. If I want to cluster my database at an 85% identity threshold, which command should I use?
  4. Can I use the 99% SILVA database to cluster at 85%?

Thank you!

Hi @Rakaya,

RE 1:

Yep.

RE 2:

There is no need to cluster your reference sequences to match the clustering of your data, for reasons previously mentioned within these threads:

Additionally:

RE 3:

Simply add the parameter --p-perc-identity 0.85 to that command. :warning: I would not recommend using 85%. It is given here only as an example of a threshold that drastically reduces the database size.
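
For reference, here is a minimal sketch of what that could look like: the dereplicate command from the question with --p-perc-identity added. The file names are the ones from the question; the PERC_IDENTITY variable and the guard around the call are my own additions for illustration, so the snippet does nothing harmful when QIIME 2 or the input files are missing. With a threshold below 1.0 you may want different output names, which I have left unchanged here.

```shell
# Identity threshold for collapsing reference sequences.
# 0.85 is only the example value from this thread, not a recommendation.
PERC_IDENTITY=0.85

# Input artifacts, named as in the question above.
SEQS=silva-138.1-ssu-nr99-seqs-515f-806r.qza
TAXA=silva-138.1-ssu-nr99-tax-derep-uniq.qza

# Only run when QIIME 2 (with RESCRIPt) and both inputs are available.
if command -v qiime >/dev/null 2>&1 && [ -f "$SEQS" ] && [ -f "$TAXA" ]; then
    qiime rescript dereplicate \
        --i-sequences "$SEQS" \
        --i-taxa "$TAXA" \
        --p-mode 'uniq' \
        --p-perc-identity "$PERC_IDENTITY" \
        --o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-515f-806r-uniq.qza \
        --o-dereplicated-taxa silva-138.1-ssu-nr99-tax-515f-806r-derep-uniq.qza
else
    echo "Skipping: qiime or the input artifacts were not found." >&2
fi
```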

RE 4:

Yes, you can classify your 85% clustered reads against the 99% SILVA database. Again, I'd avoid 85%. Also, the content I linked for RE 2 applies here too.
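
If what you are after is closed-reference clustering, as in your original question, a rough sketch with q2-vsearch might look like the following. I am assuming your denoising outputs are called table.qza and rep-seqs.qza, and the output names are made up as well; only the reference file name comes from this thread. The guard is there so the snippet is skipped when QIIME 2 or the files are not present.

```shell
# Clustering threshold; 0.85 is only the example value from this thread.
PERC_IDENTITY=0.85

# Reference sequences: the dereplicated SILVA artifact from the question.
REF_SEQS=silva-138.1-ssu-nr99-seqs-515f-806r-uniq.qza

# Hypothetical names for the denoised feature table and sequences.
TABLE=table.qza
REP_SEQS=rep-seqs.qza

# Only run when QIIME 2 and all three inputs are available.
if command -v qiime >/dev/null 2>&1 \
    && [ -f "$TABLE" ] && [ -f "$REP_SEQS" ] && [ -f "$REF_SEQS" ]; then
    qiime vsearch cluster-features-closed-reference \
        --i-table "$TABLE" \
        --i-sequences "$REP_SEQS" \
        --i-reference-sequences "$REF_SEQS" \
        --p-perc-identity "$PERC_IDENTITY" \
        --o-clustered-table table-cr-85.qza \
        --o-clustered-sequences rep-seqs-cr-85.qza \
        --o-unmatched-sequences unmatched-cr-85.qza
else
    echo "Skipping: qiime or the input artifacts were not found." >&2
fi
```

Reads that do not match the reference at the chosen threshold end up in the unmatched-sequences output rather than in the clustered table.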

Thank you for the reply, it explained a lot of things I did not know.
I do, however, have another question as I try to cluster.
Is it better to cluster against a full-length dereplicated database or an amplicon-region-specific classifier?

I personally like to use the amplicon-specific classifier, for reasons outlined by Werner et al. 2011.
