How to obtain Greengenes2 sequences clustered at 99% similarity?

wfgui · September 18, 2025, 5:38am

Hi,
I downloaded the Greengenes2 files (2024.09.backbone.full-length.fna.qza, 2024.09.backbone.tax.qza) and used the commands below to build a classifier for the V3-V4 region. Is it necessary to cluster the Greengenes2 sequences at 99% similarity before training the classifier?
The commands are as follows:

qiime feature-classifier extract-reads \
    --i-sequences 2024.09.backbone.full-length.fna.qza \
    --p-f-primer GTGCCAGCMGCCGCGGTAA \
    --p-r-primer GGACTACHVGGGTWTCTAAT \
    --p-min-length 400 \
    --p-max-length 500 \
    --o-reads v34.ref-seqs.qza \
    --p-n-jobs 8
qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads v34.ref-seqs.qza \
    --i-reference-taxonomy 2024.09.backbone.tax.qza \
    --o-classifier 2024.09.v34-classifier.qza

Thanks!

SoilRotifer · September 19, 2025, 6:12pm

Hi @wfgui,

No, it is not necessary. In fact, it is often better to retain all unique reference sequences within the reference database.

If you are looking to save on memory and resource requirements, you can simply run:

qiime rescript dereplicate --p-mode uniq ...
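
For reference, a fuller invocation might look like the sketch below. It assumes you dereplicate the extracted reads (v34.ref-seqs.qza from the question) together with the backbone taxonomy; whether to dereplicate before or after extract-reads is not settled in this thread. The output names (v34.derep-seqs.qza, v34.derep-tax.qza) are hypothetical placeholders. The guard only runs the command when QIIME 2 and both inputs are present, and otherwise just prints it.

```shell
# Inputs taken from the question; outputs are hypothetical names.
SEQS=v34.ref-seqs.qza
TAX=2024.09.backbone.tax.qza

# Assemble the dereplication command (uniq mode keeps every unique sequence).
derep_cmd="qiime rescript dereplicate \
  --i-sequences $SEQS \
  --i-taxa $TAX \
  --p-mode uniq \
  --o-dereplicated-sequences v34.derep-seqs.qza \
  --o-dereplicated-taxa v34.derep-tax.qza"

# Run only if QIIME 2 is on PATH and both inputs exist; otherwise show the command.
if command -v qiime >/dev/null 2>&1 && [ -f "$SEQS" ] && [ -f "$TAX" ]; then
  $derep_cmd
else
  echo "$derep_cmd"
fi
```

If you go this route, the dereplicated sequences and the dereplicated taxonomy would both be passed to fit-classifier-naive-bayes in place of the original pair, so that the feature IDs still match.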

If you really would like to make a 99% "clustered" reference you can do:

qiime rescript dereplicate --p-mode uniq --p-perc-identity 0.99 ...

Again, I'd advise against this. You risk losing classification robustness, because some members of the same family or genus can be erroneously clustered together, even at a 99% similarity threshold.
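
To put a rough number on that risk, here is a small arithmetic sketch. The helper max_mismatches is made up for illustration, and it treats percent identity as a plain mismatch count over the read length, which is a simplification of how alignment-based identity is actually computed. The lengths are the --p-min-length and --p-max-length values from the question.

```shell
# Approximate number of differing bases tolerated inside one cluster.
# Usage: max_mismatches <read length in bp> <percent identity>
max_mismatches() {
  len=$1
  pct=$2
  echo $(( len * (100 - pct) / 100 ))
}

max_mismatches 400 99   # prints 4
max_mismatches 500 99   # prints 5
```

So two reference sequences that differ at only a handful of positions in the amplicon would collapse into a single entry at 99%, whether or not they carry the same taxonomy.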

The historical reason for pre-generating clustered reference databases at 99%, 94%, etc. was to let users run classification on machines with limited resources.
