Some questions about this tutorial

I am teaching QIIME 2 using the tutorial here ( I would like to check something about the "Taxonomic analysis" section.

Here is what the tutorial says: "This classifier was trained on the Greengenes 13_8 99% OTUs, where the sequences have been trimmed to only include 250 bases from the region of the 16S that was sequenced in this analysis (the V4 region, bound by the 515F/806R primer pair). We’ll apply this classifier to our sequences, and we can generate a visualization of the resulting mapping from sequence to taxonomy."

1> 515F/806R refers to the Earth Microbiome Project’s 16S rRNA primers, which target the V4 region. If I use other primers, or the same marker gene (16S rRNA) but a different region such as V1, I have to train my own taxonomic classifier. Am I correct?

2> It mentions that this taxonomic classifier was trained using the Greengenes 13_8 99% OTUs. I downloaded Greengenes 13_8 here (

After unzipping it, I get several folders. I used QIIME 1 before; for closed-reference OTU picking (the pick_closed_OTU method), I used “99_otus.fasta” in the rep_set folder as the reference and “99_otu_taxonomy.txt” in the taxonomy folder for the taxonomic names.

Can you tell me which files you used to train the Greengenes taxonomic classifier?

3> Back in the old days, people used QIIME 1 and OTU picking. Since that used 97% similarity, some people used the 97% Greengenes dataset. In that case, they would have used 97_otus.fasta and 97_otu_taxonomy.txt?

I notice that QIIME 2 recommends using the 99% database. I am wondering how you trained this classifier. Were both the reference sequences and the taxonomic information taken from the 99% Greengenes database when you trained it?

In the future, if I want to train my own classifier, should I always try to find a 99%-level database? If there isn’t one, should I use the finest level available? As far as I know, some databases don’t have this kind of level (COI, 28S rRNA).

I just want to know the standard: is it the finer, the better?


Hi @moonlight,

Yes, since you are using different primers targeting a different region, you should train a classifier specific to that region.
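
To make the "region-specific" part concrete, here is a minimal Python sketch of the idea behind trimming reference sequences to the amplified region before training. It is not how QIIME 2 implements it (in practice the `extract-reads` and `fit-classifier-naive-bayes` actions of `qiime feature-classifier` cover this step); it only shows that a reference is cut down to whatever lies between your primers. The primer sequences are the original EMP 515F/806R as I remember them, so please check them against the protocol you actually use. The toy reference, the toy insert, and all function names are made up for illustration.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

# Assumed primer sequences (original EMP 515F/806R); verify before real use.
FWD_515F = "GTGCCAGCMGCCGCGGTAA"
REV_806R = "GGACTACHVGGGTWTCTAAT"


def primer_to_regex(primer):
    """Turn a degenerate primer into a regex pattern."""
    return "".join(IUPAC[base] for base in primer)


def reverse_complement(seq):
    """Reverse complement, including IUPAC ambiguity codes."""
    return seq.translate(COMPLEMENT)[::-1]


def extract_amplicon(reference, fwd_primer, rev_primer):
    """Return the region between the primer sites (primers excluded).

    Returns None when either primer site is absent, i.e. the reference
    does not cover the region those primers amplify.
    """
    fwd = re.search(primer_to_regex(fwd_primer), reference)
    if fwd is None:
        return None
    rev_site = re.compile(primer_to_regex(reverse_complement(rev_primer)))
    rev = rev_site.search(reference, fwd.end())
    if rev is None:
        return None
    return reference[fwd.end():rev.start()]


# Toy reference: flank + forward primer site + insert + reverse primer site + flank.
toy_insert = "TACGGAGGGTGCAAGCGTTAATCGG"
toy_reference = (
    "ACGT" * 3
    + "GTGCCAGCAGCCGCGGTAA"    # matches 515F (M -> A)
    + toy_insert
    + "ATTAGATACCCTGGTAGTCC"   # matches the reverse complement of 806R
    + "TTTT"
)

print(extract_amplicon(toy_reference, FWD_515F, REV_806R))
print(extract_amplicon("ACGT" * 20, FWD_515F, REV_806R))
```

The first call prints the toy insert; the second prints `None` because that sequence has no 515F/806R sites. A classifier trained on V4-trimmed references has only ever seen V4, which is why V1 reads need references trimmed with the V1 primers.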

If you look at the pre-trained Greengenes classifiers on the data resources page, their Provenance tab shows that 99_otu_taxonomy.txt and 99_otus.fasta were used, which is what I would recommend as well.
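
Whichever level you pick, the sequence file and the taxonomy file should come from the same clustering level so that every reference sequence has a label. Below is a rough sketch of a sanity check for that, assuming a plain FASTA file and a two-column, tab-separated taxonomy file (ID, then lineage), which is how I understand the Greengenes files to be laid out. The IDs and lineages in the toy data are invented, and the function names are my own.

```python
def read_fasta_ids(lines):
    """Collect sequence IDs (first token after '>') from FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def read_taxonomy(lines):
    """Parse 'id<TAB>lineage' lines into a dict of id -> lineage."""
    taxonomy = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        seq_id, lineage = line.split("\t", 1)
        taxonomy[seq_id] = lineage
    return taxonomy


def compare_ids(fasta_ids, taxonomy):
    """Report which IDs are shared and which appear in only one file."""
    tax_ids = set(taxonomy)
    return {
        "shared": fasta_ids & tax_ids,
        "missing_taxonomy": fasta_ids - tax_ids,
        "missing_sequence": tax_ids - fasta_ids,
    }


# Toy stand-ins for 99_otus.fasta and 99_otu_taxonomy.txt (invented content).
fasta_lines = [">1111 toy", "ACGT", ">2222", "GGCC", ">3333", "TTAA"]
taxonomy_lines = [
    "1111\tk__Bacteria; p__Firmicutes",
    "2222\tk__Bacteria; p__Proteobacteria",
    "4444\tk__Archaea; p__Euryarchaeota",
]

report = compare_ids(read_fasta_ids(fasta_lines), read_taxonomy(taxonomy_lines))
for name, ids in report.items():
    print(name, sorted(ids))
```

With real files you would pass open file handles instead of the toy lists. If you mix, say, 97_otus.fasta with 99_otu_taxonomy.txt, I would expect the two "missing" sets to be non-empty.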

Clustering the reference database at 97% identity and clustering the query sequences at 97% identity are two different processes that need not depend on each other. For higher resolution, one should use 99% for both. This choice really depends on the data and the question you are asking, though. That being said, today it is generally recommended to use denoising methods (DADA2 and Deblur, both available in QIIME 2) over OTU clustering methods anyway. There may be some instances where OTU picking is needed or preferred; in those rare cases it is still recommended to denoise first and then cluster the results (the ASVs). ASVs are essentially 100% OTUs, but of much better quality than the output of traditional OTU picking methods, which had inferior quality control.
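
As a toy illustration of "denoise first, then cluster the ASVs", the sketch below groups already-denoised sequences at different identity thresholds. It is deliberately simplified: identity is a position-by-position comparison of equal-length sequences, whereas real tools (for example the vsearch-based clustering actions in QIIME 2) use alignments. The sequences, abundances, and function names are all invented for the example.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this toy measure needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(asvs, threshold):
    """Assign each ASV to the first centroid it matches at >= threshold.

    `asvs` is a list of (sequence, abundance) pairs. The most abundant
    sequences are visited first, so they become the centroids.
    """
    clusters = {}  # centroid sequence -> list of member sequences
    for seq, _count in sorted(asvs, key=lambda item: -item[1]):
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid].append(seq)
                break
        else:
            # No centroid was close enough: this ASV starts a new cluster.
            clusters[seq] = [seq]
    return clusters


def mutate(seq, positions):
    """Toy helper: swap A<->C and G<->T at the given positions."""
    swap = {"A": "C", "C": "A", "G": "T", "T": "G"}
    chars = list(seq)
    for pos in positions:
        chars[pos] = swap[chars[pos]]
    return "".join(chars)


base = "ACGT" * 25  # 100 bp, so each mismatch costs exactly 1% identity
asvs = [
    (base, 100),
    (mutate(base, [10]), 50),                  # 99% identical to base
    (mutate(base, [20, 21, 22]), 20),          # 97% identical to base
    (mutate(base, [40, 41, 42, 43, 44]), 5),   # 95% identical to base
]

for threshold in (0.97, 0.99, 1.0):
    print(threshold, len(greedy_cluster(asvs, threshold)))
```

The four toy ASVs collapse into 2 clusters at 97%, 3 at 99%, and stay as 4 at 100%, which is the sense in which ASVs behave like 100% OTUs.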

Yes, the recommendation would be to use the highest resolution available. I’ve never used COI databases, but you should be able to use ITS databases (like UNITE) instead, which are superior databases with high-resolution reference sequences. I’m not sure about 28S, sorry.
