Classifier Training Questions

hongwei2017 · September 6, 2017, 4:55am

Can I ask one question about training my classifier?
I've downloaded the Greengenes reference files from the resources page. The unzipped folder contains several subfolders:

(1) otus (2) rep_set (3) rep_set_aligned (4) taxonomy (5) trees.

The tutorial says I need the "reference sequences" and the "corresponding taxonomic classifications".

My question is: for the reference sequences, should I use (2) or (3)? And why are (1) and (5) in the folder? What is their function in training?

Additionally, can you explain more about why we should train a classifier? Is it unique to QIIME, or is it common to all software for taxonomic analysis?

jairideout · September 6, 2017, 9:31pm

Hi @hongwei2017!

You'll want to use the files in the rep_set folder. For example, you could use rep_set/99_otus.fasta and taxonomy/99_otu_taxonomy.txt to train your classifier.
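
Before training, it can be worth checking that every sequence ID in the FASTA file also has an entry in the taxonomy file. Here is a minimal Python sketch of that check. It assumes the taxonomy file is tab-separated, with the sequence ID in the first column and the taxonomy string in the second. The function names are made up for illustration and are not part of QIIME 2.

```python
import io

def read_fasta_ids(handle):
    """Return the set of sequence IDs (first word of each '>' header line)."""
    ids = set()
    for line in handle:
        if line.startswith(">"):
            ids.add(line[1:].split()[0])
    return ids

def read_taxonomy(handle):
    """Return {sequence ID: taxonomy string} from a two-column, tab-separated file."""
    taxonomy = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        seq_id, taxon = line.split("\t", 1)
        taxonomy[seq_id] = taxon.strip()
    return taxonomy

def missing_taxonomy(fasta_handle, taxonomy_handle):
    """Return the sorted IDs that appear in the FASTA but not in the taxonomy."""
    seq_ids = read_fasta_ids(fasta_handle)
    taxonomy = read_taxonomy(taxonomy_handle)
    return sorted(seq_ids - set(taxonomy))

# Tiny made-up records standing in for the real files
example_fasta = io.StringIO(">101\nACGTACGT\n>102\nTTGACCGA\n")
example_taxonomy = io.StringIO("101\tk__Bacteria; p__Firmicutes\n")
print(missing_taxonomy(example_fasta, example_taxonomy))
```

To run it on the real files, pass open("rep_set/99_otus.fasta") and open("taxonomy/99_otu_taxonomy.txt") instead of the in-memory examples. An empty list means every reference sequence has a taxonomy entry.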

The rep_set_aligned folder contains the aligned representative sequences, which you don't need to use with the feature classifiers currently available in QIIME 2. Theoretically there could be a feature classifier implemented in the future that requires aligned reference sequences as input, but I don't know of one offhand.

The otus and trees folders aren't necessary for training feature classifiers. They are included because Greengenes is a 16S reference database that can be used for purposes other than training feature classifiers. The otus folder contains "OTU maps", which describe the sequences associated with each Greengenes OTU. The trees folder contains phylogenetic trees built from the aligned representative sequences. These trees can be used, for example, in phylogenetic diversity calculations such as UniFrac when you've performed closed-reference OTU picking on your data.

Training a classifier isn't a concept unique to QIIME or any other tool; it is a general step in machine learning, where a model is fit to labeled examples before it can make predictions. Machine learning-based algorithms are pretty popular for taxonomic classification, so there is often a training step involved. For example, take a look at the RDP classifier, which implements a naive Bayes classifier that must be trained on reference sequences. In QIIME 2, we provide a similar type of naive Bayes classifier via qiime feature-classifier classify-sklearn, which also requires a training step.
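
To make the "train once, then classify" idea concrete, here is a toy multinomial naive Bayes classifier over k-mers in plain Python. It is only a sketch of the general technique, not how the RDP classifier or classify-sklearn is implemented. The k-mer length, the sequences, and the taxon labels are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

K = 3  # k-mer length; deliberately tiny for the toy sequences (hypothetical choice)

def kmers(seq, k=K):
    """Return all overlapping k-mers in a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train(references):
    """Training step: count k-mers per taxon from (sequence, taxon) pairs."""
    counts = defaultdict(Counter)  # taxon -> k-mer counts
    n_seqs = Counter()             # taxon -> number of reference sequences
    vocabulary = set()
    for seq, taxon in references:
        n_seqs[taxon] += 1
        for kmer in kmers(seq):
            counts[taxon][kmer] += 1
            vocabulary.add(kmer)
    return {"counts": counts, "n_seqs": n_seqs, "vocabulary": vocabulary}

def classify(model, seq):
    """Classification step: return the taxon with the highest log posterior."""
    total_seqs = sum(model["n_seqs"].values())
    vocab_size = len(model["vocabulary"])
    best_taxon, best_score = None, -math.inf
    for taxon, kmer_counts in model["counts"].items():
        total_kmers = sum(kmer_counts.values())
        score = math.log(model["n_seqs"][taxon] / total_seqs)  # log prior
        for kmer in kmers(seq):
            # Add-one smoothing so an unseen k-mer doesn't zero out the probability
            score += math.log((kmer_counts[kmer] + 1) / (total_kmers + vocab_size))
        if score > best_score:
            best_taxon, best_score = taxon, score
    return best_taxon

# Made-up reference sequences with made-up labels
references = [
    ("ACGTACGTACGT", "taxon_A"),
    ("ACGTACGAACGT", "taxon_A"),
    ("TTGGCCTTGGCC", "taxon_B"),
    ("TTGGCCTAGGCC", "taxon_B"),
]
model = train(references)           # done once, up front
print(classify(model, "ACGTACGT"))  # the trained model is reused for each query
print(classify(model, "GGCCTTGG"))
```

The point of the split is that train() only has to run once per reference database, and the resulting model can then be reused for every query sequence.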

We additionally provide other types of feature classifiers that don't require a training step. These are consensus-based classifiers that use sequence similarity searching and alignment to determine taxonomic classification (the commands are classify-consensus-blast and classify-consensus-vsearch).
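
As a rough illustration of the consensus idea, the sketch below takes the taxonomy strings of a query's top reference hits and keeps the deepest lineage that enough of them agree on. It is not the QIIME 2 implementation. The similarity search itself is not shown, and the function name, the min_consensus threshold, and the "Unassigned" label are assumptions made for this example.

```python
from collections import Counter

def consensus_taxonomy(hit_taxonomies, min_consensus=0.51):
    """Return the deepest lineage that a sufficient fraction of hits agree on.

    `hit_taxonomies` holds one semicolon-delimited taxonomy string per
    reference hit. `min_consensus` is a hypothetical threshold: the fraction
    of hits that must share a lineage at a given level.
    """
    lineages = [[level.strip() for level in t.split(";")] for t in hit_taxonomies]
    if not lineages:
        return "Unassigned"
    consensus = []
    depth = max(len(lineage) for lineage in lineages)
    for level in range(depth):
        # Compare whole prefixes so agreement at earlier levels is required too
        prefixes = Counter(
            tuple(lineage[:level + 1])
            for lineage in lineages
            if len(lineage) > level
        )
        prefix, count = prefixes.most_common(1)[0]
        # Stop when agreement is too low or the prefix leaves the lineage so far
        if count / len(lineages) < min_consensus or list(prefix[:-1]) != consensus:
            break
        consensus = list(prefix)
    return "; ".join(consensus) if consensus else "Unassigned"

# Made-up taxonomies for four hypothetical hits of one query sequence
hits = [
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria; p__Firmicutes; c__Clostridia",
    "k__Bacteria; p__Firmicutes; c__Bacilli",
    "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
]
print(consensus_taxonomy(hits))
```

Here all four hits agree on the kingdom and three of four agree on the phylum, but only two of four agree on the class, so the assignment stops at the phylum level. Nothing is trained in advance; the reference sequences are searched directly for each query.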

Thus, the need for a training step depends on what kind of feature classification algorithm you're using.

Note: In QIIME 1 it is possible to train a feature classifier (e.g. the RDP classifier) as a separate step, but the default behavior of assign_taxonomy.py is to train a classifier on-the-fly each time the script is executed. Thus, if you're used to QIIME 1's workflow, the training step happens but it's often not apparent to users.

hongwei2017 · September 7, 2017, 3:39am

Hi @jairideout,

Your explanations are really appreciated! Many thanks!

Cheers
hongwei
