How to train a UNITE classifier using RESCRIPt

:construction: This tutorial is a work in progress. Please leave us questions and comments so we can better help you use this database. All feedback is welcome.

If you are looking for a pre-trained ITS classifier, you can download one here! :inbox_tray:

How to train a UNITE ITS classifier using RESCRIPt.

The UNITE community maintains a database of the eukaryotic nuclear internal transcribed spacer (ITS) region. The data is drawn from all eukaryotic ITS sequences in the International Nucleotide Sequence Database Collaboration and is provided in multiple formats.

In this short tutorial, we'll show you how to download and format this data using RESCRIPt.

If you use RESCRIPt and the UNITE database, please cite the following:

  • Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581. doi: 10.1371/journal.pcbi.1009581
  • Abarenkov, Kessy, R. Henrik Nilsson, Karl-Henrik Larsson, Andy F. S. Taylor, Tom W. May, Tobias Guldberg Frøslev, Julia Pawlowska, et al. 2023. “The UNITE Database for Molecular Identification and Taxonomic Communication of Fungi and Other Eukaryotes: Sequences, Taxa and Classifications Reconsidered.” Nucleic Acids Research, November. doi: 10.1093/nar/gkad1039.
  • Specific DOI for Database Used: See https://unite.ut.ee/repository.php for examples.

Please see the UNITE "How to cite?" page for more details.

To run this tutorial you'll need:

Download data from UNITE. :inbox_tray:

Several versions of the UNITE database are available, and RESCRIPt can access any combination of the following options:

  • 'fungi only' or 'eukaryotes (including fungi)'
  • clustered at 99%, 97%, or 'dynamic' thresholds
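  • with or without 'singletons'

For example, this command downloads version 9.0 for all eukaryotes, clustered at the dynamic threshold, without singletons:

qiime rescript get-unite-data \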
    --p-version 9.0 \
    --p-taxon-group eukaryotes \
    --p-cluster-id dynamic \
    --p-no-singletons \
    --verbose \
    --output-dir uniteDB
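
To see how the options listed above combine, here is a small dry-run sketch that only prints candidate commands and does not download anything. The `fungi` taxon group, the `99` and `97` cluster values, the `--p-singletons` flag, and the output directory names are assumptions inferred from the option list, so please confirm the accepted values with `qiime rescript get-unite-data --help` before running any of them.

```shell
# Dry run: print (do not execute) one get-unite-data command per option combination.
# Assumed values: taxon groups "fungi"/"eukaryotes", cluster ids "99"/"97"/"dynamic".
print_unite_commands() {
    local version="$1"   # e.g. 9.0, as in the command above
    local group cluster singles
    for group in fungi eukaryotes; do
        for cluster in 99 97 dynamic; do
            for singles in singletons no-singletons; do
                # Hypothetical naming scheme: one output directory per combination.
                echo "qiime rescript get-unite-data" \
                     "--p-version ${version}" \
                     "--p-taxon-group ${group}" \
                     "--p-cluster-id ${cluster}" \
                     "--p-${singles}" \
                     "--output-dir unite-${version}-${group}-${cluster}-${singles}"
            done
        done
    done
}

print_unite_commands 9.0
```

Each printed line can be copied and run on its own once you have checked the values against the help text.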

We recommend always downloading the full eukaryote database: the non-fungal sequences act as outgroups, which helps the classifier separate fungal from non-fungal reads.

Optional Curation :hammer_and_wrench:

You may want to construct an amplicon-specific classifier or curate the database in other ways. See these RESCRIPt tutorials for more inspiration:

Example UNITE database preparation:

Remove sequences with unhelpful taxonomy:

qiime taxa filter-seqs \
    --p-exclude Fungi_sp,mycota_sp,mycetes_sp \
    --i-taxonomy uniteDB/taxonomy.qza \
    --i-sequences uniteDB/sequences.qza \
    --o-filtered-sequences uniteDB/sequences-filtered.qza
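
If it helps to see which kinds of labels the exclude terms above are aimed at, here is a rough stand-in using `grep` on a few made-up taxonomy rows. The feature IDs and taxonomy strings are hypothetical, and plain case-sensitive substring matching is only an approximation of how `qiime taxa filter-seqs` matches terms, so treat this as an illustration and not a description of the plugin's behavior.

```shell
# Hypothetical taxonomy rows (feature ID <tab> taxon), made up for illustration.
sample_taxonomy() {
    printf 'seq1\tk__Fungi;p__Ascomycota;c__Sordariomycetes;s__Fusarium_oxysporum\n'
    printf 'seq2\tk__Fungi;p__unidentified;c__unidentified;s__Fungi_sp\n'
    printf 'seq3\tk__Fungi;p__Ascomycota;c__unidentified;s__Ascomycota_sp\n'
    printf 'seq4\tk__Fungi;p__Basidiomycota;c__Agaricomycetes;s__Agaricomycetes_sp\n'
}

# Keep only rows that contain none of the excluded terms (fixed-string match),
# then print the IDs of the rows that survive.
kept_ids() {
    sample_taxonomy \
        | grep -v -F -e 'Fungi_sp' -e 'mycota_sp' -e 'mycetes_sp' \
        | cut -f1
}

kept_ids
```

Only `seq1` survives here: the other three rows end in a placeholder species label that says nothing more specific than the higher rank it is built from.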

We likely do not need the species hypothesis (SH) accessions that are appended to each UNITE taxonomy string.
Let's use edit-taxonomy to strip them, which makes our classifier more efficient.

qiime rescript edit-taxonomy \
    --i-taxonomy uniteDB/taxonomy.qza \
    --o-edited-taxonomy uniteDB/taxonomy-no-SH.qza \
    --p-search-strings ';sh__.*' \
    --p-replacement-strings '' \
    --p-use-regex
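
To preview what the `;sh__.*` pattern removes, here is a minimal sketch that applies the same regular expression with `sed` to a single taxonomy string. The taxonomy string and its accession are made up for illustration, and `sed` is only standing in for the plugin here.

```shell
# Hypothetical taxonomy string; the trailing accession is a made-up placeholder.
taxon='k__Fungi;p__Ascomycota;c__Sordariomycetes;o__Hypocreales;f__Nectriaceae;g__Fusarium;s__Fusarium_oxysporum;sh__SH0000000.00FU'

strip_sh() {
    # Same pattern as --p-search-strings ';sh__.*', replaced with an empty string:
    # everything from ";sh__" to the end of the line is dropped.
    printf '%s\n' "$1" | sed 's/;sh__.*//'
}

strip_sh "$taxon"
```

The result ends at the species level (`s__Fusarium_oxysporum`), which is the form the edited taxonomy is expected to take.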

Train a naive-bayes classifier on UNITE :bar_chart:

This uses the example curation we performed above:

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads uniteDB/sequences-filtered.qza \
    --i-reference-taxonomy uniteDB/taxonomy-no-SH.qza \
    --o-classifier uniteDB/classifier.qza

Evaluate the classifier

This can take a while :hourglass_flowing_sand:

qiime rescript evaluate-fit-classifier \
    --i-sequences uniteDB/sequences-filtered.qza \
    --i-taxonomy uniteDB/taxonomy-no-SH.qza \
    --p-n-jobs 2 \
    --o-classifier uniteDB/classifier.qza \
    --o-evaluation uniteDB/classifier-evaluation.qzv \
    --o-observed-taxonomy uniteDB/predicted-taxonomy.qza

qiime rescript evaluate-taxonomy \
    --i-taxonomies uniteDB/taxonomy-no-SH.qza uniteDB/predicted-taxonomy.qza \
    --p-labels ref-taxonomy predicted-taxonomy \
    --o-taxonomy-stats uniteDB/both-taxonomy-evaluation.qzv

Additional notes: :spiral_notepad:

You can see all database variations we support by running:

qiime rescript get-unite-data --help

Want a different version? Open an issue on GitHub!

Now you're ready to use UNITE!
:mushroom: :microbe: :qiime2:

Hi there,

Question about the 'Optional Curation' part. In the training classifiers tutorial, it mentions at the bottom that "fungal ITS classifiers trained on the UNITE reference database do NOT benefit from extracting/trimming reads to primer sites. We recommend training UNITE classifiers on the full reference sequences".

I am wondering whether it's best to skip extracting the primer region and train the classifier on the full sequences instead? Thank you.

Yes, that's what is being suggested here.

If you would like to try this, here are classifiers I've trained on the full UNITE database:

I'm not sure what will work best on your data!
You could use your positive controls to test this. (Did you include positive controls on the sequencing run?)