This tutorial is a work in progress. Please leave us questions and comments so we can better help you use this database. All feedback is welcome.
How to train a UNITE ITS classifier using RESCRIPt.
The UNITE community maintains a database of the eukaryotic nuclear internal transcribed spacer (ITS) region. The data are drawn from all eukaryotic ITS sequences in the International Nucleotide Sequence Database Collaboration and are provided in multiple formats.
In this short tutorial, we'll show you how to download and format this data using RESCRIPt.
If you use RESCRIPt and the UNITE database, please cite the following:
- Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581. doi: 10.1371/journal.pcbi.1009581
- Abarenkov, Kessy, R. Henrik Nilsson, Karl-Henrik Larsson, Andy F. S. Taylor, Tom W. May, Tobias Guldberg Frøslev, Julia Pawlowska, et al. 2023. "The UNITE Database for Molecular Identification and Taxonomic Communication of Fungi and Other Eukaryotes: Sequences, Taxa and Classifications Reconsidered." Nucleic Acids Research, November. doi: 10.1093/nar/gkad1039
- Specific DOI for Database Used: See https://unite.ut.ee/repository.php for examples.
Please see the UNITE "How to cite?" page for more details.
To run this tutorial you'll need:
- An up-to-date version of QIIME 2
- The developer version of RESCRIPt
Download data from UNITE.
Several versions of the UNITE database are available, and any combination of the following options can be accessed via RESCRIPt:
- 'fungi only' or 'eukaryotes (including fungi)'
- clustered at 99%, 97%, or 'dynamic' thresholds
- with or without 'singletons'
qiime rescript get-unite-data \
--p-version 9.0 \
--p-taxon-group eukaryotes \
--p-cluster-id dynamic \
--p-no-singletons \
--verbose \
--output-dir uniteDB
We recommend always downloading all eukaryotes: the non-fungal sequences serve as outgroups, which helps the classifier distinguish fungi from non-fungi.
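The option combinations listed above can be enumerated with a small dry-run sketch. It only prints the commands rather than running them. The helper name `print_unite_cmd` and the per-combination output directory names are hypothetical, and the accepted value for each flag is assumed from the list above; `qiime rescript get-unite-data --help` remains the authoritative source.

```shell
# Dry run: print (do not execute) one get-unite-data command per combination.
# print_unite_cmd is a hypothetical helper, not part of RESCRIPt.
print_unite_cmd() {
    taxon_group=$1   # fungi | eukaryotes
    cluster_id=$2    # 99 | 97 | dynamic
    singletons=$3    # singletons | no-singletons
    printf 'qiime rescript get-unite-data --p-version 9.0 --p-taxon-group %s --p-cluster-id %s --p-%s --output-dir unite-%s-%s-%s\n' \
        "$taxon_group" "$cluster_id" "$singletons" \
        "$taxon_group" "$cluster_id" "$singletons"
}

# Walk every combination of the three options.
for group in fungi eukaryotes; do
    for id in 99 97 dynamic; do
        for s in singletons no-singletons; do
            print_unite_cmd "$group" "$id" "$s"
        done
    done
done
```

Printing first makes it easy to review the twelve candidate commands before committing to any download.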
Optional Curation
You may want to construct an amplicon-specific classifier or curate the database in other ways. See the other RESCRIPt tutorials for more inspiration.
Example UNITE database preparation:
Remove sequences with unhelpful taxonomy:
qiime taxa filter-seqs \
--p-exclude Fungi_sp,mycota_sp,mycetes_sp \
--i-taxonomy uniteDB/taxonomy.qza \
--i-sequences uniteDB/sequences.qza \
--o-filtered-sequences uniteDB/sequences-filtered.qza
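To see what the exclusion terms above target, here is a rough plain-shell approximation using `grep`. It assumes the terms are matched as substrings of the taxonomy string; the plugin's actual matching rules (for example, case sensitivity) may differ. The function name `drop_unhelpful` and the sample taxonomy strings are made up for illustration.

```shell
# drop_unhelpful: hypothetical helper that reads taxonomy strings on stdin
# and prints only those containing none of the excluded terms.
drop_unhelpful() {
    grep -v -F -e 'Fungi_sp' -e 'mycota_sp' -e 'mycetes_sp'
}

# Made-up taxonomy strings, one per line.
printf '%s\n' \
    'k__Fungi;p__Ascomycota;c__Sordariomycetes;s__Fusarium_oxysporum' \
    'k__Fungi;p__Ascomycota;c__Ascomycota_cls_Incertae_sedis;s__Ascomycota_sp' \
    'k__Fungi;p__Basidiomycota;c__Agaricomycetes;s__Agaricomycetes_sp' \
    'k__Fungi;p__Fungi_phy_Incertae_sedis;s__Fungi_sp' |
    drop_unhelpful
```

Only the first string survives, because the other three end in a placeholder species label that carries no information below the listed rank.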
We likely do not need the species hypothesis (SH) accessions annotated within the UNITE taxonomy. Let's use edit-taxonomy to remove them and make our classifier more efficient.
qiime rescript edit-taxonomy \
--i-taxonomy uniteDB/taxonomy.qza \
--o-edited-taxonomy uniteDB/taxonomy-no-SH.qza \
--p-search-strings ';sh__.*' \
--p-replacement-strings '' \
--p-use-regex
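The effect of the regular expression above can be previewed on a single taxonomy string with `sed`. This only illustrates the pattern, not how RESCRIPt applies it internally; `strip_sh` and the sample string (including its accession) are hypothetical.

```shell
# strip_sh: hypothetical helper that removes ';sh__' and everything after it,
# mirroring the search string ';sh__.*' with an empty replacement.
strip_sh() {
    printf '%s\n' "$1" | sed 's/;sh__.*//'
}

strip_sh 'k__Fungi;p__Ascomycota;g__Fusarium;s__Fusarium_oxysporum;sh__SH0000000.00FU'
# prints: k__Fungi;p__Ascomycota;g__Fusarium;s__Fusarium_oxysporum
```

Strings without an `;sh__` suffix pass through unchanged, so the edit is safe to apply to the whole taxonomy.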
Train a naive-bayes classifier on UNITE
This uses the example curation we performed above:
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads uniteDB/sequences-filtered.qza \
--i-reference-taxonomy uniteDB/taxonomy-no-SH.qza \
--o-classifier uniteDB/classifier.qza
Evaluate the classifier
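This can take a while. Note that the command below writes its classifier to uniteDB/classifier.qza, the same path used by the training step above.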
qiime rescript evaluate-fit-classifier \
--i-sequences uniteDB/sequences-filtered.qza \
--i-taxonomy uniteDB/taxonomy-no-SH.qza \
--p-n-jobs 2 \
--o-classifier uniteDB/classifier.qza \
--o-evaluation uniteDB/classifier-evaluation.qzv \
--o-observed-taxonomy uniteDB/predicted-taxonomy.qza
qiime rescript evaluate-taxonomy \
--i-taxonomies uniteDB/taxonomy-no-SH.qza uniteDB/predicted-taxonomy.qza \
--p-labels ref-taxonomy predicted-taxonomy \
--o-taxonomy-stats uniteDB/both-taxonomy-evaluation.qzv
Additional notes:
You can see all database variations we support by running:
qiime rescript get-unite-data --help
Want a different version? Open an issue on GitHub!
Now you're ready to use UNITE!