How to train a UNITE classifier using RESCRIPt

colinbrislawn · November 14, 2023, 7:33pm

This tutorial is a work in progress. Please leave us questions and comments so we can better help you use this database. All feedback is welcome.

If you are looking for a pre-trained ITS classifier, you can download one here!

How to train a UNITE ITS classifier using RESCRIPt.

The UNITE community maintains a database of the eukaryotic nuclear internal transcribed spacer (ITS) region. The data comes from all eukaryotic ITS sequences from the International Nucleotide Sequence Database Collaboration and is provided for use in multiple formats.

In this short tutorial, we'll show you how to download and format this data using RESCRIPt.

If you use RESCRIPt and the UNITE database, please cite the following:

Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581.; doi: 10.1371/journal.pcbi.1009581
Abarenkov, Kessy, R. Henrik Nilsson, Karl-Henrik Larsson, Andy F. S. Taylor, Tom W. May, Tobias Guldberg Frøslev, Julia Pawlowska, et al. 2023. “The UNITE Database for Molecular Identification and Taxonomic Communication of Fungi and Other Eukaryotes: Sequences, Taxa and Classifications Reconsidered.” Nucleic Acids Research , November. doi: 10.1093/nar/gkad1039.
Specific DOI for Database Used: See https://unite.ut.ee/repository.php for examples.

Please see the UNITE "How to cite?" page for more details.

To run this tutorial you'll need:

An up-to-date version of QIIME 2
The developer version of RESCRIPt

Download data from UNITE.

Several versions of the UNITE database are available, and any combination of the following options can be accessed via RESCRIPt:

'fungi only' or 'eukaryotes (including fungi)'
clustered at 99%, 97%, or 'dynamic' thresholds
with or without 'singletons'

qiime rescript get-unite-data \
    --p-version 9.0 \
    --p-taxon-group eukaryotes \
    --p-cluster-id dynamic \
    --p-no-singletons \
    --verbose \
    --output-dir uniteDB

We recommend always downloading all eukaryotes, because the non-fungal sequences serve as outgroups that help the classifier separate fungi from non-fungi.

Optional Curation

You may want to construct an amplicon-specific classifier or curate the database in other ways. See these RESCRIPt tutorials for more inspiration:

Example UNITE database preparation:

Remove sequences with unhelpful taxonomy:

qiime taxa filter-seqs \
    --p-exclude Fungi_sp,mycota_sp,mycetes_sp \
    --i-taxonomy uniteDB/taxonomy.qza \
    --i-sequences uniteDB/sequences.qza \
    --o-filtered-sequences uniteDB/sequences-filtered.qza
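To get a feel for how many reference entries these three search terms touch, a minimal sketch like the one below may help. It assumes the taxonomy has already been exported to a two-column TSV (Feature ID, Taxon), and it uses plain substring matching with grep, which only approximates the matching rules of filter-seqs. The helper name count_excluded and the toy rows are made up for illustration.

```shell
# Hypothetical helper: for each search term, count the rows of a taxonomy TSV
# (Feature ID <tab> Taxon) that contain it as a plain substring.
count_excluded() {
    for term in Fungi_sp mycota_sp mycetes_sp; do
        # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true"
        n=$(grep -c -F -- "$term" "$1" || true)
        printf '%s\t%s\n' "$term" "$n"
    done
}

# Toy input with made-up IDs and labels, only to show the call.
demo=$(mktemp)
printf 'id1\tk__Fungi;s__Fungi_sp\n' > "$demo"
printf 'id2\tk__Fungi;p__Ascomycota;s__Ascomycota_sp\n' >> "$demo"
printf 'id3\tk__Fungi;p__Ascomycota;g__Dendriscosticta\n' >> "$demo"
count_excluded "$demo"
rm -f "$demo"
```

On real data, the TSV can be produced with qiime tools export (--input-path uniteDB/taxonomy.qza plus an --output-path directory of your choice), which writes a taxonomy.tsv file into that directory.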

We likely do not need the specific species hypothesis (SH) accessions annotated within the UNITE taxonomy.
Let's use edit-taxonomy to remove them and make our classifier more efficient.

qiime rescript edit-taxonomy \
    --i-taxonomy uniteDB/taxonomy.qza \
    --o-edited-taxonomy uniteDB/taxonomy-no-SH.qza \
    --p-search-strings ';sh__.*' \
    --p-replacement-strings '' \
    --p-use-regex
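To preview what the ';sh__.*' pattern does to a single taxonomy string, here is a small sketch using sed outside of QIIME 2. It is only an illustration of the regular expression, not how edit-taxonomy is implemented, and the helper name strip_sh is made up.

```shell
# Hypothetical helper: remove ';sh__' and everything after it from one taxonomy string.
strip_sh() {
    printf '%s\n' "$1" | sed 's/;sh__.*//'
}

# The string now ends at the species rank (s__Dendriscosticta_sp).
strip_sh 'k__Fungi;p__Ascomycota;c__Lecanoromycetes;o__Peltigerales;f__Peltigeraceae;g__Dendriscosticta;s__Dendriscosticta_sp;sh__SH0820801.10FU'
```

Strings without an sh__ field pass through unchanged, so the pattern is safe to apply to every row.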

Train a naive-bayes classifier on UNITE

This uses the example curation we performed above:

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads uniteDB/sequences-filtered.qza \
    --i-reference-taxonomy uniteDB/taxonomy-no-SH.qza \
    --o-classifier uniteDB/classifier.qza

Evaluate the classifier

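This can take a while. Note that evaluate-fit-classifier trains its own classifier, so the command below writes to the same uniteDB/classifier.qza path as the previous step.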

qiime rescript evaluate-fit-classifier \
    --i-sequences uniteDB/sequences-filtered.qza   \
    --i-taxonomy uniteDB/taxonomy-no-SH.qza \
    --p-n-jobs 2 \
    --o-classifier uniteDB/classifier.qza \
    --o-evaluation uniteDB/classifier-evaluation.qzv \
    --o-observed-taxonomy uniteDB/predicted-taxonomy.qza

qiime rescript evaluate-taxonomy \
  --i-taxonomies uniteDB/taxonomy-no-SH.qza uniteDB/predicted-taxonomy.qza \
  --p-labels ref-taxonomy predicted-taxonomy \
  --o-taxonomy-stats uniteDB/both-taxonomy-evaluation.qzv

Additional notes:

You can see all database variations we support by running:

qiime rescript get-unite-data --help
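If you want to see the combinations spelled out, the dry-run sketch below prints one get-unite-data command per combination without running anything. It assumes the option values are spelled fungi / eukaryotes and 99 / 97 / dynamic, and that --p-singletons is the counterpart of --p-no-singletons; please confirm against the --help output. The function name and the output directory names are made up for illustration.

```shell
# Dry run: print (do not execute) one download command per option combination.
# Assumption: option values are spelled as listed here; check --help to confirm.
list_unite_commands() {
    for group in fungi eukaryotes; do
        for cluster in 99 97 dynamic; do
            for singles in --p-singletons --p-no-singletons; do
                # ${singles#--p} turns "--p-no-singletons" into "-no-singletons"
                printf 'qiime rescript get-unite-data --p-version 9.0 --p-taxon-group %s --p-cluster-id %s %s --output-dir uniteDB-%s-%s%s\n' \
                    "$group" "$cluster" "$singles" "$group" "$cluster" "${singles#--p}"
            done
        done
    done
}

list_unite_commands
```

Printing first makes it easy to copy only the one or two commands you actually need, since each download takes time and disk space.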

Want a different version? Open an issue on GitHub!

Now you're ready to use UNITE!

SoilRotifer · January 4, 2024, 4:50pm

An off-topic reply has been split into a new topic: UNITE, Include certain species of fungi

Please keep replies on-topic in the future.

emmlemore · January 19, 2024, 2:24pm

Hi there,

Question about the 'Optional Curation' part. In the training classifiers tutorial, it mentions at the bottom that "fungal ITS classifiers trained on the UNITE reference database do NOT benefit from extracting/trimming reads to primer sites. We recommend training UNITE classifiers on the full reference sequences".

I am wondering if it's best to not train the classifier by adding the primer region? Thank you.

colinbrislawn · January 19, 2024, 3:14pm

Yes, that's what is being suggested here.

If you would like to try this, here are classifiers I've trained on the full UNITE database:

I'm not sure what will work best on your data!
You could use your positive controls to test this. (Did you include positive controls on the sequencing run?)

salias · May 16, 2024, 4:10pm

Hello,

Just thinking about the optional curation part, I don't know if I understood it properly so I have some questions. As far as I understood:

The first command (qiime taxa filter-seqs) removes sequences that are annotated too generally (like fungi annotated simply as Fungi_sp). Regarding this, are these three (Fungi_sp, mycota_sp, mycetes_sp) the only cases where this happens? Or maybe the most common cases in UNITE?
The second command (qiime rescript edit-taxonomy) removes the species hypotheses (SH) information from the taxonomy file. I don't get why we remove this information. Is it only because this identifier is redundant and removing it speeds up the process? Or maybe the reason is that removing them makes the classifier more "blind" when training? I'm a little bit lost here.
Finally, I saw that you trained and shared UNITE classifiers (many thanks for that!). Maybe it's a dumb question, but did you perform that optional curation step on those classifiers? I ask because if I'm going to use them I would like to know exactly the preprocessing applied to them.

Again, many thanks in advance

colinbrislawn · May 16, 2024, 7:58pm

Hello Sergio,

Sure, I'm happy to answer any questions you have.

Let's start here: Database creation is hard because there are many choices you can make, and you have to justify each choice.

I try to simplify as much as possible by making no choices.
data -> database

If I discover an issue, like a taxonomy label is spelled wrong, I could use this as justification to add extra steps to my process.
data -> edit taxonomy -> database

No. I used the full database unedited. These other functions are just examples.

Great news! Every QIIME 2 .qza and .qzv file includes all the settings used to make everything inside of them!

You can put one of my files into https://view.qiime2.org,
then click on Provenance,
then click on a node,
then view the Action Details to see all settings used!

This also shows my full pipeline, which is: import taxonomy, import reads, and train with taxonomy and reads.

SoilRotifer · May 16, 2024, 8:35pm

I'd like to add one point to @colinbrislawn's reply. Specifically regarding:

The text after sh__ is often considered not informative. That is, the question you'll have to answer for yourself is: what information is this really conveying to you when you classify your sequences? If you can identify down to an sh__ label, what good is it really? What does it actually tell you?

For example, let's pretend that you have a query sequence that is an equally likely match to sequences with the following taxonomy:

SH0820801.10FU_MT590811_reps	k__Fungi;p__Ascomycota;c__Lecanoromycetes;o__Peltigerales;f__Peltigeraceae;g__Dendriscosticta;s__Dendriscosticta_sp;sh__SH0820801.10FU
SH0820820.10FU_MT590862_reps	k__Fungi;p__Ascomycota;c__Lecanoromycetes;o__Peltigerales;f__Peltigeraceae;g__Dendriscosticta;s__Dendriscosticta_sp;sh__SH0820820.10FU

Note that these two taxonomy strings both reference the same species: s__Dendriscosticta_sp. However, the sh__ portion of the string differs between the two, i.e. sh__SH0820801.10FU vs sh__SH0820820.10FU. Does this matter? It depends on what you're after.

Anyway, the taxonomy strings for these two reference entries are identical up to the sh__ portion of the string. Since, in our pretend example, your query is an equally likely hit to both of these sh__ types, the classifier will not be able to decide which of these sh__ labels to use. Thus, the classifier will return the Lowest Common Ancestor (LCA) to resolve this, which means the classifier would only return:

k__Fungi;p__Ascomycota;c__Lecanoromycetes;o__Peltigerales;f__Peltigeraceae;g__Dendriscosticta;s__Dendriscosticta_sp

as the assigned taxonomy, assuming you're lucky enough to obtain a species-level hit. The thought is that this would happen so often that it is not worth using the sh__ labels, especially since most data would only be classified to the family or genus level anyway, returning:

k__Fungi;p__Ascomycota;c__Lecanoromycetes;o__Peltigerales;f__Peltigeraceae

This is a justification for removing the sh__ labels: to reduce the size of the classifier and to increase the speed at which the classifier works (as you have fewer ranks to work through). There is nothing wrong with keeping sh__.

In fact, this is at the heart of the RESCRIPt philosophy: curate the reference database to suit your needs. Everyone requires / likes to do things differently.
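The lowest-common-ancestor idea described above can be sketched as a toy shell function that keeps only the ranks two taxonomy strings share, reading from the kingdom downward. This is just an illustration of the concept on two strings; it is not how the naive Bayes classifier actually computes its assignments, and the helper name shared_ranks is made up.

```shell
# Hypothetical helper: print the leading ranks shared by two semicolon-delimited
# taxonomy strings, stopping at the first rank where they disagree.
shared_ranks() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, ra, ";")
        nb = split(b, rb, ";")
        out = ""
        for (i = 1; i <= na && i <= nb; i++) {
            if (ra[i] != rb[i]) break
            out = (i == 1) ? ra[i] : (out ";" ra[i])
        }
        print out
    }'
}

# The two reference entries from the example differ only in their sh__ field,
# so the shared part ends at s__Dendriscosticta_sp.
shared_ranks \
    'k__Fungi;p__Ascomycota;c__Lecanoromycetes;o__Peltigerales;f__Peltigeraceae;g__Dendriscosticta;s__Dendriscosticta_sp;sh__SH0820801.10FU' \
    'k__Fungi;p__Ascomycota;c__Lecanoromycetes;o__Peltigerales;f__Peltigeraceae;g__Dendriscosticta;s__Dendriscosticta_sp;sh__SH0820820.10FU'
```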

salias · May 17, 2024, 9:28am

Thank you @colinbrislawn and @SoilRotifer for your explanations and clarifications. It's clear to me now.

Nicholas_Bokulich · November 27, 2024, 6:49pm

3 off-topic replies have been split into a new topic: curation recommendations for UNITE database based on sequence length

Please keep replies on-topic in the future.

Lavendel · January 3, 2025, 10:23am

Hi all,

I’m working through some commands and noticed a potential overlap in outputs when using the following steps:

First, the classifier is trained with:

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads uniteDB/sequences-filtered.qza \
    --i-reference-taxonomy uniteDB/taxonomy-no-SH.qza \
    --o-classifier uniteDB/classifier.qza

Later, I see a step to evaluate the classifier:

qiime rescript evaluate-fit-classifier \
    --i-sequences uniteDB/sequences-filtered.qza   \
    --i-taxonomy uniteDB/taxonomy-no-SH.qza \
    --p-n-jobs 2 \
    --o-classifier uniteDB/classifier.qza \
    --o-evaluation uniteDB/classifier-evaluation.qzv \
    --o-observed-taxonomy uniteDB/predicted-taxonomy.qza

Questions:

Why does it produce the same classifier.qza output as the fit-classifier-naive-bayes step?
Which classifier.qza (from fit-classifier-naive-bayes or evaluate-fit-classifier) should be used for classifying sequences?

Any insights would be greatly appreciated!

Thanks!

SoilRotifer · January 3, 2025, 2:37pm

Hi @Lavendel,

These are just two ways to generate a classifier. If you do not want to evaluate the database, then you can simply run fit-classifier-naive-bayes, but if you plan to evaluate the database you made, then it is better to simply make the classifier and evaluate it in one go, via evaluate-fit-classifier. The resulting classifiers will be identical regardless of which of these two actions you use. The latter action will take much longer due to the evaluation steps.