Please consider this tutorial a living document that may change based on community feedback and ongoing plugin development. Please feel free to ask questions and provide feedback. Happy QIIMEing!
How to train a GTDB SSU classifier using RESCRIPt.
The Genome Taxonomy Database (GTDB) is a great resource that strives to establish a standardized microbial taxonomy based on genome phylogeny. In this short tutorial we'll show you how to download the Small Sub-Unit (SSU) rRNA gene reference data from GTDB and train a classifier using RESCRIPt.
If you use RESCRIPt and any of the associated GTDB data in your research, please cite the following:
- Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581. doi: 10.1371/journal.pcbi.1009581
- Donovan H Parks, Maria Chuvochina, Pierre-Alain Chaumeil, Christian Rinke, Aaron J Mussig, Philip Hugenholtz. 2020. "A complete domain-to-species taxonomy for Bacteria and Archaea". Nature Biotechnology 38: 1079-1086. doi: 10.1038/s41587-020-0501-8
- Donovan H Parks, Maria Chuvochina, Christian Rinke, Aaron J Mussig, Pierre-Alain Chaumeil, Philip Hugenholtz. 2021. "GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy". Nucleic Acids Research 50: D785-D794. doi: 10.1093/nar/gkab776
Please see the GTDB "About" page for more details.
To run this tutorial you'll need:
- QIIME 2 version 2023.2 or later.
- Latest version of RESCRIPt.
Download SSU reference taxonomy and sequences from GTDB.
GTDB currently provides SSU data for Bacteria and Archaea. By default, rescript get-gtdb-data will download the reference sequence and taxonomy data for both domains from the latest known version of GTDB. For the example below, we'll define the optional --p-version and --p-domain parameters explicitly.
qiime rescript get-gtdb-data \
    --p-version '207' \
    --p-domain 'Both' \
    --o-gtdb-taxonomy gtdb-207-both-tax.qza \
    --o-gtdb-sequences gtdb-207-both-seqs.qza \
    --verbose
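Because the version and domain appear both in the parameters and in every output file name, it can help to set them once as shell variables. The following is a minimal sketch of that idea. The variable names (GTDB_VERSION, GTDB_DOMAIN, DOMAIN_LC, PREFIX) are our own and not part of RESCRIPt, and the download is only attempted if qiime is found on your PATH.

```shell
# Hypothetical helper variables (not part of RESCRIPt); edit to taste.
GTDB_VERSION='207'
GTDB_DOMAIN='Both'

# Build a lower-case file prefix such as "gtdb-207-both".
DOMAIN_LC=$(printf '%s' "$GTDB_DOMAIN" | tr '[:upper:]' '[:lower:]')
PREFIX="gtdb-${GTDB_VERSION}-${DOMAIN_LC}"
echo "Output files will start with: ${PREFIX}"

# Only attempt the download when QIIME 2 is available in this shell.
if command -v qiime >/dev/null 2>&1; then
    qiime rescript get-gtdb-data \
        --p-version "$GTDB_VERSION" \
        --p-domain "$GTDB_DOMAIN" \
        --o-gtdb-taxonomy "${PREFIX}-tax.qza" \
        --o-gtdb-sequences "${PREFIX}-seqs.qza" \
        --verbose
else
    echo "qiime not found; activate your QIIME 2 environment first." >&2
fi
```

With the values shown, the output names match the ones used throughout this tutorial (gtdb-207-both-tax.qza and gtdb-207-both-seqs.qza).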
If you'd like to further curate the GTDB data you've downloaded you can look to the other RESCRIPt tutorials listed below for inspiration. For example, you might need to construct an amplicon-specific classifier.
- General RESCRIPt tutorial, using SILVA as an example.
- Extract sequence segments w/o PCR primers.
- Construct a taxonomy and sequence reference set from NCBI.
- Importing lower-case sequences using RDP as an example.
Train the GTDB classifier
For the sake of simplicity we'll forgo any curation and train our full-length GTDB SSU classifier!
qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads gtdb-207-both-seqs.qza \
    --i-reference-taxonomy gtdb-207-both-tax.qza \
    --o-classifier gtdb-207-both-classifier.qza
Evaluate and Train
You can also train and evaluate your classifier simultaneously. That is, you can use the following command in place of the fit-classifier-naive-bayes command above.
qiime rescript evaluate-fit-classifier \
    --i-sequences gtdb-207-both-seqs.qza \
    --i-taxonomy gtdb-207-both-tax.qza \
    --p-n-jobs 2 \
    --o-classifier gtdb-207-both-classifier.qza \
    --o-observed-taxonomy gtdb-207-both-predicted-taxonomy.qza \
    --o-evaluation gtdb-207-both-classifier-evaluation.qzv
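Let's also evaluate the taxonomy. Note that the command below takes the filtered reference taxonomy, gtdb-207-both-tax-filt.qza, which is produced by the filter-taxa command in the section "A note on taxonomy information" at the end of this tutorial, so run that command first.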
qiime rescript evaluate-taxonomy \
    --i-taxonomies gtdb-207-both-tax-filt.qza gtdb-207-both-predicted-taxonomy.qza \
    --p-labels ref-taxonomy predicted-taxonomy \
    --o-taxonomy-stats gtdb-207-both-taxonomy-evaluation.qzv
Download specific domain and version
If you only require a specific microbial domain, e.g. Bacteria or Archaea, you can provide either of these as the value for --p-domain. While we're at it, let's also request an older version of GTDB with the --p-version option. Note: RESCRIPt currently allows access to the two latest versions of GTDB.
qiime rescript get-gtdb-data \
    --p-version '202' \
    --p-domain 'Bacteria' \
    --o-gtdb-taxonomy gtdb-202-bacteria-tax.qza \
    --o-gtdb-sequences gtdb-202-bacteria-seqs.qza
A note on taxonomy information
You may notice that there are far more entries in the downloaded taxonomy file than in the corresponding SSU sequence file. This is because the taxonomy file contains information for all of the genomes in GTDB, and not every genome has an available SSU sequence: the sequence is either absent or does not meet GTDB's quality control standards. Thus, the SSU sequence files are a subset of the available genome / taxonomy data.
There is no need to worry about these extra taxonomy entries. However, if you'd like to remove them, you can run the command below to keep only the taxonomy entries that match those in the SSU sequence file. Again, this filtering is not required for downstream taxonomic identification or for constructing your classifier, as any excess taxonomy entries will be ignored anyway. Future versions of RESCRIPt will likely include options to download genome data too, thereby fully leveraging the taxonomy and other information that GTDB has to offer.
qiime rescript filter-taxa \
    --i-taxonomy gtdb-207-both-tax.qza \
    --m-ids-to-keep-file gtdb-207-both-seqs.qza \
    --o-filtered-taxonomy gtdb-207-both-tax-filt.qza \
    --verbose
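To make the idea of ID-based filtering concrete, here is a small sketch on toy data. It is only an illustration of keeping taxonomy rows whose IDs also occur among the sequences, not RESCRIPt's implementation. All IDs, taxon strings, and file names in it are made up.

```shell
# Toy illustration of ID-based filtering; all IDs and file names are made up.
WORKDIR=$(mktemp -d)

# A taxonomy table with three entries (tab-separated, with a header row).
printf 'Feature ID\tTaxon\n' > "$WORKDIR/taxonomy.tsv"
printf 'genomeA\td__Bacteria; p__Example1\n' >> "$WORKDIR/taxonomy.tsv"
printf 'genomeB\td__Bacteria; p__Example2\n' >> "$WORKDIR/taxonomy.tsv"
printf 'genomeC\td__Archaea; p__Example3\n' >> "$WORKDIR/taxonomy.tsv"

# A FASTA file with SSU sequences for only two of those entries.
printf '>genomeA\nACGT\n>genomeC\nTTGA\n' > "$WORKDIR/seqs.fasta"

# First file: collect sequence IDs from the FASTA header lines.
# Second file: keep the header row plus rows whose ID was collected.
awk -F '\t' '
    NR == FNR { if (/^>/) ids[substr($1, 2)] = 1; next }
    FNR == 1 || ($1 in ids)
' "$WORKDIR/seqs.fasta" "$WORKDIR/taxonomy.tsv" > "$WORKDIR/taxonomy-filt.tsv"

# Count entries (lines minus the header row) before and after filtering.
N_BEFORE=$(($(wc -l < "$WORKDIR/taxonomy.tsv") - 1))
N_AFTER=$(($(wc -l < "$WORKDIR/taxonomy-filt.tsv") - 1))
echo "Taxonomy entries before: ${N_BEFORE}, after: ${N_AFTER}"

# Remove the temporary toy files.
rm -rf "$WORKDIR"
```

In this toy case genomeB has no sequence, so its taxonomy row is dropped, mirroring how the SSU sequences are a subset of the genome taxonomy.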
Now you're ready to use GTDB for classifying your reads!
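As a sketch of that last step, the trained classifier can be passed to qiime feature-classifier classify-sklearn. The input rep-seqs.qza and the output name below are placeholders for your own files, not files produced in this tutorial, and the command is only attempted when both input files exist.

```shell
# Classifier trained earlier in this tutorial.
CLASSIFIER='gtdb-207-both-classifier.qza'
# Placeholder names: substitute your own representative sequences and output.
READS='rep-seqs.qza'
CLASSIFICATION='gtdb-207-both-classification.qza'

# Run the classification only when both input artifacts are present.
if [ -f "$CLASSIFIER" ] && [ -f "$READS" ]; then
    qiime feature-classifier classify-sklearn \
        --i-classifier "$CLASSIFIER" \
        --i-reads "$READS" \
        --o-classification "$CLASSIFICATION"
    STATUS='attempted'
else
    echo "Missing ${CLASSIFIER} or ${READS}; nothing to classify yet." >&2
    STATUS='skipped'
fi
echo "Classification step: ${STATUS}"
```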