Please consider this tutorial a living document that may change based on community feedback and ongoing plugin development. Please feel free to ask questions and provide feedback. Happy QIIME-ing!
How to train a GTDB SSU classifier using RESCRIPt.
The Genome Taxonomy Database (GTDB) is a great resource that strives to establish a standardized microbial taxonomy based on genome phylogeny. In this short tutorial we'll show you how to download the Small Sub-Unit (SSU) rRNA gene reference data from GTDB and train a classifier using RESCRIPt.
If you use RESCRIPt, and any of the associated GTDB data, in your research, please cite the following:
- Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581. doi: 10.1371/journal.pcbi.1009581
- Donovan H Parks, Maria Chuvochina, Pierre-Alain Chaumeil, Christian Rinke, Aaron J Mussig, Philip Hugenholtz. 2020. "A complete domain-to-species taxonomy for Bacteria and Archaea". Nature Biotechnology 38: 1079-1086. doi: 10.1038/s41587-020-0501-8
- Donovan H Parks, Maria Chuvochina, Christian Rinke, Aaron J Mussig, Pierre-Alain Chaumeil, Philip Hugenholtz. 2021. "GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy". Nucleic Acids Research 50: D785-D794. doi: 10.1093/nar/gkab776
Please see the GTDB "About" page for more details.
To run this tutorial you'll need:
- QIIME 2 version 2023.2 or later.
- Latest version of RESCRIPt.
Tutorial for prior version of `get-gtdb-data`
Tutorial
Download SSU reference taxonomy and sequences from GTDB.
GTDB currently provides SSU data for Bacteria and Archaea. By default, `rescript get-gtdb-data` will download the reference sequence and taxonomy data for both domains of the latest known version of GTDB. For the example below, we'll define the optional `--p-version` and `--p-domain` parameters explicitly.
qiime rescript get-gtdb-data \
--p-version '214' \
--p-domain 'Both' \
--o-gtdb-taxonomy gtdb-214-both-tax.qza \
--o-gtdb-sequences gtdb-214-both-seqs.qza \
--verbose
Optional Curation
If you'd like to further curate the GTDB data you've downloaded you can look to the other RESCRIPt tutorials listed below for inspiration. For example, you might need to construct an amplicon-specific classifier.
- General RESCRIPt tutorial, using SILVA as an example.
- Extract sequence segments w/o PCR primers.
- Construct a taxonomy and sequence reference set from NCBI.
- Importing lower-case sequences using RDP as an example.
Train the GTDB classifier
For the sake of simplicity we'll forgo any curation and train our full-length GTDB SSU classifier!
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads gtdb-214-both-seqs.qza \
--i-reference-taxonomy gtdb-214-both-tax.qza \
--o-classifier gtdb-214-both-classifier.qza
Evaluate and Train
You can also train and evaluate your classifier simultaneously. That is, you can use the following command in place of the command immediately presented above.
qiime rescript evaluate-fit-classifier \
--i-sequences gtdb-214-both-seqs.qza \
--i-taxonomy gtdb-214-both-tax.qza \
--p-n-jobs 2 \
--o-classifier gtdb-214-both-classifier.qza \
--o-observed-taxonomy gtdb-214-both-predicted-taxonomy.qza \
--o-evaluation gtdb-214-both-classifier-evaluation.qzv
Let's also evaluate the taxonomy.
qiime rescript evaluate-taxonomy \
--i-taxonomies gtdb-214-both-tax.qza gtdb-214-both-predicted-taxonomy.qza \
--p-labels ref-taxonomy predicted-taxonomy \
--o-taxonomy-stats gtdb-214-both-taxonomy-evaluation.qzv
Additional notes:
Download specific domain and version
If you only require a specific microbial domain, i.e. `Bacteria` or `Archaea`, you can provide either one as the value for `--p-domain`. While we're at it, let's also request an older version of GTDB with the `--p-version` option. Note: RESCRIPt currently provides access to the following GTDB versions: `202`, `207`, and `214`.
qiime rescript get-gtdb-data \
--p-version '202' \
--p-domain 'Bacteria' \
--o-gtdb-taxonomy gtdb-202-bacteria-tax.qza \
--o-gtdb-sequences gtdb-202-bacteria-seqs.qza
A note on taxonomy information
You may notice that the downloaded taxonomy file contains far more entries than the corresponding SSU sequence file. This is because the taxonomy file contains information for all of the genome data within GTDB. Not every genome has an available SSU sequence: the sequence is either absent or does not meet GTDB's quality-control standards. Thus, the SSU sequence files are a subset of the available genome / taxonomy data.
There is no need to worry about these extra taxonomy entries. However, if you'd like to remove them, you can run the command below to keep only the taxonomy entries that match those within the SSU sequence file. Again, this filtering is not required for downstream taxonomic identification or for constructing your classifier, as any excess taxonomy entries will be ignored anyway. You can use the filtered output `gtdb-214-both-tax-filt.qza` in place of `gtdb-214-both-tax.qza` for the above "evaluate-taxonomy" command. Future versions of RESCRIPt will likely include options to download genome data too, thereby fully leveraging the available taxonomy and other information that GTDB has to offer.
qiime rescript filter-taxa \
--i-taxonomy gtdb-214-both-tax.qza \
--m-ids-to-keep-file gtdb-214-both-seqs.qza \
--o-filtered-taxonomy gtdb-214-both-tax-filt.qza \
--verbose
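To make the subset relationship concrete, here is a toy sketch of what keeping only the matching taxonomy entries amounts to. It uses small stand-in text files with hypothetical IDs and taxa rather than real GTDB artifacts, and plain `sed`/`awk` rather than RESCRIPt itself, so treat it as an illustration of the idea only.

```shell
#!/usr/bin/env bash
# Toy illustration with stand-in files (not real GTDB data): keep only the
# taxonomy rows whose ID also appears as a FASTA header.

workdir="$(mktemp -d)"

# Hypothetical taxonomy table describing three genomes.
printf 'Feature ID\tTaxon\n'                  >  "$workdir/taxonomy.tsv"
printf 'genome_A\td__Bacteria; p__Example1\n' >> "$workdir/taxonomy.tsv"
printf 'genome_B\td__Bacteria; p__Example2\n' >> "$workdir/taxonomy.tsv"
printf 'genome_C\td__Archaea; p__Example3\n'  >> "$workdir/taxonomy.tsv"

# Hypothetical SSU FASTA: only two of the three genomes have a sequence.
printf '>genome_A\nACGT\n>genome_C\nTTGA\n' > "$workdir/seqs.fasta"

# Collect the sequence IDs (header text up to the first whitespace).
sed -n 's/^>\([^[:space:]]*\).*/\1/p' "$workdir/seqs.fasta" > "$workdir/ids.txt"

# Keep the header row plus every taxonomy row whose first column is a known ID.
awk -F '\t' 'NR == FNR { keep[$1]; next } FNR == 1 || ($1 in keep)' \
    "$workdir/ids.txt" "$workdir/taxonomy.tsv" > "$workdir/taxonomy-filt.tsv"

cat "$workdir/taxonomy-filt.tsv"
```

In this toy case the row for `genome_B` is dropped because it has no sequence, while the other two rows are kept.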
Current Tutorial:
Download SSU reference taxonomy and sequences from GTDB.
GTDB currently provides two different versions of reference data, All and SpeciesReps. All contains the SSU reference data that pass GTDB's quality control but are not clustered into representative species. Both Archaea and Bacteria are contained within these non-clustered data. The second option (the default) is the SpeciesReps reference data, which contain the SSU gene sequences identified within the set of representative species for each Domain, i.e. Archaea and Bacteria, separately. These exist as separate files because different sets of genes are used to define relationships within each Domain. The respective SpeciesReps Domains can be downloaded either separately or together.
By default, `rescript get-gtdb-data` will download the SpeciesReps reference sequence and taxonomy data for both Bacteria and Archaea. For the example below, we'll define the `--p-version`, `--p-db-type`, and `--p-domain` parameters explicitly.
SpeciesReps
qiime rescript get-gtdb-data \
--p-version '214.1' \
--p-db-type 'SpeciesReps' \
--p-domain 'Both' \
--o-gtdb-taxonomy gtdb-214-both-tax.qza \
--o-gtdb-sequences gtdb-214-both-seqs.qza \
--verbose
Download specific domain and version of SpeciesReps
If you only require a specific microbial domain, i.e. `Bacteria` or `Archaea`, you can provide either one as the value for `--p-domain`. While we're at it, let's also request an older version of GTDB with the `--p-version` option. Note: RESCRIPt currently provides access to the following GTDB versions: `202.0`, `207.0`, `214.0`, and `214.1`.
qiime rescript get-gtdb-data \
--p-version '202.0' \
--p-domain 'Bacteria' \
--o-gtdb-taxonomy gtdb-202-bacteria-tax.qza \
--o-gtdb-sequences gtdb-202-bacteria-seqs.qza
Non-SpeciesReps
Perhaps you'd like to curate the GTDB data yourself. In this case, you can simply run the command below to download the reference data that have not been clustered into species representatives. Note that when using `--p-db-type 'All'`, the `--p-domain` flag is ignored, as GTDB does not maintain non-clustered reference data separately by Domain.
qiime rescript get-gtdb-data \
--p-version '214.1' \
--p-db-type 'All' \
--o-gtdb-taxonomy gtdb-214-nonspeciesrep-tax.qza \
--o-gtdb-sequences gtdb-214-nonspeciesrep-seqs.qza \
--verbose
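If you switch between versions, database types, and domains often, a small dry-run helper can make the combinations easier to keep straight. The sketch below only prints a command; it does not run QIIME 2. The function name `gtdb_download_cmd` and the output file naming are illustrative assumptions, and the accepted version list is simply the one given in this tutorial, so your RESCRIPt install may accept others.

```shell
#!/usr/bin/env bash
# Dry-run helper: prints a get-gtdb-data command instead of running it.
# The function name and the output file naming are illustrative assumptions.

gtdb_download_cmd() {
    local version="$1" db_type="$2" domain="${3:-Both}"

    # Versions listed in this tutorial; your RESCRIPt install may accept others.
    case "$version" in
        202.0|207.0|214.0|214.1) ;;
        *) echo "unsupported version: $version" >&2; return 1 ;;
    esac

    local label domain_arg=""
    case "$db_type" in
        All)
            # --p-domain is ignored for 'All', so it is left out entirely.
            label="nonspeciesrep"
            ;;
        SpeciesReps)
            case "$domain" in
                Bacteria|Archaea|Both) ;;
                *) echo "unsupported domain: $domain" >&2; return 1 ;;
            esac
            label="$(printf '%s' "$domain" | tr '[:upper:]' '[:lower:]')"
            domain_arg=" --p-domain '$domain'"
            ;;
        *) echo "unsupported db type: $db_type" >&2; return 1 ;;
    esac

    # Use the major version number in the file names, e.g. 214.1 -> 214.
    local prefix="gtdb-${version%%.*}-${label}"
    printf '%s\n' "qiime rescript get-gtdb-data --p-version '$version' --p-db-type '$db_type'${domain_arg} --o-gtdb-taxonomy ${prefix}-tax.qza --o-gtdb-sequences ${prefix}-seqs.qza"
}

gtdb_download_cmd 214.1 SpeciesReps Both
gtdb_download_cmd 214.1 All
```

Review the printed command before running it yourself; an unrecognized version, database type, or domain makes the function return a non-zero status instead of printing anything.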
Optional Curation
If you'd like to further curate the GTDB data you've downloaded you can look to the other RESCRIPt tutorials listed below for inspiration. For example, you might need to construct an amplicon-specific classifier.
- General RESCRIPt tutorial, using SILVA as an example.
- Extract sequence segments w/o PCR primers.
- Construct a taxonomy and sequence reference set from NCBI.
- Importing lower-case sequences using RDP as an example.
Train the GTDB classifier
For the sake of simplicity we'll forgo any curation and train our full-length GTDB SSU classifier! We'll continue with the SpeciesReps files.
qiime feature-classifier fit-classifier-naive-bayes \
--i-reference-reads gtdb-214-both-seqs.qza \
--i-reference-taxonomy gtdb-214-both-tax.qza \
--o-classifier gtdb-214-both-classifier.qza
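Before launching the training step, it can be worth confirming that both input artifacts are actually present. The following is a minimal pre-flight sketch; the function name `require_files` is a hypothetical helper and not part of QIIME 2 or RESCRIPt.

```shell
#!/usr/bin/env bash
# Pre-flight check (the function name is an illustrative assumption).

require_files() {
    local missing=0 f
    for f in "$@"; do
        # -s is true only for files that exist and are not empty.
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

if require_files gtdb-214-both-seqs.qza gtdb-214-both-tax.qza; then
    echo "inputs found; ready to train"
else
    echo "download the reference data first"
fi
```

The check reports every missing file rather than stopping at the first one, so a single run tells you everything that still needs to be downloaded.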
Evaluate and Train
You can also train and evaluate your classifier simultaneously. That is, you can use the following command in place of the command immediately presented above.
qiime rescript evaluate-fit-classifier \
--i-sequences gtdb-214-both-seqs.qza \
--i-taxonomy gtdb-214-both-tax.qza \
--p-n-jobs 2 \
--o-classifier gtdb-214-both-classifier.qza \
--o-observed-taxonomy gtdb-214-both-predicted-taxonomy.qza \
--o-evaluation gtdb-214-both-classifier-evaluation.qzv
Let's also evaluate the taxonomy.
qiime rescript evaluate-taxonomy \
--i-taxonomies gtdb-214-both-tax.qza gtdb-214-both-predicted-taxonomy.qza \
--p-labels ref-taxonomy predicted-taxonomy \
--o-taxonomy-stats gtdb-214-both-taxonomy-evaluation.qzv
Now you're ready to use GTDB for classifying your reads!
Happy QIIME-ing!