How to train a GTDB SSU classifier using RESCRIPt

:construction: Please consider this tutorial a living document, which may change based upon community feedback and ongoing plugin development. Please feel free to ask questions and provide feedback. Happy :qiime2:ing!


The Genome Taxonomy Database (GTDB) is a great resource that strives to establish a standardized microbial taxonomy based on genome phylogeny. In this short tutorial we'll show you how to download the Small Sub-Unit (SSU) rRNA gene reference data from GTDB and train a classifier using RESCRIPt.

If you use RESCRIPt and any of the associated GTDB data in your research, please cite the following:

  • Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581. doi: 10.1371/journal.pcbi.1009581
  • Donovan H Parks, Maria Chuvochina, Pierre-Alain Chaumeil, Christian Rinke, Aaron J Mussig, Philip Hugenholtz. 2020. "A complete domain-to-species taxonomy for Bacteria and Archaea". Nature Biotechnology 38: 1079-1086. doi: 10.1038/s41587-020-0501-8
  • Donovan H Parks, Maria Chuvochina, Christian Rinke, Aaron J Mussig, Pierre-Alain Chaumeil, Philip Hugenholtz. 2021. "GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy". Nucleic Acids Research 50: D785-D794. doi: 10.1093/nar/gkab776

Please see the GTDB "About" page for more details.

To run this tutorial you'll need:

  • QIIME 2 version 2023.2 or later.
  • The latest version of RESCRIPt.

Tutorial :train:

Download SSU reference taxonomy and sequences from GTDB. :inbox_tray:

GTDB currently provides SSU data for Bacteria and Archaea. By default, rescript get-gtdb-data will download the reference sequence and taxonomy data for both domains from the latest version of GTDB known to RESCRIPt. For the example below, we'll set the optional --p-version and --p-domain parameters explicitly.

qiime rescript get-gtdb-data \
    --p-version '207' \
    --p-domain 'Both' \
    --o-gtdb-taxonomy gtdb-207-both-tax.qza \
    --o-gtdb-sequences gtdb-207-both-seqs.qza

Optional Curation :hammer_and_wrench:

If you'd like to further curate the GTDB data you've downloaded you can look to the other RESCRIPt tutorials listed below for inspiration. For example, you might need to construct an amplicon-specific classifier.
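As one possible starting point for an amplicon-specific classifier, the sketch below trims the GTDB sequences to a primer-bounded region with qiime feature-classifier extract-reads before training. The primer pair (the commonly used 515F/806R pair for the V4 region) and the output file name are assumptions chosen for illustration; substitute the primers used in your own study.

```shell
# Assumed primer pair (515F/806R, V4 region); replace with your own primers.
F_PRIMER='GTGYCAGCMGCCGCGGTAA'
R_PRIMER='GGACTACNVGGGTWTCTAAT'
SEQS='gtdb-207-both-seqs.qza'

# Only run when QIIME 2 is on the PATH and the downloaded sequences exist.
if command -v qiime >/dev/null 2>&1 && [ -f "$SEQS" ]; then
    qiime feature-classifier extract-reads \
        --i-sequences "$SEQS" \
        --p-f-primer "$F_PRIMER" \
        --p-r-primer "$R_PRIMER" \
        --o-reads gtdb-207-both-seqs-v4.qza  # hypothetical output name
else
    echo "Skipping: qiime or $SEQS not found" >&2
fi
```

The extracted reads could then be passed to the training command in place of the full-length sequences.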

Train the GTDB classifier :teacher:

For the sake of simplicity we'll forgo any curation and train our full-length GTDB SSU classifier!

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads gtdb-207-both-seqs.qza \
    --i-reference-taxonomy gtdb-207-both-tax.qza \
    --o-classifier gtdb-207-both-classifier.qza

Evaluate and Train :bar_chart:

You can also train and evaluate your classifier simultaneously. That is, you can use the following command in place of the fit-classifier-naive-bayes command above.

qiime rescript evaluate-fit-classifier \
    --i-sequences gtdb-207-both-seqs.qza \
    --i-taxonomy gtdb-207-both-tax.qza \
    --p-n-jobs 2 \
    --o-classifier gtdb-207-both-classifier.qza \
    --o-observed-taxonomy gtdb-207-both-predicted-taxonomy.qza \
    --o-evaluation gtdb-207-both-classifier-evaluation.qzv

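Let's also evaluate the taxonomy. Note that gtdb-207-both-tax-filt.qza, used below, is the filtered taxonomy produced by the filter-taxa command in the "A note on taxonomy information" section.

qiime rescript evaluate-taxonomy \
    --i-taxonomies gtdb-207-both-tax-filt.qza gtdb-207-both-predicted-taxonomy.qza \
    --p-labels ref-taxonomy predicted-taxonomy \
    --o-taxonomy-stats gtdb-207-both-taxonomy-evaluation.qzv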

Additional notes: :spiral_notepad:

Download specific domain and version

If you only need a single microbial domain, i.e. Bacteria or Archaea, provide it as the value of --p-domain. While we're at it, let's also request an older version of GTDB with the --p-version option. Note: RESCRIPt currently provides access to the two latest versions of GTDB: 202 and 207.

qiime rescript get-gtdb-data \
    --p-version '202' \
    --p-domain 'Bacteria' \
    --o-gtdb-taxonomy gtdb-202-bacteria-tax.qza \
    --o-gtdb-sequences gtdb-202-bacteria-seqs.qza
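If you switch between versions and domains often, it can help to derive the output file names from the two parameters. The following is a minimal sketch under that assumption; the variable names (GTDB_VERSION, GTDB_DOMAIN, PREFIX) are made up for this example and are not part of RESCRIPt.

```shell
# Hypothetical helper variables; edit these two values as needed.
GTDB_VERSION='207'
GTDB_DOMAIN='Both'   # 'Both', 'Bacteria', or 'Archaea'

# Lowercase the domain to build a file-name prefix such as gtdb-207-both.
domain_lc=$(printf '%s' "$GTDB_DOMAIN" | tr '[:upper:]' '[:lower:]')
PREFIX="gtdb-${GTDB_VERSION}-${domain_lc}"

# Only call QIIME 2 if it is on the PATH (i.e. the environment is activated).
if command -v qiime >/dev/null 2>&1; then
    qiime rescript get-gtdb-data \
        --p-version "$GTDB_VERSION" \
        --p-domain "$GTDB_DOMAIN" \
        --o-gtdb-taxonomy "${PREFIX}-tax.qza" \
        --o-gtdb-sequences "${PREFIX}-seqs.qza"
else
    echo "qiime not found; activate your QIIME 2 environment first" >&2
fi
```

With the values shown, the outputs follow the same naming pattern used throughout this tutorial (gtdb-207-both-tax.qza and gtdb-207-both-seqs.qza).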

A note on taxonomy information

You may notice that the downloaded taxonomy file contains far more entries than the corresponding SSU sequence file. This is because the taxonomy file contains information for all of the genomes in GTDB, and not every genome has an available SSU sequence: the sequence is either absent or does not meet GTDB's quality control standards. Thus, the SSU sequence files are a subset of the available genome / taxonomy data.
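If you'd like to see the difference for yourself, one way is to export both artifacts and count their entries. This is only a sketch: the two helper functions are hypothetical, and it assumes the usual exported file names (taxonomy.tsv with a single header line, and dna-sequences.fasta).

```shell
# Count FASTA records, i.e. lines that start with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Count taxonomy rows, skipping the single header line of the TSV.
count_taxonomy_rows() {
    tail -n +2 "$1" | wc -l | tr -d ' '
}

# Only run the export when QIIME 2 and both artifacts are available.
if command -v qiime >/dev/null 2>&1 \
    && [ -f gtdb-207-both-tax.qza ] && [ -f gtdb-207-both-seqs.qza ]; then
    qiime tools export --input-path gtdb-207-both-tax.qza --output-path exported-tax
    qiime tools export --input-path gtdb-207-both-seqs.qza --output-path exported-seqs
    echo "taxonomy entries: $(count_taxonomy_rows exported-tax/taxonomy.tsv)"
    echo "sequence records: $(count_fasta_records exported-seqs/dna-sequences.fasta)"
fi
```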

There is no need to worry about these extra taxonomy entries. However, if you'd like to remove them, you can run the command below to keep only the taxonomy entries that match those within the SSU sequence file. Again, this filtering is not required for downstream taxonomic identification or for constructing your classifier, as any excess taxonomy entries will be ignored anyway. Future versions of RESCRIPt will likely include options to download genome data too, thereby fully leveraging the available taxonomy and other information that GTDB has to offer.

qiime rescript filter-taxa \
    --i-taxonomy gtdb-207-both-tax.qza \
    --m-ids-to-keep-file gtdb-207-both-seqs.qza \
    --o-filtered-taxonomy gtdb-207-both-tax-filt.qza

:fireworks: Now you're ready to use GTDB for classifying your reads! :tada:
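As a rough illustration of that next step, the trained classifier can be applied with qiime feature-classifier classify-sklearn. The input rep-seqs.qza (representative sequences from your own study) and the output file name are hypothetical placeholders.

```shell
# Hypothetical input: representative sequences from your own study.
REP_SEQS='rep-seqs.qza'
CLASSIFIER='gtdb-207-both-classifier.qza'

# Only run when QIIME 2 is on the PATH and both input artifacts exist.
if command -v qiime >/dev/null 2>&1 && [ -f "$REP_SEQS" ] && [ -f "$CLASSIFIER" ]; then
    qiime feature-classifier classify-sklearn \
        --i-classifier "$CLASSIFIER" \
        --i-reads "$REP_SEQS" \
        --o-classification taxonomy-gtdb-207.qza  # hypothetical output name
else
    echo "Skipping: qiime, $REP_SEQS, or $CLASSIFIER not found" >&2
fi
```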

Happy :qiime2:ing!