How to train a GTDB SSU classifier using RESCRIPt

:construction: Please consider this tutorial a living document, which may change based upon community feedback and ongoing plugin development. Please feel free to ask questions and provide feedback. Happy :qiime2:ing!

The Genome Taxonomy Database (GTDB) is a great resource that strives to establish a standardized microbial taxonomy based on genome phylogeny. In this short tutorial we'll show you how to download the Small Sub-Unit (SSU) rRNA gene reference data from GTDB and train a classifier using RESCRIPt.

If you use RESCRIPt and any of the associated GTDB data in your research, please cite the following:

  • Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581. doi: 10.1371/journal.pcbi.1009581
  • Donovan H Parks, Maria Chuvochina, Pierre-Alain Chaumeil, Christian Rinke, Aaron J Mussig, Philip Hugenholtz. 2020. "A complete domain-to-species taxonomy for Bacteria and Archaea". Nature Biotechnology 38: 1079-1086. doi: 10.1038/s41587-020-0501-8
  • Donovan H Parks, Maria Chuvochina, Christian Rinke, Aaron J Mussig, Pierre-Alain Chaumeil, Philip Hugenholtz. 2021. "GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy". Nucleic Acids Research 50: D785-D794. doi: 10.1093/nar/gkab776

Please see the GTDB "About" page for more details.

To run this tutorial you'll need:

  • QIIME 2 version 2023.2 or later
  • The latest version of RESCRIPt

Tutorial for prior version of `get-gtdb-data`

Tutorial :train:

Download SSU reference taxonomy and sequences from GTDB. :inbox_tray:

GTDB currently provides SSU data for Bacteria and Archaea. By default, rescript get-gtdb-data will download the reference sequence and taxonomy data for both domains of the latest known version of GTDB. For the example below, we'll define the optional --p-version and --p-domain parameters explicitly.

qiime rescript get-gtdb-data \
    --p-version '214' \
    --p-domain 'Both' \
    --o-gtdb-taxonomy gtdb-214-both-tax.qza \
    --o-gtdb-sequences gtdb-214-both-seqs.qza \
    --verbose

Optional Curation :hammer_and_wrench:

If you'd like to further curate the GTDB data you've downloaded you can look to the other RESCRIPt tutorials listed below for inspiration. For example, you might need to construct an amplicon-specific classifier.

Train the GTDB classifier :teacher:

For the sake of simplicity we'll forgo any curation and train our full-length GTDB SSU classifier!

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads gtdb-214-both-seqs.qza \
    --i-reference-taxonomy gtdb-214-both-tax.qza \
    --o-classifier gtdb-214-both-classifier.qza

Evaluate and Train :bar_chart:

You can also train and evaluate your classifier simultaneously. That is, you can use the following command in place of the command immediately presented above.

qiime rescript evaluate-fit-classifier \
    --i-sequences gtdb-214-both-seqs.qza  \
    --i-taxonomy gtdb-214-both-tax.qza \
    --p-n-jobs 2 \
    --o-classifier gtdb-214-both-classifier.qza \
    --o-observed-taxonomy gtdb-214-both-predicted-taxonomy.qza \
    --o-evaluation gtdb-214-both-classifier-evaluation.qzv

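Let's also evaluate the taxonomy. Note that gtdb-214-both-tax-filt.qza is the filtered taxonomy produced by the filter-taxa command in the "A note on taxonomy information" section below, so run that command first.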

qiime rescript evaluate-taxonomy \
  --i-taxonomies gtdb-214-both-tax-filt.qza gtdb-214-both-predicted-taxonomy.qza \
  --p-labels ref-taxonomy predicted-taxonomy \
  --o-taxonomy-stats gtdb-214-both-taxonomy-evaluation.qzv

Additional notes: :spiral_notepad:

Download specific domain and version

If you only require a specific microbial domain, i.e. Bacteria or Archaea, you can request it by providing either as the value for --p-domain. While we're at it, let's also request an older version of GTDB with the --p-version option. Note: RESCRIPt currently provides access to the following GTDB versions: 202, 207, 214.

qiime rescript get-gtdb-data \
    --p-version '202' \
    --p-domain 'Bacteria' \
    --o-gtdb-taxonomy gtdb-202-bacteria-tax.qza \
    --o-gtdb-sequences gtdb-202-bacteria-seqs.qza

A note on taxonomy information

You may notice that there are far more entries in the downloaded taxonomy file than in the corresponding SSU sequence file. This is because the taxonomy file contains information for all of the genomes within GTDB, and not every genome has an available SSU sequence: either the sequence is not present or it does not meet GTDB's quality control standards. Thus, the SSU sequence files are a subset of the available genome / taxonomy data.
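One way to see this difference for yourself is to export both artifacts and count their records. The sketch below assumes the artifacts were exported with qiime tools export, which for these artifact types should yield a taxonomy.tsv (with a single header line) and a dna-sequences.fasta. The directory names and the two helper functions are hypothetical and only serve as an illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing the number of taxonomy entries with
# the number of SSU sequences. They operate on exported files, e.g.:
#   qiime tools export --input-path gtdb-214-both-tax.qza  --output-path exported-tax
#   qiime tools export --input-path gtdb-214-both-seqs.qza --output-path exported-seqs

# Count non-empty taxonomy rows, skipping the single header line.
count_taxonomy_rows() {
    tail -n +2 "$1" | grep -c .
}

# Count FASTA records (lines starting with '>').
count_fasta_records() {
    grep -c '^>' "$1"
}

# Usage (assumed export locations):
#   count_taxonomy_rows exported-tax/taxonomy.tsv
#   count_fasta_records exported-seqs/dna-sequences.fasta
```

If the first number is larger than the second, the extra rows are the genomes without an accepted SSU sequence.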

There is no need to worry about these extra taxonomy entries. However, if you'd like to remove them, you can run the command below to keep only the taxonomy entries that match those within the SSU sequence file. Again, this filtering is not required for downstream taxonomic identification or for constructing your classifier, as any excess taxonomy entries will be ignored anyway. Future versions of RESCRIPt will likely include options to download genome data too, thereby fully leveraging the available taxonomy and other information that GTDB has to offer.

qiime rescript filter-taxa \
    --i-taxonomy gtdb-214-both-tax.qza \
    --m-ids-to-keep-file gtdb-214-both-seqs.qza \
    --o-filtered-taxonomy gtdb-214-both-tax-filt.qza \
    --verbose

Current Tutorial :train:

Download SSU reference taxonomy and sequences from GTDB. :inbox_tray:

GTDB currently provides two types of reference data: All and SpeciesReps. All contains the SSU reference data that pass GTDB's quality control but are not clustered into representative species. Both Archaea and Bacteria are contained within these non-clustered data. The second option (the default) is the SpeciesReps reference data, which contain the SSU gene sequences identified within the set of representative species for each Domain, i.e. Archaea and Bacteria, separately. These exist as separate files because different sets of genes are used to define relationships within each Domain. The SpeciesReps data for each Domain can be downloaded either separately or together.

By default, rescript get-gtdb-data will download the SpeciesReps reference sequence and taxonomy data for both Bacteria and Archaea. For the example below, we'll define the --p-version, --p-db-type, and --p-domain parameters explicitly.

SpeciesReps

qiime rescript get-gtdb-data \
    --p-version '214.1' \
    --p-db-type 'SpeciesReps' \
    --p-domain 'Both' \
    --o-gtdb-taxonomy gtdb-214-both-tax.qza \
    --o-gtdb-sequences gtdb-214-both-seqs.qza \
    --verbose

Download specific domain and version of SpeciesReps

If you only require a specific microbial domain, i.e. Bacteria or Archaea, you can request it by providing either as the value for --p-domain. While we're at it, let's also request an older version of GTDB with the --p-version option. Note: RESCRIPt currently provides access to the following GTDB versions: 202.0, 207.0, 214.0, 214.1.

qiime rescript get-gtdb-data \
    --p-version '202.0' \
    --p-domain 'Bacteria' \
    --o-gtdb-taxonomy gtdb-202-bacteria-tax.qza \
    --o-gtdb-sequences gtdb-202-bacteria-seqs.qza
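
If you need several domain and version combinations, a small shell helper can assemble the commands for you. The sketch below only prints each command so that you can review it before running anything. The function name gtdb_download_cmd and the output naming scheme are our own conventions for illustration; only flags already shown above are used.

```shell
#!/usr/bin/env bash
# Print (do not run) a get-gtdb-data command for a given version and domain.
# gtdb_download_cmd is a hypothetical helper; output names follow the
# gtdb-<major version>-<domain>-{tax,seqs}.qza pattern used in this tutorial.
gtdb_download_cmd() {
    local version="$1" domain="$2"
    local major="${version%%.*}"    # '202.0' -> '202'
    local tag
    tag="$(printf '%s' "$domain" | tr '[:upper:]' '[:lower:]')"    # 'Bacteria' -> 'bacteria'
    printf "qiime rescript get-gtdb-data --p-version '%s' --p-domain '%s' --o-gtdb-taxonomy gtdb-%s-%s-tax.qza --o-gtdb-sequences gtdb-%s-%s-seqs.qza\n" \
        "$version" "$domain" "$major" "$tag" "$major" "$tag"
}

# Print one command per version/domain combination.
for version in 207.0 214.1; do
    for domain in Bacteria Archaea; do
        gtdb_download_cmd "$version" "$domain"
    done
done
```

Once the printed commands look right, they can be copied into your terminal or piped to bash.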

Non-SpeciesReps

Perhaps you'd like to curate the GTDB data yourself. In this case, you can simply run the command below to download the reference data that has not been clustered into species representatives. Note that when using --p-db-type 'All', the --p-domain flag is ignored, as GTDB does not maintain non-clustered reference data separately by Domain.

qiime rescript get-gtdb-data \
    --p-version '214.1' \
    --p-db-type 'All' \
    --o-gtdb-taxonomy gtdb-214-nonspeciesrep-tax.qza \
    --o-gtdb-sequences gtdb-214-nonspeciesrep-seqs.qza \
    --verbose

Optional Curation :hammer_and_wrench:

If you'd like to further curate the GTDB data you've downloaded you can look to the other RESCRIPt tutorials listed below for inspiration. For example, you might need to construct an amplicon-specific classifier.

Train the GTDB classifier :teacher:

For the sake of simplicity we'll forgo any curation and train our full-length GTDB SSU classifier! We'll continue with the SpeciesReps files.

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads gtdb-214-both-seqs.qza \
    --i-reference-taxonomy gtdb-214-both-tax.qza \
    --o-classifier gtdb-214-both-classifier.qza

Evaluate and Train :bar_chart:

You can also train and evaluate your classifier simultaneously. That is, you can use the following command in place of the command immediately presented above.

qiime rescript evaluate-fit-classifier \
    --i-sequences gtdb-214-both-seqs.qza  \
    --i-taxonomy gtdb-214-both-tax.qza \
    --p-n-jobs 2 \
    --o-classifier gtdb-214-both-classifier.qza \
    --o-observed-taxonomy gtdb-214-both-predicted-taxonomy.qza \
    --o-evaluation gtdb-214-both-classifier-evaluation.qzv

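Let's also evaluate the taxonomy. Note that gtdb-214-both-tax-filt.qza is the filtered taxonomy produced by the filter-taxa command in the "A note on taxonomy information" section above, so run that command first.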

qiime rescript evaluate-taxonomy \
  --i-taxonomies gtdb-214-both-tax-filt.qza gtdb-214-both-predicted-taxonomy.qza \
  --p-labels ref-taxonomy predicted-taxonomy \
  --o-taxonomy-stats gtdb-214-both-taxonomy-evaluation.qzv

:fireworks: Now you're ready to use GTDB for classifying your reads! :tada:
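
As a minimal sketch of that next step, the trained classifier can be passed to classify-sklearn from the q2-feature-classifier plugin. The names rep-seqs.qza and taxonomy.qza are placeholders for your own query sequences and output, and the run helper with its DRY_RUN switch is a hypothetical convenience that prints the command unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: print the command by default, execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# rep-seqs.qza (your query sequences) and taxonomy.qza are placeholder names.
run qiime feature-classifier classify-sklearn \
    --i-classifier gtdb-214-both-classifier.qza \
    --i-reads rep-seqs.qza \
    --o-classification taxonomy.qza
```

Printing first makes it easy to check file names before committing to a long-running classification job.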

Happy :qiime2: -ing!
