Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt

:construction: Please consider this tutorial a living document, which may change based upon community feedback and ongoing plugin development.

RESCRIPt

RESCRIPt (REference Sequence annotation and CuRatIon Pipeline) is a python package and QIIME 2 plugin for formatting, managing, and manipulating sequence reference databases. This package was designed for compiling, manipulating, and evaluating sequence reference databases from SILVA, NCBI, Greengenes, GTDB, and other sources, and for constructing reference databases for use with QIIME 2 (or other microbiome analysis software and taxonomy classifiers). This tutorial describes several primary pipelines and actions available in RESCRIPt, with a focus on the SILVA 16S rRNA gene database.


:exclamation:Note: This tutorial focuses on use of RESCRIPt with SILVA data; see here for other tutorials using NCBI, COI data, and more! :scream:


Citation:

If you use RESCRIPt or any RESCRIPt-processed data in your research, please cite the following:

  • Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581.; doi: 10.1371/journal.pcbi.1009581

If you make use of SILVA (the example database highlighted in this tutorial), please be sure to cite the following in your work too:

  • Pruesse, Elmar, Christian Quast, Katrin Knittel, Bernhard M. Fuchs, Wolfgang Ludwig, Jörg Peplies, and Frank Oliver Glöckner. 2007. “SILVA: A Comprehensive Online Resource for Quality Checked and Aligned Ribosomal RNA Sequence Data Compatible with ARB.” Nucleic Acids Research 35 (21): 7188–96. doi: 10.1093/nar/gkm864

  • Quast, Christian, Elmar Pruesse, Pelin Yilmaz, Jan Gerken, Timmy Schweer, Pablo Yarza, Jörg Peplies, and Frank Oliver Glöckner. 2013. “The SILVA Ribosomal RNA Gene Database Project: Improved Data Processing and Web-Based Tools.” Nucleic Acids Research 41: D590–96. doi: 10.1093/nar/gks1219

:fireworks: Provenance tracked sequence database generation. Woohoo! :fireworks:

Table of Contents

  1. Preparing the SILVA reference database
    a. Getting SILVA data the easy way
    b. “Culling” low-quality sequences with cull-seqs
    c. Filtering sequences by length and taxonomy
    d. Dereplication of sequences and taxonomy
    e. Make amplicon-region specific classifier
  2. Database (and sequence) evaluation functions
    a. Sequence database evaluation with the evaluate-* family of actions
    b. Evaluate classification accuracy with evaluate-classifications
    c. Evaluate taxonomic information with evaluate-taxonomy
  3. General sequence and taxonomy data operations
    a. Filtering sequences by length
    b. Orient sequences by alignment to reference
    c. Reverse Transcribe
    d. De-gapping alignments
    e. Editing taxonomy
  4. Merging taxonomy results
  5. Visualization gallery

:partying_face:
Note: The tutorial below uses primarily SILVA data by way of comparison, but note that all of the steps demonstrated beyond step 1a can be applied to any sequence reference data. Feel free to modify any of the steps in this tutorial, or their order, to best suit your needs!

Preparing the SILVA reference database

We'll use RESCRIPt to prepare a QIIME 2 compatible SSU SILVA reference database based on the curated NR99 (version 138.1) database. We chose this version mainly for the reasons outlined here.

SILVA compilation pipeline

  • Below is a simple example outline of the steps involved for constructing a QIIME 2 compatible reference from SILVA.
  • Begin by downloading the relevant taxonomy and sequence files from the SILVA.
  • Import these files into QIIME 2.
  • Prepare a fixed-rank taxonomy file.
  • Remove sequences with excessive degenerate bases and homopolymers.
  • Remove sequences that may be too short and/or long. With the option to condition the length filtering based on taxonomy.
  • Dereplicate the sequences and taxonomy.
  • Build our classifier.

:warning:
NOTE: pre-processed SILVA sequence and taxonomy Artifacts (and taxonomy classifiers) have been generated and released by the QIIME 2 team, following the same steps described below. You can get the pre-processed Artifacts and classifiers here: Data resources — QIIME 2 2024.2.0 documentation

For purposes of this tutorial, we’ll use the current SILVA SSU release (version 138.1). There are two approaches you can use to import and process the SILVA reference taxonomy and sequences for use in QIIME 2. We’ll start with “Getting SILVA data the easy way”. However, we recommend that you at least read through the "hard-way" steps to understand what RESCRIPt is doing, and how the SILVA taxonomy is parsed).

Getting SILVA data: Hard Mode.

Click on the triangle below for more details.

The gritty details

Download SILVA files

First, we’ll need to go to the SILVA v138.1 archive to obtain:

the following taxonomy files:

  • tax_slv_ssu_138.1.txt.gz
  • taxmap_slv_ssu_ref_nr_138.1.txt.gz
  • tax_slv_ssu_138.1.tre.gz
    the sequence file:
  • SILVA_138.1_SSURef_Nr99_tax_silva_trunc.fasta.gz

You can download the files through your browser directly. We’ll make use of wget to download the files from the command line, then gunzip these files prior to importing into QIIME 2.

Download the Taxonomy Rank file. This maps the taxonomic rank and taxonomy to the taxid.

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/taxonomy/tax_slv_ssu_138.1.txt.gz

gunzip tax_slv_ssu_138.1.txt.gz

Download the Taxonomy Map file. This maps the sequence Accessions to the Organism Name and Taxonomy IDs.

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/taxonomy/taxmap_slv_ssu_ref_nr_138.1.txt.gz

gunzip taxmap_slv_ssu_ref_nr_138.1.txt.gz

Download the Taxonomy Tree file. This file contains the hierarchical relationship of the taxonomy IDs in tree form.

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/taxonomy/tax_slv_ssu_138.1.tre.gz

gunzip tax_slv_ssu_138.1.tre.gz

Download the SILVA NR99 sequences (non-redundant and unaligned)

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz

gunzip SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz

Import SILVA files into QIIME 2

Import the Taxonomy Rank file:

qiime tools import \
    --type 'FeatureData[SILVATaxonomy]' \
    --input-path tax_slv_ssu_138.1.txt \
    --output-path taxranks-silva-138.1-ssu-nr99.qza 

Import the Taxonomy Mapping file

qiime tools import \
    --type 'FeatureData[SILVATaxidMap]' \
    --input-path taxmap_slv_ssu_ref_nr_138.1.txt \
    --output-path taxmap-silva-138.1-ssu-nr99.qza 

Import the Taxonomy Hierarchy Tree file:

qiime tools import \
    --type 'Phylogeny[Rooted]' \
    --input-path tax_slv_ssu_138.1.tre \
    --output-path taxtree-silva-138.1-nr99.qza 

Import the sequence file:

qiime tools import \
    --type 'FeatureData[RNASequence]' \
    --input-path SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta \
    --output-path silva-138.1-ssu-nr99-rna-seqs.qza

Note, the data exist within SILVA as RNA sequences, and thus have been imported as FeatureData[RNASequence]. To make sure things run smoothly downstream we'll convert the data to FeatureData[DNASequence] like so:

qiime rescript reverse-transcribe \
    --i-rna-sequences silva-138.1-ssu-nr99-rna-seqs.qza \
    --o-dna-sequences silva-138.1-ssu-nr99-seqs.qza

We are now ready to proceed with making our SILVA reference database within QIIME 2. First we’ll need to prepare the silva taxonomy prior to use. We’ll use parse-silva-taxonomy to do this. You can optionally include the --p-include-species-labels flag. But be wary, there are species label annotations that may be spurious! See the caveats about using species-labels later under the hidden menu labeled "Species-labels: caveat emptor!" below.

qiime rescript parse-silva-taxonomy \
    --i-taxonomy-tree taxtree-silva-138.1-nr99.qza \
    --i-taxonomy-map taxmap-silva-138.1-ssu-nr99.qza \
    --i-taxonomy-ranks taxranks-silva-138.1-ssu-nr99.qza \
    --o-taxonomy silva-138.1-ssu-nr99-tax.qza

Great work! We now have a properly formatted QIIME 2 compatible SILVA taxonomy and sequence files. We’ll use these for the remaining downstream steps.

:arrow_up: Return to Table of Contents

Getting SILVA data the easy way

We actually have a nice convenience method to perform all the tasks outlined in the Hard Mode section, using a single command!* Which is… :drum: :

qiime rescript get-silva-data \
    --p-version '138.1' \
    --p-target 'SSURef_NR99' \
    --o-silva-sequences silva-138.1-ssu-nr99-rna-seqs.qza \
    --o-silva-taxonomy silva-138.1-ssu-nr99-tax.qza

If you'd like to be able to jump to steps that only take FeatureData[Sequence] as input you can convert your data to FeatureData[Sequence] like so:

qiime rescript reverse-transcribe \
    --i-rna-sequences silva-138.1-ssu-nr99-rna-seqs.qza 
    --o-dna-sequences silva-138.1-ssu-nr99-seqs.qza

This pipeline will essentially perform all of the steps we outlined above in Hard Mode and return a ready-to-use sequence and taxonomy file for any other curation steps you’d like to perform.

Tip: If you'd like to select specific ranks for your SILVA taxonomy, simply add the --p-ranks flag to the either the parse-silva-taxonomy or get-silva-data commands. For example:
--p-ranks domain kingdom phylum class order suborder family subfamily genus

We highly recommend that you perform some additional quality-control on these files, as we’ll outline below. Feel free to alter the pipeline that best suits your needs. For more information please click on the triangles below to reveal more details:

Rank-Propagation

When using either parse-silva-taxonomy or get-silva-data, you'll notice that there is an option to "propagate" the taxonomy, i.e. (--p-rank-propagation or --p-no-rank-propagation). Here, we'll describe what rank propagation is doing.

RESCRIPt parses the taxonomic rank and lineage information from SILVA as shown in the first table below. Both the rank and its associated prefix are shown in the column headers. Later tables only use the rank prefixes.

All available rank information parsed by RESCRIPt (non-rank-propagated):

id domain (d__) superkingdom (sk__) kingdom (k__) subkingdom (ks__) superphylum (sp__) phylum (p__) subphylum (ps__) infraphylum (pi__) superclass (sc__) class (c__) subclass (cs__) infraclass (ci__) superorder (so__) order (o__) suborder (os__) superfamily (sf__) family (f__) subfamily (fs__) genus (g__)
AB671439.1.2071 d__Eukaryota sk__Nucletmycea k__Fungi ks__Dikarya sp__ p__Ascomycota ps__Pezizomycotina pi__ sc__ c__ cs__ ci__ so__ o__ os__ sf__ f__ fs__ g__
Z27393.1.1722 d__Eukaryota sk__Nucletmycea k__Fungi ks__Dikarya sp__ p__Ascomycota ps__Taphrinomycotina pi__ sc__ c__ cs__ ci__ so__ o__ os__ sf__ f__ fs__ g__
KY886363.1.1791 d__Eukaryota sk__Holozoa k__Animalia ks__ sp__Lophotrochozoa p__Rotifera ps__ pi__ sc__ c__Monogononta cs__ ci__ so__ o__Ploimida os__ sf__ f__ fs__ g__
AJ544654.1.1712 d__Eukaryota sk__ k__Stramenopiles ks__ sp__Ochrophyta p__Diatomea ps__Bacillariophytina pi__ sc__ c__Bacillariophyceae cs__ ci__ so__ o__ os__ sf__ f__Sellaphoraceae fs__ g__Sellaphora
A16379.1.1485 d__Bacteria sk__ k__ ks__ sp__ p__Proteobacteria ps__ pi__ sc__ c__Gammaproteobacteria cs__ ci__ so__ o__Pasteurellales os__ sf__ f__Pasteurellaceae fs__ g__Haemophilus
AJ888030.1.996 d__Archaea sk__ k__ ks__ sp__ p__Crenarchaeota ps__ pi__ sc__ c__Thermoprotei cs__ ci__ so__ o__Sulfolobales os__ sf__ f__Sulfolobaceae fs__ g__Acidianus

Note the inconsistancy of available rank-lineage information across the different organisms. If we decide to not propagate the taxonomy (i.e. the --p-no-rank-propagation flag ), and wish to only obtain the standard "6-rank" taxonomy (i.e. domain, phylum, class, order, family, genus). Then our resulting reference taxonomy would look like:

id d__ p__ c__ o__ f__ g__
AB671439.1.2071 d__Eukaryota p__Ascomycota c__ o__ f__ g__
Z27393.1.1722 d__Eukaryota p__Ascomycota c__ o__ f__ g__
KY886363.1.1791 d__Eukaryota p__Rotifera c__Monogononta o__Ploimida f__ g__
AJ544654.1.1712 d__Eukaryota p__Diatomea c__Bacillariophytina o__ f__Sellaphoraceae g__Sellaphora
A16379.1.1485 d__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Pasteurellales f__Pasteurellaceae g__Haemophilus
AJ888030.1.996 d__Archaea p__Crenarchaeota c__Thermoprotei o__Sulfolobales f__Sulfolobaceae g__Acidianus

However, if we use rank-propagation (i.e. the default --p--rank-propagation) the full rank-lineage table above will become:

Full rank-propagated taxonomy:

id d__ sk__ k__ ks__ sp__ p__ ps__ pi__ sc__ c__ cs__ ci__ so__ o__ os__ sf__ f__ fs__ g__
AB671439.1.2071 d__Eukaryota sk__Nucletmycea k__Fungi ks__Dikarya sp__Dikarya p__Ascomycota ps__Pezizomycotina pi__Pezizomycotina sc__Pezizomycotina c__Pezizomycotina cs__Pezizomycotina ci__Pezizomycotina so__Pezizomycotina o__Pezizomycotina os__Pezizomycotina sf__Pezizomycotina f__Pezizomycotina fs__Pezizomycotina g__Pezizomycotina
Z27393.1.1722 d__Eukaryota sk__Nucletmycea k__Fungi ks__Dikarya sp__Dikarya p__Ascomycota ps__Taphrinomycotina pi__Taphrinomycotina sc__Taphrinomycotina c__Taphrinomycotina cs__Taphrinomycotina ci__Taphrinomycotina so__Taphrinomycotina o__Taphrinomycotina os__Taphrinomycotina sf__Taphrinomycotina f__Taphrinomycotina fs__Taphrinomycotina g__Taphrinomycotina
KY886363.1.1791 d__Eukaryota sk__Holozoa k__Animalia ks__Animalia sp__Lophotrochozoa p__Rotifera ps__Rotifera pi__Rotifera sc__Rotifera c__Monogononta cs__Monogononta ci__Monogononta so__Monogononta o__Ploimida os__Ploimida sf__Ploimida f__Ploimida fs__Ploimida g__Ploimida
AJ544654.1.1712 d__Eukaryota sk__Eukaryota k__Stramenopiles ks__Stramenopiles sp__Ochrophyta p__Diatomea ps__Bacillariophytina pi__Bacillariophytina sc__Bacillariophytina c__Bacillariophyceae cs__Bacillariophyceae ci__Bacillariophyceae so__Bacillariophyceae o__Bacillariophyceae os__Bacillariophyceae sf__Bacillariophyceae f__Sellaphoraceae fs__Sellaphoraceae g__Sellaphora
A16379.1.1485 d__Bacteria sk__Bacteria k__Bacteria ks__Bacteria sp__Bacteria p__Proteobacteria ps__Proteobacteria pi__Proteobacteria sc__Proteobacteria c__Gammaproteobacteria cs__Gammaproteobacteria ci__Gammaproteobacteria so__Gammaproteobacteria o__Pasteurellales os__Pasteurellales sf__Pasteurellales f__Pasteurellaceae fs__Pasteurellaceae g__Haemophilus
AJ888030.1.996 d__Archaea sk__Archaea k__Archaea ks__Archaea sp__Archaea p__Crenarchaeota ps__Crenarchaeota pi__Crenarchaeota sc__Crenarchaeota c__Thermoprotei cs__Thermoprotei ci__Thermoprotei so__Thermoprotei o__Sulfolobales os__Sulfolobales sf__Sulfolobales f__Sulfolobaceae fs__Sulfolobaceae g__Acidianus

If we subset our 6-ranks from here, we'll return the following "6-rank" propagated table:

id d__ p__ c__ o__ f__ g__
AB671439.1.2071 d__Eukaryota p__Ascomycota c__Pezizomycotina o__Pezizomycotina f__Pezizomycotina g__Pezizomycotina
Z27393.1.1722 d__Eukaryota p__Ascomycota c__Taphrinomycotina o__Taphrinomycotina f__Taphrinomycotina g__Taphrinomycotina
KY886363.1.1791 d__Eukaryota p__Rotifera c__Monogononta o__Ploimida f__Ploimida g__Ploimida
AJ544654.1.1712 d__Eukaryota p__Diatomea c__Bacillariophyceae o__Bacillariophyceae f__Sellaphoraceae g__Sellaphora
A16379.1.1485 d__Bacteria p__Proteobacteria c__Gammaproteobacteria o__Pasteurellales f__Pasteurellaceae g__Haemophilus
AJ888030.1.996 d__Archaea p__Crenarchaeota c__Thermoprotei o__Sulfolobales f__Sulfolobaceae g__Acidianus

Notice anything interesting in our new "6-rank" taxonomy table? Where did those other lineage names come from? In the case that rank-propagation is used, ranks are propagated (forward-filled) using all available rank-taxonomy information present, prior to sub-setting the taxonomic rank information requested by the user. This is why the two "6-rank" taxonomy tables are slightly different. Rank-propagation helps when taxa do not share consistent rank-lineage information, as evidenced by observing the the full rank-lineage tables above. For example, some taxa have annotations for "superphylum" and other ranks which are not used / annotated. If we only propagated taxonomy after subsetting the ranks, then the resulting reference taxonomy for many organisms would be missing. Note the information (however limited) we gain from propagating the taxonomy for the fungal taxa.

The key take away here is that any ranks not associated with a taxonomic lineage have their upper-level taxonomy propagated downward (i.e. the values are forward filled with the last observed taxonomic value) towards lower-level ranks. This ensures general compatibility with downstream taxonomy classification tools, many of which may require non-empty fields at each rank.

Species-labels: caveat emptor!

:warning: Be wary of applying the --include-species-labels flag! This flag simply takes the source organism name of a given sequence and appends it as the "species" label. The examples below highlight a case where the source / host organism was used as the annotation for the species label, rather than the organism from which the sequence is actually derived from:

d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae; g__Serratia; s__Oryza_sativa
d__Eukaryota; p__Arthropoda; c__Insecta; o__Hemiptera; f__Hemiptera; g__Hemiptera; s__Oryza_sativa

More specifically, we have an insect and a bacterial sequence which are both annotated at the species-level with the scientific name of the host organism from which they were observed, Oryza sativa (rice). Users should be generally cautious about species labels. As far as we are aware, SILVA does not necessarily curate the species-level taxonomy. This is why we optionally provide access to this information. You have been warned!

Also note, we only return the first two “words” of the scientific / organism name. This is important as you may have (nearly) identical sequences that point to very slightly different species-label annotations, such as:

  • s__Clostridioides_difficile
  • s__Clostridioides_difficile_R20291

So, if your sequence is similar to these, you'd think it should be classified as s__Clostridioides_difficile. This will not be the case, as the species-label strings are different. What the classifier may actually return is the upper-level taxonomy g__Clostridioides. This is not the fault of the classifier per se, but a problem of annotation. Because of this we only return the first two words (i.e. Clostridioides and difficile) of the "species-label" string.

Tip: if you opted to not include species labels in your reference taxonomy (which may be spurious anyway), the generation of your SILVA classifier will be substantially quicker, while also requiring less memory and storage space!

:arrow_up: Return to Table of Contents

“Culling” low-quality sequences with cull-seqs

Here we’ll remove sequences that contain 5 or more ambiguous bases (IUPAC compliant ambiguity bases) and any homopolymers that are 8 or more bases in length. These are the default parameters. See the --help text for more details.

qiime rescript cull-seqs \
    --i-sequences silva-138.1-ssu-nr99-seqs.qza \
    --o-clean-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza

:arrow_up: Return to Table of Contents

Filtering sequences by length and taxonomy

Rather than blindly filter all of the reference sequences below a certain length, we'll differentially filter based on the taxonomy of the reference sequence. The reason: if we decide to remove any sequences below 1000 or 1200 bp, then many of the reference sequences associated with Archaea (and some Bacteria) will be lost. This will potentially increase the retention of shorter and lower-quality Bacterial or Eukaryal sequences. Ultimately causing undue database selection bias. So, we'll attempt to mitigate these issues by differentially filtering based on length. We will remove rRNA gene sequences that do not meet the following criteria: Archaea (16S) >= 900 bp, Bacteria (16S) >= 1200 bp, and any Eukaryota (18S) >= 1400 bp. See help text for more info.

qiime rescript filter-seqs-length-by-taxon \
    --i-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza \
    --i-taxonomy silva-138.1-ssu-nr99-tax.qza \
    --p-labels Archaea Bacteria Eukaryota \
    --p-min-lens 900 1200 1400 \
    --o-filtered-seqs silva-138.1-ssu-nr99-seqs-filt.qza \
    --o-discarded-seqs silva-138.1-ssu-nr99-seqs-discard.qza 

:arrow_up: Return to Table of Contents

Dereplication of sequences and taxonomy

Given the notes outlined for the SILVA 138.1 NR99 release, there may be identical full-length sequences with either identical or different taxonomies. We'll proceed to dereplicate this data before moving forward. This will help remove redundant sequence data from the database prior to downstream processing. RESCRIPt provide several options for sequence-taxonomy dereplication. Click on the triangle below for more information.

Dereplication considerations

Depending on your final use case for this taxonomic information, and the classifier you wish to build, there are some things to consider:

  • Do you plan to leverage tools like clawback?
  • Do you plan to make use of closed-reference clustering approaches?
  • Perhaps you simply would like a database in which there is one consensus taxonomy per unique sequence? Which might be required for other taxonomic classification tools.

The difference between these three considerations, and the way in which you can dereplicate your reference data, is as follows:

  • If you do plan to make use of tools such as q2-clawback, it would be ideal to keep sequences that are identical, despite having differing taxonomy. While any data that match in both their taxonomy and sequence will simply be dereplicated to reduce the database size. For more details about why this is helpful, see Kaehler et al. 2019. Use --p-mode uniq.

  • If you plan to make use of a variety of close-reference-like approaches, and/or simply would like your reference database to simply have a single consensus taxonomy for a given unique sequence... then complete dereplication of your reference sequences and taxonomy is required, i.e. you'll have to make a consensus taxonomy. There are two options for creating a consensus taxonomy for this use case: lca (lowest common ancestor), and majority (finds the most common taxonomic label associated with a given unique sequence).

If you take a peek at the taxonomy after dereplication, you may notice taxonomy strings that appear like this:

d__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__; g__; s__

In this case the consensus taxonomy, constructed by lca (lowest common ancestor), determined that d__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales is the LCA taxonomy. The lower-level taxonomy is then filled in with empty place-holders f__; g__; s__. This back-filling of rank placeholders is required feature-classifier fit-classifier-naive-bayes and feature-classifier classify-sklearn, as all rank positions must have a value.

Finally, If you wish to cluster the reference sequences prior to building your classifier, you can do so by adding the --p-perc-identity parameter. Although not always necessary, this is helpful for a couple of practical reasons, e.g. to save on memory and/or storage requirements. In the example below we construct a 97% reference sequence database with a ‘lca’ computed consensus taxonomy. You can do this my modifying the previous command as follows:

qiime rescript dereplicate \
    --i-sequences silva-138.1-ssu-nr99-seqs-filt.qza \
    --i-taxa silva-138.1-ssu-nr99-tax.qza \
    --p-perc-identity 0.97 \
    --p-mode 'lca' \
    --o-dereplicated-sequences silva-138.1-ssu-c97-seqs-derep-lca.qza \
    --o-dereplicated-taxa silva-138.1-ssu-c97-tax-derep-lca.qza

Dereplicating in uniq mode

Here we will use the default uniq approach. That is, we’ll retain identical sequence records that have differing taxonomies. We’ll specify the option here for the sake of clarity, but feel free to use any of the --p-mode options that make sense to you. You can also choose the taxonomic ranks you'd like to backfill using the --p-rank-handles flag. Which we won't set here, as the default, e.g.:
--p-rank-handles 'domain' 'phylum' 'class' 'order' 'family' 'genus' 'species'
will work for us.

qiime rescript dereplicate \
    --i-sequences silva-138.1-ssu-nr99-seqs-filt.qza  \
    --i-taxa silva-138.1-ssu-nr99-tax.qza \
    --p-mode 'uniq' \
    --o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza \
    --o-dereplicated-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza

:sparkler: Wonderful! :sparkler:
Now we are ready to make our classifier for use on full-length SSU sequences. :hammer_and_wrench:

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads  silva-138.1-ssu-nr99-seqs-derep-uniq.qza \
  --i-reference-taxonomy silva-138.1-ssu-nr99-tax-derep-uniq.qza \
  --o-classifier silva-138.1-ssu-nr99-classifier.qza

The RESCRIPt pipeline evaluate-fit-classifier can also be used to substitute fit-classifier-naive-bayes and evaluate the classifier all in one go. See below for more details.

:arrow_up: Return to Table of Contents

Make amplicon-region specific classifier

Here, we'll go into more detail about how to generate an amplicon-specific classifier. Constructing such a classifier allows for more robust taxonomic classification of your data (Werner et al. 2011, Bokulich et al. 2018).

Here, you'll use the same primer sequences that you used for your PCR / sequencing, to extract the amplicon region from our reference database. Note, we enter the primer sequences in the 5'-3' direction, i.e. as you would order the oligos from a vendor. Here, we'll extract the commonly used V4 EMP Primers, and use our pre-filtered full-length sequences from above. Also, note that we'll set --p-read-orientation 'forward', as the SILVA database is curated to be in the same "forward" orientation. This will allow us to process the data more quickly w/o having to account for mixed-orientation sequences during our primer search. We’ll continue by making use of the files from previous steps.

qiime feature-classifier extract-reads \
    --i-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza \
    --p-f-primer GTGYCAGCMGCCGCGGTAA \
    --p-r-primer GGACTACNVGGGTWTCTAAT \
    --p-n-jobs 2 \
    --p-read-orientation 'forward' \
    --o-reads silva-138.1-ssu-nr99-seqs-515f-806r.qza

Note / Tip: Depending on your goals, it may also be reasonable to use the raw imported sequences, or output from either cull-seqs or filter-seqs-length[-by-taxon] as input into the above extract-reads command. Be aware that you may need to run reverse-transcribe prior to other steps outlined in this tutorial.

:warning: There are some caveats to relying solely on using PCR primer pairs to extract your amplicon region of interest. Please see this tutorial for more information.

Dereplicate extracted region

Even though we already dereplicated our full-length sequences, we'll do so again as the extracted amplicon regions may now be identical over this shorter region.

  • We may have unique sequences that point to quite different taxonomies (e.g. different genera). Thus, we need to decide how we’d like to handle the taxonomy by choosing one of the available --p-mode options.
  • Conversely, we may have had many different full-length sequences with identical taxonomy, and now these records, after extracting the amplicon region, are identical. As a result of this, we can reduce our database size by simply dereplicating them, as we’ve done earlier.
qiime rescript dereplicate \
    --i-sequences silva-138.1-ssu-nr99-seqs-515f-806r.qza \
    --i-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza \
    --p-mode 'uniq' \
    --o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-515f-806r-uniq.qza \
    --o-dereplicated-taxa  silva-138.1-ssu-nr99-tax-515f-806r-derep-uniq.qza

:tada: Great! Now we can build our amplicon-region specific classifier. :hammer_and_wrench:

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads silva-138.1-ssu-nr99-seqs-515f-806r-uniq.qza \
    --i-reference-taxonomy silva-138.1-ssu-nr99-tax-515f-806r-derep-uniq.qza \
    --o-classifier silva-138.1-ssu-nr99-515f-806r-classifier.qza

Tip: The RESCRIPt pipeline evaluate-fit-classifier can also be used to substitute fit-classifier-naive-bayes and evaluate the classifier all in one go. See below for more details.

:arrow_up: Return to Table of Contents

Database (and sequence) evaluation functions

In addition to methods for generating and curating sequence reference databases, RESCRIPt has several methods for evaluating databases and database quality.

If you wish to follow along download the following GreenGenes 85% reference data files:

85_otus.qza: view | download
ref-taxonomy.qza view | download

:arrow_up: Return to Table of Contents

Sequence database evaluation with the `evaluate-*` family of actions

Note: training and classifying full datasets is a time-consuming task; these are powerful complete benchmark methods, not quick estimates. So be prepared to wait several hours to complete testing of large databases such as SILVA (and note that parallelization options are available for some methods). :alarm_clock:

One way to evaluate database quality is to classify the sequences in the database. RESCRIPt currently has two methods to do this:

Click to read details
  1. evaluate-cross-validate performs k-fold cross-validation to simulate classification tasks where query sequences are not present in the database and may not have an exact match. The database is split into K folds for testing; classifiers are trained K times on the remaining sequences; and the test set is classified using the corresponding classifier. This way each sequence is in the test set exactly 1 time and in the training set exactly K-1 times. Sequences are stratified so that members of the species are split between each training and test set, but in cases of singleton species RESCRIPt trims the taxonomic labels to reflect the best possible classification (usually, trimming the expected taxonomic classification to genus level if no other members of the species are present in the training set). This method outputs the predicted taxonomic classifications for each sequence, as well as the expected taxonomies (after stratification/trimming). These can be used as input to evaluate-classifications (see below) to estimate classifier accuracy.
  2. evaluate-fit-classifier trains and tests a single classifier on the entire set of sequences. This makes a “perfect classifier” (with intentional data leakage) to simulate the best possible classification accuracy, when all query sequences have a perfect match in the reference database. This will enable two things: (a) it provides an estimate of best possible accuracy using evaluate-classifications (see below) and (b) this outputs a trained classifier that is identical to the one that the q2-feature-classifier action fit-classifier-naive-bayes generates (that method is wrapped by RESCRIPt), and can be used for further classification tasks (instead of using q2-feature-classifier separately). Hence, this method generates a trained classifier AND gives you an estimate of its best possible accuracy.

For example, we can train a classifier using RESCRIPt (and q2-feature-classifier) with the following command as a replacement for the fit-classifier-naive-bayes command given above:

qiime rescript evaluate-fit-classifier \
    --i-sequences 85_otus.qza \
    --i-taxonomy ref-taxonomy.qza  \
    --o-classifier gg85-otu-taxonomy-classifier.qza \
    --o-observed-taxonomy gg85-otu-taxonomy-predicted-taxonomy.qza \
    --o-evaluation gg85-otu-taxonomy-fit-classifier-evaluation.qzv 

Output Visualization
gg85-otu-taxonomy-fit-classifier-evaluation.qzv: view | download (386.9 KB)

Using the Greengenes 85% OTUs is an easy classification task, so the results sure are boring! :sleeping: We see 100% classification accuracy, but this is an unrealistically simple task. If you don't mind waiting, try it on a more complex dataset like the Greengenes 97% OTUs.

The classifier gg85-otu-taxonomy-classifier.qza output by this command is identical to what would be generated by q2-feature-classifier fit-classifier-naive-bayes (since q2-feature-classifier is used in the pipeline!). The output of evaluate-classifications is generated as part of this pipeline, so you can see the simulated classification accuracy of our GreenGenes classifier in gg85-otu-taxonomy-fit-classifier-evaluation.qzv. The predicted taxonomy is also output in case you want to manually run evaluate-classifications or inspect the results, e.g., with qiime metadata tabulate.

:arrow_up: Return to Table of Contents

Evaluate classification accuracy with evaluate-classifications

This method takes a list of observed taxonomies and a corresponding list of expected taxonomies (which must be input in the same order) and calculates various metrics to evaluate classification accuracy. The command below replicates what was already done by the evaluate-fit-classifier pipeline above, to evaluate classification accuracy of our GreenGenes classifier. Note the labels parameter used below — this is just used to label each classification result corresponding to the inputs.

qiime rescript evaluate-classifications \
	--i-expected-taxonomies ref-taxonomy.qza \
	--i-observed-taxonomies gg85-otu-taxonomy-predicted-taxonomy.qza \
	--p-labels gg85 \
	--o-evaluation gg85-otu-taxonomy-classifier-evaluation.qzv

Output visualization
gg85-otu-taxonomy-classifier-evaluation.qzv: view | download (386.9 KB)
Note: this will be identical to the output of evaluate-fit-classifier. And equally boring, for the reasons explained above...

:arrow_up: Return to Table of Contents

Evaluate taxonomic information with `evaluate-taxonomy

This method provides information about the number, entropy, and other characteristics of taxonomic labels at each rank in an input taxonomy. It is much less time-consuming than the classification methods above, and provides valuable information about the amount of “information” in a taxonomy artifact. This is very useful for reference databases, but also potentially useful for biological results as well (e.g., to evaluate the depth and richness of taxonomic classifications). It can take multiple taxonomies as input, so let’s look at our observed and expected taxonomies from above to calculate various stats:

qiime rescript evaluate-taxonomy \
  --i-taxonomies ref-taxonomy.qza gg85-otu-taxonomy-predicted-taxonomy.qza \
  --p-labels gg85-taxonomy gg85-predicted-taxonomy \
  --o-taxonomy-stats gg85-taxonomy-evaluation.qzv

Output visualization
gg85-taxonomy-evaluation.qzv: view | download (387.7 KB)

:arrow_up: Return to Table of Contents

General sequence data operations

Many of the operations described in the SILVA formatting tutorial above can be applied to other reference databases in general. In addition, RESCRIPt contains several other actions that can be used for managing and manipulating both reference sequences and other sequences (e.g., experimentally derived observations).

:arrow_up: Return to Table of Contents

Filtering sequences by length

The filter-seqs-length method uses VSEARCH under the hood to quickly filter sequences globally by length. The filter-seqs-length-by-taxon method described above can also filter sequences by global min and max length thresholds, but is designed for more specific use cases, e.g., where different length thresholds are desired for different taxonomic groups. Sometimes a quick filter is desired, irrespective of taxonomy, and the filter-seqs-length method provides a quick, and much faster, fix.

This method is useful for both reference sequences as well as experimentally derived sequences — for example, this method could be used to filter out abnormally short or long sequences from a set of “representative sequences” (e.g., following denoising), as length outliers often indicate either non-target sequences or mismatched merging of paired-end reads.

Here is an example using the SILVA sequences from above.

qiime rescript filter-seqs-length \
	--i-sequences 85_otus.qza \
	--p-global-min 900 \
	--p-global-max 1500 \
	--o-filtered-seqs filtered-seqs.qza \
	--o-discarded-seqs discarded-seqs.qza

Sequences within the length thresholds will be saved to filtered-seqs.qza, and those that do not will be saved to discarded-seqs.qza just in case you want to take a look!

:arrow_up: Return to Table of Contents

Orient sequences by alignment to reference

Sometimes sequences wind up in mixed orientations, i.e., a single file can contain reads from both the forward and reverse strands of a DNA sequence. This is most common in query sequences derived from non-directionalized library preparation methods, but is occasionally seen in reference sequence collections that are compiled from different sources and/or have not been aligned. The orient-seqs method can orient these sequences by comparison to a reference sequence set known to be oriented in the correct direction.

qiime rescript orient-seqs \
	--i-sequences query-sequences.qza \
	--i-reference-sequences reference-sequences.qza \
	--o-oriented-seqs oriented-query-sequences.qza \
	--o-unmatched-seqs unmatched-sequences.qza

Any sequences that match are output in the correct(ed) orientation are saved to oriented-query-sequences.qza. Any sequences that do not match any sequences in the reference (and hence cannot be oriented) are saved to unmatched-sequences.qza for inspection. Note that this method has various parameters to control the alignment similarity thresholds (as orientation is performed by using VSEARCH global alignment under the hood).

:arrow_up: Return to Table of Contents

Reverse Transcribe

It is not uncommon to find reference sequences only provided in RNA form, as is the case with the parsing of the SILVA reference data we performed earlier. What actually happened, “behind the scenes”, was that the SILVA sequences were reverse transcribed from RNA to DNA. Did you notice the FeatureData[RNASequence] type when we manually imported the sequences at the beginning? We’ll take that file and reverse transcribe manually here.

qiime rescript reverse-transcribe \
    --i-rna-sequences silva-138.1-ssu-nr99-seqs.qza  \
    --o-dna-sequences silva-138.1-ssu-nr99-dna.qza

:arrow_up: Return to Table of Contents

De-gapping alignments

There may be an occasion where the sequence data you are interested in using, only exists in DNA alignment form. Perhaps you misplaced your original unaligned file, but still have the sequence alignment available. We provide a way in which you can ‘unalign’ or ‘degap’ your DNA alignment file. Keep your :eyes: peeled for future improvements, e.g. handling RNA alignments.

qiime rescript degap-seqs \
   --i-aligned-sequences my-seq-alignment.qza
   --o-degapped-sequences -my-degapped-seqs.qza

Tip: If you would like to also remove any sequences that are too short, you can set the --p-min-length option. This parameter is applied after de-gapping and is useful for, e.g., dropping sequences that are mostly gaps!

:arrow_up: Return to Table of Contents

Editing taxonomy

Often the taxonomy associated with a given sequence may be misannotated. This can occur for a variety of reasons ranging from simple mistakes, to inconsistent annotations / updates within taxonomic groups, through changes in nomenclatural rules to define taxonomy, and etc.

Typical use cases:

  • Edit the taxonomy prior to generating a taxonomic classifier.
  • Make taxonomy more "friendly" for plotting figures
    • Relabel non-official / un-accepted taxonomic labels, which are likely to cause confusion.
    • Truncate taxonomic names (e.g. based on a regular expression) so that the full lineage is not displayed in something like a heatmap.
  • Or any other reasons we've not listed here. :slight_smile:

The table below, for example, highlights a case in which the genus-level annotation for the species Salmonella enterica is inconsistent. The two labels are used g__Salmonella and g__Escherichia-Shigella.

Accession Taxonomy
AB680788.1.1466 d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Salmonella; s__Salmonella_enterica
AB680791.1.1466 d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Salmonella; s__Salmonella_enterica
... ...
CZLR01000032.33487.34870 d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia-Shigella; s__Salmonella_enterica
KM244788.1.1511 d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia-Shigella; s__uncultured_Salmonella
... ...

When reference sequences are identical (or nearly so), but have inconsistent taxonomic information, the ability of our reference database to accurately classify our query sequences decreases. In this case, we'd never be able to classify any of our S. enterica sequences to either the genus or species level. That is, our classifier would "see" the conflicting genus labels and opt to simply report the Lowest Common Ancestor (LCA) our query sequence. That is, our S. enterica sequences would be classified only to the family-level, i.e. Enterobacteriaceae. Normally, we'd have to manually edit these taxonomic labels so that they are all consistent, prior to training our taxonomic classifier.

Fortunately, we've made it easier to make such edits via rescript edit-taxonomy. This action takes two list of terms: search strings and replacement strings. These can be provided either at the command line or via a tab-delimited metadata file. Below we'll show two ways to "fix" the inconsistent genus labels for S. enterica.

For our example we'll convert all occurrences of g__Salmonella; to g__Escherichia-Shigella;, while also converting s__uncultured_Salmonella to s__uncultured_Escherichia-Shigella. Note, these are simply examples to highlight multiple search and replacement strings, do not take this as an authoritative statement of Salmonella taxonomy!

Command line example:

qiime rescript edit-taxonomy \
    --i-taxonomy my-taxonomy.qza \
    --p-search-strings g__Salmonella; s__uncultured_Salmonella \
    --p-replacement-strings g__Escherichia-Shigella; s__uncultured_Escherichia-Shigella \
    --o-edited-taxonomy my-taxonomy-fixed.qza

As you can see, each search term in --p-search-strings is replaced by its corresponding replacement term in --p-replacement-strings, separated by spaces. Yes, in this example, the ; is indeed part of the search string itself, it is not a delimiter.)

:warning: Note any taxonomic labels that contain spaces should be encapsulated in quotes on the command line (not necessary if provided within a tab-delimited file). That is, if you really wanted to replace something like p__BRC1 with p__Incertae sedis and g__ with g__Candidatus Sumerlaea, then you'd enter the following:

  ...
    --p-search-strings 'p__BRC1' 'g__' \
    --p-replacement-strings 'p__Incertae sedis' 'g__Candidatus Sumerlaea' \
  ...

Alternatively, you can provide the search and replacement strings as a metadata file, in the form of:

id replacements
g__Salmonella; g__Escherichia-Shigella;
s__uncultured_Salmonella s__uncultured_Escherichia-Shigella

Note: the first column must contain one of the allowed identifier column headers such as id. This column contains all of the strings to search for. The second column can be of any name, and contains all the corresponding replacements strings.

When using the metadata file approach our command would then become:

qiime rescript edit-taxonomy \
    --i-taxonomy my-taxonomy.qza \
    --m-replacement-map-file taxonomy-replacements.tsv \
    --m-replacement-map-column replacements \
    --o-edited-taxonomy my-taxonomy-fixed.qza

Tip: if search and replace terms are provided via both the command line and a metadata file, they will be combined. Also, note that the search and replace terms will only be recorded in provenance when using the --p-search-strings and --p-replacement-strings at the command line. Finally, for more advanced users, regular expressions can be enabled via the --p-use-regex flag.

Now you have consistent taxonomic labels for your a few of your reference sequences! :tada:

:arrow_up: Return to Table of Contents

Merging taxonomy results

The merge-taxa method supports merging two or more taxonomy artifacts into a single taxonomy artifact based on different criteria. Currently, the merging modes supported are:

  1. lca, which creates a consensus taxonomy by finding the least common ancestor of any intersecting taxonomy strings in the inputs.
  2. len, which selects the taxonomy with the most elements (e.g., if one taxonomy is assigned to species level and the other is not)
  3. score, which selects the taxonomy with the top score for any intersecting taxonomy strings in the inputs.

This method is useful for creating “ensemble” taxonomic classifications, i.e., you can taxonomically classify your query sequences with two or more methods and then find a consensus taxonomy between them (e.g., using lca mode).

More examples of this method will be added later. For now, see qiime rescript merge-taxa --help to learn more about this method.

:arrow_up: Return to Table of Contents

Visualization gallery

Several actions available in RESCRIPt generate interactive visualizations for exploring results. Below are some **static** renderings of examples used in this tutorial.

evaluate-classifications

Explore the visualization:
gg85-otu-taxonomy-classifier-evaluation.qzv: view | download (386.9 KB)

evaluate-fit-classifier

Explore the visualization:
gg85-otu-taxonomy-fit-classifier-evaluation.qzv: view | download (386.9 KB)

evaluate-taxonomy

Explore the visualization:
gg85-taxonomy-evaluation.qzv: view | download (387.7 KB)

:arrow_up: Return to Table of Contents

32 Likes

This is just perfect. Thank you all!

2 Likes

A couple of super short observations @SoilRotifer @Nicholas_Bokulich as I'm following this tutorial for COI work. The block of code following the following section needs a couple of tiny adjustments:

As it currently is shown:

qiime rescript dereplicate \
    --i-sequences silva-138-ssu-nr99-seqs-filt.qza \
    --i-taxa silva-138-ssu-nr99-tax.qza \
    --i--p-perc-identity 97 \
...

It's just a type with the last line I've shown in the code block above ^^.

Replace --i--p-perc-identity with --p-perc-identity like this:

qiime rescript dereplicate \
    --i-sequences silva-138-ssu-nr99-seqs-filt.qza \
    --i-taxa silva-138-ssu-nr99-tax.qza \
    --p-perc-identity 97 \
...

Along the same lines, the expected value in the --p-perc-identity is now expected to be a value between 0 and 1, instead of an integer from 0 to 100. So in reality, the code I'd need to run would be:

qiime rescript dereplicate \
    --i-sequences silva-138-ssu-nr99-seqs-filt.qza \
    --i-taxa silva-138-ssu-nr99-tax.qza \
    --p-perc-identity 0.97 \
...

Thanks for the great tool - more to come on my end soon with the COI tests.

5 Likes

Nice catch @devonorourke!

5 posts were split to a new topic: Importing BOLD COI sequence data into QIIME 2 / RESCRIPt

Update: if working with 1.8 million COI sequences, be prepared to wait about 12 days...
:sleeping:

5 Likes

A post was split to a new topic: RESCRIPt get-ncbi-data timeout error

Hello, great job.
Did you distinguish between 'Consensus' or 'Majority Taxonomies'as described in the Qiime2-formatted SILVA 132 release notes?

Best regards,

Hi @arwqiime,

If you run qiime rescript dereplicate --help you can read the descriptions on how each of the dereplication modes work on the sequences and taxonomy.

--p-mode TEXT Choices('uniq', 'lca', 'majority', 'super')
                          How to handle dereplication when sequences map to
                          distinct taxonomies. "uniq" will retain all
                          sequences with unique taxonomic affiliations. "lca"
                          will find the least common ancestor among all taxa
                          sharing a sequence. "majority" will find the most
                          common taxonomic label associated with that
                          sequence; note that in the event of a tie,
                          "majority" will pick the winner arbitrarily. "super"
                          finds the LCA consensus while giving preference to
                          majority labels and collapsing substrings into
                          superstrings. For example, when a more specific
                          taxonomy does not contradict a less specific
                          taxonomy, the more specific is chosen. That is,
                          "g__Faecalibacterium; s__prausnitzii", will be
                          preferred over "g__Faecalibacterium; s__"

For pre-made classifiers we used the uniq setting.

-Mike

5 Likes

A post was split to a new topic: how to convert FeatureData[RNASequence] to FeatureData[Sequence]

3 posts were split to a new topic: creating a 12-rank SILVA taxonomy with RESCRIPt

A post was split to a new topic: Creating a V3 classifier killed.

An off-topic reply has been split into a new topic: Testing feature classifier accuracy

Please keep replies on-topic in the future.

2 off-topic replies have been split into a new topic: Should I cluster my reference sequences to 97% for my classifier?

Please keep replies on-topic in the future.

An off-topic reply has been split into a new topic: Is there a way to parallelize evaluate-fit-classifier?

Please keep replies on-topic in the future.

An off-topic reply has been split into a new topic: Modifying taxonomic annotation from RESCRIPt

Please keep replies on-topic in the future.

An off-topic reply has been split into a new topic: Slightly different taxa with regional and full length taxonomy classifiers

Please keep replies on-topic in the future.

An off-topic reply has been split into a new topic: rescript reverse-transcribe error: file not found

Please keep replies on-topic in the future.

There is a typo in the command, it should be --p-replacement-strings not --p-replacementS-strings

1 Like

:man_facepalming:

Thank you for pointing this out @lxsteiner ! :pray: