Please consider this tutorial a living document, which may change based upon community feedback and ongoing plugin development.
RESCRIPt (REference Sequence annotation and CuRatIon Pipeline) is a python package and QIIME 2 plugin for formatting, managing, and manipulating sequence reference databases. This package was designed for compiling, manipulating, and evaluating sequence reference databases from SILVA, NCBI, Greengenes, GTDB, and other sources, and for constructing reference databases for use with QIIME 2 (or other microbiome analysis software and taxonomy classifiers). This tutorial describes several primary pipelines and actions available in RESCRIPt, with a focus on the SILVA 16S rRNA gene database.
Note: This tutorial focuses on use of RESCRIPt with SILVA data; see here for other tutorials using NCBI, COI data, and more!
If you use RESCRIPt or any RESCRIPt-processed data in your research, please cite the following pre-print:
- Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581.; doi: 10.1371/journal.pcbi.1009581
Provenance tracked sequence database generation. Woohoo!
Table of Contents
Preparing the SILVA reference database
a. Getting SILVA data the easy way
b. “Culling” low-quality sequences with cull-seqs
c. Filtering sequences by length and taxonomy
d. Dereplication of sequences and taxonomy
e. Make amplicon-region specific classifier
Database (and sequence) evaluation functions
a. Sequence database evaluation with the
evaluate-*family of actions
b. Evaluate classification accuracy with
c. Evaluate taxonomic information with
General sequence and taxonomy data operations
a. Filtering sequences by length
b. Orient sequences by alignment to reference
c. Reverse Transcribe
d. De-gapping alignments
e. Editing taxonomy
- Merging taxonomy results
- Visualization gallery
Note: The tutorial below uses primarily SILVA data by way of comparison, but note that all of the steps demonstrated beyond step
1acan be applied to any sequence reference data.
Preparing the SILVA reference database
We'll use RESCRIPt to prepare a QIIME 2 compatible SSU SILVA reference database based on the curated NR99 (version 138.1) database. We chose this version mainly for the reasons outlined here.
- Below is a simple outline of the steps involved for constructing a QIIME 2 compatible reference from SILVA.
- Begin by downloading the relevant taxonomy and sequence files from the SILVA.
- Import these files into QIIME 2.
- Prepare a fixed-rank taxonomy file.
- Remove sequences with excessive degenerate bases and homopolymers.
- Remove sequences that may be too short and/or long. With the option to condition the length filtering based on taxonomy.
- Dereplicate the sequences and taxonomy.
- Build our classifier.
NOTE: pre-processed SILVA sequence and taxonomy Artifacts (and taxonomy classifiers) have been generated and released by the QIIME 2 team, following the same steps described below. You can get the pre-processed Artifacts and classifiers here: Data resources — QIIME 2 2022.2.0 documentation
For purposes of this tutorial, we’ll use the current SILVA SSU release (version 138.1). There are two approaches you can use to import and process the SILVA reference taxonomy and sequences for use in QIIME 2. We’ll start with “Getting SILVA data the easy way”. However, we recommend that you at least read through the "hard-way" steps to understand what RESCRIPt is doing, and how the SILVA taxonomy is parsed).
The gritty details
First, we’ll need to go to the SILVA v138.1 archive to obtain:
the following taxonomy files:
the sequence file:
You can download the files through your browser directly. We’ll make use of
wget to download the files from the command line, then
gunzip these files prior to importing into QIIME 2.
Download the Taxonomy Rank file. This maps the taxonomic rank and taxonomy to the taxid.
wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/taxonomy/tax_slv_ssu_138.1.txt.gz gunzip tax_slv_ssu_138.1.txt.gz
Download the Taxonomy Map file. This maps the sequence Accessions to the Organism Name and Taxonomy IDs.
wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/taxonomy/taxmap_slv_ssu_ref_nr_138.1.txt.gz gunzip taxmap_slv_ssu_ref_nr_138.1.txt.gz
Download the Taxonomy Tree file. This file contains the hierarchical relationship of the taxonomy IDs in tree form.
wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/taxonomy/tax_slv_ssu_138.1.tre.gz gunzip tax_slv_ssu_138.1.tre.gz
Download the SILVA NR99 sequences (non-redundant and unaligned)
wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz gunzip SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta.gz
Import the Taxonomy Rank file:
qiime tools import \ --type 'FeatureData[SILVATaxonomy]' \ --input-path tax_slv_ssu_138.1.txt \ --output-path taxranks-silva-138.1-ssu-nr99.qza \
Import the Taxonomy Mapping file
qiime tools import \ --type 'FeatureData[SILVATaxidMap]' \ --input-path taxmap_slv_ssu_ref_nr_138.1.txt \ --output-path taxmap-silva-138.1-ssu-nr99.qza \
Import the Taxonomy Hierarchy Tree file:
qiime tools import \ --type 'Phylogeny[Rooted]' \ --input-path tax_slv_ssu_138.1.tre \ --output-path taxtree-silva-138.1-nr99.qza \
Import the sequence file:
qiime tools import \ --type 'FeatureData[RNASequence]' \ --input-path SILVA_138.1_SSURef_NR99_tax_silva_trunc.fasta \ --output-path silva-138.1-ssu-nr99-rna-seqs.qza
Note, the data exist within SILVA as RNA sequences, and thus have been imported as
FeatureData[RNASequence]. To make sure things run smoothly downstream we'll convert the data to
FeatureData[DNASequence] like so:
qiime rescript reverse-transcribe \ --i-rna-sequences silva-138.1-ssu-nr99-rna-seqs.qza \ --o-dna-sequences silva-138.1-ssu-nr99-seqs.qza
We are now ready to proceed with making our SILVA reference database within QIIME 2. First we’ll need to prepare the silva taxonomy prior to use. We’ll use
parse-silva-taxonomy to do this and use the optional flag to include the species labels. But be wary, there are species label annotations that may be spurious! See the caveats about using species-labels later in this document.
qiime rescript parse-silva-taxonomy \ --i-taxonomy-tree taxtree-silva-138.1-nr99.qza \ --i-taxonomy-map taxmap-silva-138.1-ssu-nr99.qza \ --i-taxonomy-ranks taxranks-silva-138.1-ssu-nr99.qza \ --p-include-species-labels \ --o-taxonomy silva-138.1-ssu-nr99-tax.qza
Great work! We now have a properly formatted QIIME 2 compatible SILVA taxonomy and sequence files. We’ll use these for the remaining downstream steps.
Getting SILVA data the easy way
We actually have a nice convenience method to perform all the tasks outlined in the Hard Mode section, using a single command!* Which is… :
qiime rescript get-silva-data \ --p-version '138.1' \ --p-target 'SSURef_NR99' \ --p-include-species-labels \ --o-silva-sequences silva-138.1-ssu-nr99-rna-seqs.qza \ --o-silva-taxonomy silva-138.1-ssu-nr99-tax.qza
If you'd like to be able to jump to steps that only take
FeatureData[Sequence] as input you can convert your data to
FeatureData[Sequence] like so:
qiime rescript reverse-transcribe \ --i-rna-sequences silva-138.1-ssu-nr99-rna-seqs.qza --o-dna-sequences silva-138.1-ssu-nr99-seqs.qza
This pipeline will essentially perform all of the steps we outlined above in Hard Mode and return a ready-to-use sequence and taxonomy file for any other curation steps you’d like to perform.
Tip: If you'd like to select specific ranks for your SILVA taxonomy, simply add the
--p-ranksflag to the either the
get-silva-datacommands. For example:
--p-ranks domain kingdom phylum class order suborder family subfamily genus
We highly recommend that you perform some additional quality-control on these files, as we’ll outline below. Feel free to alter the pipeline that best suits your needs. Also note we are using the
--p-include-species-labels flag. For more information please see the hidden details below:
When using either
get-silva-data, you'll notice that there is an option to "propagate" the taxonomy, i.e. (
--p-no-rank-propagation). Here, we'll describe what rank propagation is doing.
RESCRIPt parses the taxonomic rank and lineage information from SILVA as shown in the first table below. Both the rank and its associated prefix are shown in the column headers. Later tables only use the rank prefixes.
All available rank information parsed by RESCRIPt (non-rank-propagated):
|id||domain (d__)||superkingdom (sk__)||kingdom (k__)||subkingdom (ks__)||superphylum (sp__)||phylum (p__)||subphylum (ps__)||infraphylum (pi__)||superclass (sc__)||class (c__)||subclass (cs__)||infraclass (ci__)||superorder (so__)||order (o__)||suborder (os__)||superfamily (sf__)||family (f__)||subfamily (fs__)||genus (g__)|
Note the inconsistancy of available rank-lineage information across the different organisms. If we decide to not propagate the taxonomy (i.e. the
--p-no-rank-propagation flag ), and wish to only obtain the standard "6-rank" taxonomy (i.e. domain, phylum, class, order, family, genus). Then our resulting reference taxonomy would look like:
However, if we use rank-propagation (i.e. the default
--p--rank-propagation) the full rank-lineage table above will become:
Full rank-propagated taxonomy:
If we subset our 6-ranks from here, we'll return the following "6-rank" propagated table:
Notice anything interesting in our new "6-rank" taxonomy table? Where did those other lineage names come from? In the case that rank-propagation is used, ranks are propagated (forward-filled) using all available rank-taxonomy information present, prior to sub-setting the taxonomic rank information requested by the user. This is why the two "6-rank" taxonomy tables are slightly different. Rank-propagation helps when taxa do not share consistent rank-lineage information, as evidenced by observing the the full rank-lineage tables above. For example, some taxa have annotations for "superphylum" and other ranks which are not used / annotated. If we only propagated taxonomy after subsetting the ranks, then the resulting reference taxonomy for many organisms would be missing. Note the information (however limited) we gain from propagating the taxonomy for the fungal taxa.
The key take away here is that any ranks not associated with a taxonomic lineage have their upper-level taxonomy propagated downward (i.e. the values are forward filled with the last observed taxonomic value) towards lower-level ranks. This ensures general compatibility with downstream taxonomy classification tools, many of which may require non-empty fields at each rank.
Species-labels: caveat emptor!
The examples below highlight a case where the source / host organism was used as the annotation for the species label, rather than the organism from which the sequence is actually derived from:
d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae; g__Serratia; s__Oryza_sativa
d__Eukaryota; p__Arthropoda; c__Insecta; o__Hemiptera; f__Hemiptera; g__Hemiptera; s__Oryza_sativa
More specifically, we have an insect and a bacterial sequence which are both annotated at the species-level with the scientific name of the host organism from which they were observed, Oryza sativa (rice). Users should be generally cautious about species labels. As far as we are aware, SILVA does not necessarily curate the species-level taxonomy. This is why we optionally provide access to this information. You have been warned.
Also note, we only return the first two “words” of the scientific / organism name. This is important as you may have (nearly) identical sequences that point to very slightly different species-label annotations, such as:
So, if your sequence is similar to these, you'd think it should be classified as s__Clostridioides_difficile. This will not be the case, as the species-label strings are different. What the classifier may actually return is the upper-level taxonomy g__Clostridioides. This is not the fault of the classifier per se, but a problem of annotation. Because of this we only return the first two words (i.e. Clostridioides and difficile) of the "species-label" string.
Tip: if you opted to not include species labels in your reference taxonomy (which may be spurious anyway), the generation of your SILVA classifier will be substantially quicker, while also requiring less memory and storage space!
“Culling” low-quality sequences with cull-seqs
Here we’ll remove sequences that contain 5 or more ambiguous bases (IUPAC compliant ambiguity bases) and any homopolymers that are 8 or more bases in length. These are the default parameters. See the
--help text for more details.
qiime rescript cull-seqs \ --i-sequences silva-138.1-ssu-nr99-seqs.qza \ --o-clean-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza
Filtering sequences by length and taxonomy
Rather than blindly filter all of the reference sequences below a certain length, we'll differentially filter based on the taxonomy of the reference sequence. The reason: if we decide to remove any sequences below 1000 or 1200 bp, then many of the reference sequences associated with Archaea (and some Bacteria) will be lost. This will potentially increase the retention of shorter and lower-quality Bacterial or Eukaryal sequences. Ultimately causing undue database selection bias. So, we'll attempt to mitigate these issues by differentially filtering based on length. We will remove rRNA gene sequences that do not meet the following criteria: Archaea (16S) >= 900 bp, Bacteria (16S) >= 1200 bp, and any Eukaryota (18S) >= 1400 bp. See help text for more info.
qiime rescript filter-seqs-length-by-taxon \ --i-sequences silva-138.1-ssu-nr99-seqs-cleaned.qza \ --i-taxonomy silva-138.1-ssu-nr99-tax.qza \ --p-labels Archaea Bacteria Eukaryota \ --p-min-lens 900 1200 1400 \ --o-filtered-seqs silva-138.1-ssu-nr99-seqs-filt.qza \ --o-discarded-seqs silva-138.1-ssu-nr99-seqs-discard.qza
Dereplication of sequences and taxonomy
Given the notes outlined for the SILVA 138.1 NR99 release, there may be identical full-length sequences with either identical or different taxonomies. We'll proceed to dereplicate this data before moving forward. This will help remove redundant sequence data from the database prior to downstream processing. RESCRIPt provide several options for sequence-taxonomy dereplication. See the hidden details below for more information.
Depending on your final use case for this taxonomic information, and the classifier you wish to build, there are some things to consider:
- Do you plan to leverage tools like clawback?
- Do you plan to make use of closed-reference clustering approaches?
- Perhaps you simply would like a database in which there is one consensus taxonomy per unique sequence? Which might be required for other taxonomic classification tools.
The difference between these three considerations, and the way in which you can dereplicate your reference data, is as follows:
If you do plan to make use of tools such as q2-clawback, it would be ideal to keep sequences that are identical, despite having differing taxonomy. While any data that match in both their taxonomy and sequence will simply be dereplicated to reduce the database size. For more details about why this is helpful, see Kaehler et al. 2019. Use
If you plan to make use of a variety of close-reference-like approaches, and/or simply would like your reference database to simply have a single consensus taxonomy for a given unique sequence... then complete dereplication of your reference sequences and taxonomy is required, i.e. you'll have to make a consensus taxonomy. There are two options for creating a consensus taxonomy for this use case:
lca(lowest common ancestor), and
majority(finds the most common taxonomic label associated with a given unique sequence).
If you take a peek at the taxonomy after dereplication, you may notice taxonomy strings that appear like this:
d__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__; g__; s__
In this case the consensus taxonomy, constructed by
lca (lowest common ancestor), determined that
d__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales is the LCA taxonomy. The lower-level taxonomy is then filled in with empty place-holders
f__; g__; s__. This back-filling of rank placeholders is required
feature-classifier fit-classifier-naive-bayes and
feature-classifier classify-sklearn, as all rank positions must have a value.
Finally, If you wish to cluster the reference sequences prior to building your classifier, you can do so by adding the
--p-perc-identity parameter. Although not always necessary, this is helpful for a couple of practical reasons, e.g. to save on memory and/or storage requirements. In the example below we construct a 97% reference sequence database with a ‘lca’ computed consensus taxonomy. You can do this my modifying the previous command as follows:
qiime rescript dereplicate \ --i-sequences silva-138.1-ssu-nr99-seqs-filt.qza \ --i-taxa silva-138.1-ssu-nr99-tax.qza \ --p-perc-identity 0.97 \ --p-rank-handles 'silva' \ --p-mode 'lca' \ --o-dereplicated-sequences silva-138.1-ssu-c97-seqs-derep-lca.qza \ --o-dereplicated-taxa silva-138.1-ssu-c97-tax-derep-lca.qza
Here we will use the default
uniq approach. That is, we’ll retain identical sequence records that have differing taxonomies. We’ll specify the option here for the sake of clarity, but feel free to use any of the
--p-mode options that make sense to you.
qiime rescript dereplicate \ --i-sequences silva-138.1-ssu-nr99-seqs-filt.qza \ --i-taxa silva-138.1-ssu-nr99-tax.qza \ --p-rank-handles 'silva' \ --p-mode 'uniq' \ --o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza \ --o-dereplicated-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza
Now we are ready to make our classifier for use on full-length SSU sequences.
qiime feature-classifier fit-classifier-naive-bayes \ --i-reference-reads silva-138.1-ssu-nr99-seqs-derep-uniq.qza \ --i-reference-taxonomy silva-138.1-ssu-nr99-tax-derep-uniq.qza \ --o-classifier silva-138.1-ssu-nr99-classifier.qza
The RESCRIPt pipeline
evaluate-fit-classifier can also be used to substitute
fit-classifier-naive-bayes and evaluate the classifier all in one go. See below for more details.
Make amplicon-region specific classifier
Here, we'll go into more detail about how to generate an amplicon-specific classifier. Constructing such a classifier allows for more robust taxonomic classification of your data (Werner et al. 2011, Bokulich et al. 2018).
Here, you'll use the same primer sequences that you used for your PCR / sequencing, to extract the amplicon region from our reference database. Note, we enter the primer sequences in the 5'-3' direction, i.e. as you would order the oligos from a vendor. Here, we'll extract the commonly used V4 EMP Primers, and use our pre-filtered full-length sequences from above. Also, note that we'll set
--p-read-orientation 'forward', as the SILVA database is curated to be in the same "forward" orientation. This will allow us to process the data more quickly w/o having to account for mixed-orientation sequences during our primer search. We’ll continue by making use of the files from previous steps.
qiime feature-classifier extract-reads \ --i-sequences silva-138.1-ssu-nr99-seqs-derep-uniq.qza \ --p-f-primer GTGYCAGCMGCCGCGGTAA \ --p-r-primer GGACTACNVGGGTWTCTAAT \ --p-n-jobs 2 \ --p-read-orientation 'forward' \ --o-reads silva-138.1-ssu-nr99-seqs-515f-806r.qza
Note / Tip: Depending on your goals, it may also be reasonable to use the raw imported sequences, or output from either
filter-seqs-length[-by-taxon]as input into the above
extract-readscommand. Be aware that you may need to run
reverse-transcribeprior to other steps outlined in this tutorial.
Even though we already dereplicated our full-length sequences, we'll do so again as the extracted amplicon regions may now be identical over this shorter region.
- We may have unique sequences that point to quite different taxonomies (e.g. different genera). Thus, we need to decide how we’d like to handle the taxonomy by choosing one of the available
- Conversely, we may have had many different full-length sequences with identical taxonomy, and now these records, after extracting the amplicon region, are identical. As a result of this, we can reduce our database size by simply dereplicating them, as we’ve done earlier.
qiime rescript dereplicate \ --i-sequences silva-138.1-ssu-nr99-seqs-515f-806r.qza \ --i-taxa silva-138.1-ssu-nr99-tax-derep-uniq.qza \ --p-rank-handles 'silva' \ --p-mode 'uniq' \ --o-dereplicated-sequences silva-138.1-ssu-nr99-seqs-515f-806r-uniq.qza \ --o-dereplicated-taxa silva-138.1-ssu-nr99-tax-515f-806r-derep-uniq.qza
Great! Now we can build our amplicon-region specific classifier.
qiime feature-classifier fit-classifier-naive-bayes \ --i-reference-reads silva-138.1-ssu-nr99-seqs-515f-806r-uniq.qza \ --i-reference-taxonomy silva-138.1-ssu-nr99-tax-515f-806r-derep-uniq.qza \ --o-classifier silva-138.1-ssu-nr99-515f-806r-classifier.qza
Tip: The RESCRIPt pipeline
evaluate-fit-classifiercan also be used to substitute
fit-classifier-naive-bayesand evaluate the classifier all in one go. See below for more details.
Database (and sequence) evaluation functions
In addition to methods for generating and curating sequence reference databases, RESCRIPt has several methods for evaluating databases and database quality.
If you wish to follow along download the following GreenGenes 85% reference data files:
Sequence database evaluation with the `evaluate-*` family of actions
Note: training and classifying full datasets is a time-consuming task; these are powerful complete benchmark methods, not quick estimates. So be prepared to wait several hours to complete testing of large databases such as SILVA (and note that parallelization options are available for some methods).
One way to evaluate database quality is to classify the sequences in the database. RESCRIPt currently has two methods to do this:
Click to read details
evaluate-cross-validateperforms k-fold cross-validation to simulate classification tasks where query sequences are not present in the database and may not have an exact match. The database is split into K folds for testing; classifiers are trained K times on the remaining sequences; and the test set is classified using the corresponding classifier. This way each sequence is in the test set exactly 1 time and in the training set exactly K-1 times. Sequences are stratified so that members of the species are split between each training and test set, but in cases of singleton species RESCRIPt trims the taxonomic labels to reflect the best possible classification (usually, trimming the expected taxonomic classification to genus level if no other members of the species are present in the training set). This method outputs the predicted taxonomic classifications for each sequence, as well as the expected taxonomies (after stratification/trimming). These can be used as input to
evaluate-classifications(see below) to estimate classifier accuracy.
evaluate-fit-classifiertrains and tests a single classifier on the entire set of sequences. This makes a “perfect classifier” (with intentional data leakage) to simulate the best possible classification accuracy, when all query sequences have a perfect match in the reference database. This will enable two things: (a) it provides an estimate of best possible accuracy using
evaluate-classifications(see below) and (b) this outputs a trained classifier that is identical to the one that the q2-feature-classifier action
fit-classifier-naive-bayesgenerates (that method is wrapped by RESCRIPt), and can be used for further classification tasks (instead of using q2-feature-classifier separately). Hence, this method generates a trained classifier AND gives you an estimate of its best possible accuracy.
For example, we can train a classifier using RESCRIPt (and q2-feature-classifier) with the following command as a replacement for the
fit-classifier-naive-bayes command given above:
qiime rescript evaluate-fit-classifier \ --i-sequences 85_otus.qza \ --i-taxonomy ref-taxonomy.qza \ --o-classifier gg85-otu-taxonomy-classifier.qza \ --o-observed-taxonomy gg85-otu-taxonomy-predicted-taxonomy.qza \ --o-evaluation gg85-otu-taxonomy-fit-classifier-evaluation.qzv
Using the Greengenes 85% OTUs is an easy classification task, so the results sure are boring! We see 100% classification accuracy, but this is an unrealistically simple task. If you don't mind waiting, try it on a more complex dataset like the Greengenes 97% OTUs.
gg85-otu-taxonomy-classifier.qza output by this command is identical to what would be generated by q2-feature-classifier
fit-classifier-naive-bayes (since q2-feature-classifier is used in the pipeline!). The output of
evaluate-classifications is generated as part of this pipeline, so you can see the simulated classification accuracy of our GreenGenes classifier in
gg85-otu-taxonomy-fit-classifier-evaluation.qzv. The predicted taxonomy is also output in case you want to manually run
evaluate-classifications or inspect the results, e.g., with
qiime metadata tabulate.
Evaluate classification accuracy with evaluate-classifications
This method takes a list of observed taxonomies and a corresponding list of expected taxonomies (which must be input in the same order) and calculates various metrics to evaluate classification accuracy. The command below replicates what was already done by the
evaluate-fit-classifier pipeline above, to evaluate classification accuracy of our GreenGenes classifier. Note the
labels parameter used below — this is just used to label each classification result corresponding to the inputs.
qiime rescript evaluate-classifications \ --i-expected-taxonomies ref-taxonomy.qza \ --i-observed-taxonomies gg85-otu-taxonomy-predicted-taxonomy.qza \ --p-labels gg85 \ --o-evaluation gg85-otu-taxonomy-classifier-evaluation.qzv
gg85-otu-taxonomy-classifier-evaluation.qzv: view | download (386.9 KB)
Note: this will be identical to the output of
evaluate-fit-classifier. And equally boring, for the reasons explained above...
Evaluate taxonomic information with `evaluate-taxonomy
This method provides information about the number, entropy, and other characteristics of taxonomic labels at each rank in an input taxonomy. It is much less time-consuming than the classification methods above, and provides valuable information about the amount of “information” in a taxonomy artifact. This is very useful for reference databases, but also potentially useful for biological results as well (e.g., to evaluate the depth and richness of taxonomic classifications). It can take multiple taxonomies as input, so let’s look at our observed and expected taxonomies from above to calculate various stats:
qiime rescript evaluate-taxonomy \ --i-taxonomies ref-taxonomy.qza gg85-otu-taxonomy-predicted-taxonomy.qza \ --p-labels gg85-taxonomy gg85-predicted-taxonomy \ --o-taxonomy-stats gg85-taxonomy-evaluation.qzv
General sequence data operations
Many of the operations described in the SILVA formatting tutorial above can be applied to other reference databases in general. In addition, RESCRIPt contains several other actions that can be used for managing and manipulating both reference sequences and other sequences (e.g., experimentally derived observations).
Filtering sequences by length
filter-seqs-length method uses VSEARCH under the hood to quickly filter sequences globally by length. The
filter-seqs-length-by-taxon method described above can also filter sequences by global min and max length thresholds, but is designed for more specific use cases, e.g., where different length thresholds are desired for different taxonomic groups. Sometimes a quick filter is desired, irrespective of taxonomy, and the
filter-seqs-length method provides a quick, and much faster, fix.
This method is useful for both reference sequences as well as experimentally derived sequences — for example, this method could be used to filter out abnormally short or long sequences from a set of “representative sequences” (e.g., following denoising), as length outliers often indicate either non-target sequences or mismatched merging of paired-end reads.
Here is an example using the SILVA sequences from above.
qiime rescript filter-seqs-length \ --i-sequences 85_otus.qza \ --p-global-min 900 \ --p-global-max 1500 \ --o-filtered-seqs filtered-seqs.qza \ --o-discarded-seqs discarded-seqs.qza
Sequences within the length thresholds will be saved to
filtered-seqs.qza, and those that do not will be saved to
discarded-seqs.qza just in case you want to take a look!
Orient sequences by alignment to reference
Sometimes sequences wind up in mixed orientations, i.e., a single file can contain reads from both the forward and reverse strands of a DNA sequence. This is most common in query sequences derived from non-directionalized library preparation methods, but is occasionally seen in reference sequence collections that are compiled from different sources and/or have not been aligned. The
orient-seqs method can orient these sequences by comparison to a reference sequence set known to be oriented in the correct direction.
qiime rescript orient-seqs \ --i-sequences query-sequences.qza \ --i-reference-sequences reference-sequences.qza \ --o-oriented-seqs oriented-query-sequences.qza \ --o-unmatched-seqs unmatched-sequences.qza
Any sequences that match are output in the correct(ed) orientation are saved to
oriented-query-sequences.qza. Any sequences that do not match any sequences in the reference (and hence cannot be oriented) are saved to
unmatched-sequences.qza for inspection. Note that this method has various parameters to control the alignment similarity thresholds (as orientation is performed by using VSEARCH global alignment under the hood).
It is not uncommon to find reference sequences only provided in RNA form, as is the case with the parsing of the SILVA reference data we performed earlier. What actually happened, “behind the scenes”, was that the SILVA sequences were reverse transcribed from RNA to DNA. Did you notice the
FeatureData[RNASequence] type when we manually imported the sequences at the beginning? We’ll take that file and reverse transcribe manually here.
qiime rescript reverse-transcribe \ --i-rna-sequences silva-138.1-ssu-nr99-seqs.qza \ --o-dna-sequences silva-138.1-ssu-nr99-dna.qza
There may be an occasion where the sequence data you are interested in using, only exists in DNA alignment form. Perhaps you misplaced your original unaligned file, but still have the sequence alignment available. We provide a way in which you can ‘unalign’ or ‘degap’ your DNA alignment file. Keep your peeled for future improvements, e.g. handling RNA alignments.
qiime rescript degap-seqs \ --i-aligned-sequences my-seq-alignment.qza --o-degapped-sequences -my-degapped-seqs.qza
Tip: If you would like to also remove any sequences that are too short, you can set the
--p-min-lengthoption. This parameter is applied after de-gapping and is useful for, e.g., dropping sequences that are mostly gaps!
Editing taxonomyOften the taxonomy associated with a given sequence may be misannotated. This can occur for a variety of reasons ranging from simple mistakes, to inconsistent annotations / updates within taxonomic groups, through changes in nomenclatural rules to define taxonomy, and etc.
Typical use cases:
- Edit the taxonomy prior to generating a taxonomic classifier.
- Make taxonomy more "friendly" for plotting figures
- Relabel non-official / un-accepted taxonomic labels, which are likely to cause confusion.
- Truncate taxonomic names (e.g. based on a regular expression) so that the full lineage is not displayed in something like a heatmap.
- Or any other reasons we've not listed here.
The table below, for example, highlights a case in which the genus-level annotation for the species Salmonella enterica is inconsistent. The two labels are used
|AB680788.1.1466||d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Salmonella; s__Salmonella_enterica|
|AB680791.1.1466||d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Salmonella; s__Salmonella_enterica|
|CZLR01000032.33487.34870||d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia-Shigella; s__Salmonella_enterica|
|KM244788.1.1511||d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia-Shigella; s__uncultured_Salmonella|
When reference sequences are identical (or nearly so), but have inconsistent taxonomic information, the ability of our reference database to accurately classify our query sequences decreases. In this case, we'd never be able to classify any of our S. enterica sequences to either the genus or species level. That is, our classifier would "see" the conflicting genus labels and opt to simply report the Lowest Common Ancestor (LCA) our query sequence. That is, our S. enterica sequences would be classified only to the family-level, i.e. Enterobacteriaceae. Normally, we'd have to manually edit these taxonomic labels so that they are all consistent, prior to training our taxonomic classifier.
Fortunately, we've made it easier to make such edits via
rescript edit-taxonomy. This action takes two list of terms: search strings and replacement strings. These can be provided either at the command line or via a tab-delimited metadata file. Below we'll show two ways to "fix" the inconsistent genus labels for S. enterica.
For our example we'll convert all occurrences of
g__Escherichia-Shigella;, while also converting
s__uncultured_Escherichia-Shigella. Note, these are simply examples to highlight multiple search and replacement strings, do not take this as an authoritative statement of Salmonella taxonomy!
Command line example:
qiime rescript edit-taxonomy \ --i-taxonomy my-taxonomy.qza \ --p-search-strings g__Salmonella; s__uncultured_Salmonella \ --p-replacements-strings g__Escherichia-Shigella; s__uncultured_Escherichia-Shigella \ --o-edited-taxonomy my-taxonomy-fixed.qza
As you can see, each search term in
--p-search-strings is replaced by its corresponding replacement term in
--p-replacements-strings, separated by spaces. Yes, in this example, the
; is indeed part of the search string itself, it is not a delimiter.)
Alternatively, you can provide the search and replacement strings as a metadata file, in the form of:
Note: the first column must contain one of the allowed identifier column headers such as
id. This column contains all of the strings to search for. The second column can be of any name, and contains all the corresponding replacements strings.
When using the metadata file approach our command would then become:
qiime rescript edit-taxonomy \ --i-taxonomy my-taxonomy.qza \ --m-replacement-map-file taxonomy-replacements.tsv \ --m-replacement-map-column replacements \ --o-edited-taxonomy my-taxonomy-fixed.qza
Tip: if search and replace terms are provided via both the command line and a metadata file, they will be combined. Also, note that the search and replace terms will only be recorded in provenance when using the
--p-replacements-stringsat the command line. Finally, for more advanced users, regular expressions can be enabled via the
Now you have consistent taxonomic labels for your a few of your reference sequences!
Merging taxonomy results
merge-taxa method supports merging two or more taxonomy artifacts into a single taxonomy artifact based on different criteria. Currently, the merging modes supported are:
lca, which creates a consensus taxonomy by finding the least common ancestor of any intersecting taxonomy strings in the inputs.
len, which selects the taxonomy with the most elements (e.g., if one taxonomy is assigned to species level and the other is not)
score, which selects the taxonomy with the top score for any intersecting taxonomy strings in the inputs.
This method is useful for creating “ensemble” taxonomic classifications, i.e., you can taxonomically classify your query sequences with two or more methods and then find a consensus taxonomy between them (e.g., using
More examples of this method will be added later. For now, see
qiime rescript merge-taxa --help to learn more about this method.