Processing, filtering, and evaluating (reference) sequence data with RESCRIPt

:construction: Please consider this tutorial a living document, which may change based upon community feedback and ongoing plugin development.

RESCRIPt

RESCRIPt (REference Sequence annotation and CuRatIon Pipeline) is a python package and QIIME 2 plugin for formatting, managing, and manipulating sequence reference databases. This package was designed for compiling, manipulating, and evaluating sequence reference databases from SILVA, NCBI, Greengenes, GTDB, and other sources, and for constructing reference databases for use with QIIME 2 (or other microbiome analysis software and taxonomy classifiers). This tutorial describes several primary pipelines and actions available in RESCRIPt.

Citation:

A proper software announcement is in progress. For now, if you use RESCRIPt or any RESCRIPt-processed data in your research, please cite the following:

:fireworks: Provenance tracked sequence database generation. Woohoo! :fireworks:

Table of Contents

  1. Preparing the SILVA reference database
    a. Getting SILVA data the easy way
    b. “Culling” low-quality sequences with cull-seqs
    c. Filtering sequences by length and taxonomy
    d. Dereplication of sequences and taxonomy
    e. Make amplicon-region specific classifier
  2. Database (and sequence) evaluation functions
    a. Sequence database evaluation with the evaluate-* family of actions
    b. Evaluate classification accuracy with evaluate-classifications
    c. Evaluate taxonomic information with evaluate-taxonomy
  3. General sequence data operations
    a. Filtering sequences by length
    b. Orient sequences by alignment to reference
    c. Reverse Transcribe
    d. De-gapping alignments
  4. Merging taxonomy results

:partying_face:
Note: The tutorial below uses primarily SILVA data by way of comparison, but note that all of the steps demonstrated beyond step 1a can be applied to any sequence reference data.

Preparing the SILVA reference database

We’ll use RESCRIPt to prepare a QIIME 2 compatible SSU SILVA reference database based on the curated NR99 (version 138) database. We chose this version mainly for the reasons outlined here.

SILVA compilation pipeline

  • Below is a simple outline of the steps involved for constructing a QIIME 2 compatible reference from SILVA.
  • Begin by downloading the relevant taxonomy and sequence files from the SILVA.
  • Import these files into QIIME 2.
  • Prepare a 7-rank taxonomy file.
  • Remove sequences with excessive degenerate bases and homopolymers.
  • Remove sequences that may be too short and/or long. With the option to condition the length filtering based on taxonomy.
  • Dereplicate the sequences and taxonomy.
  • Build our classifier.

:warning:
NOTE: pre-processed SILVA sequence and taxonomy Artifacts (and taxonomy classifiers) have been generated and released by the QIIME 2 team, following the same steps described below. You can get the pre-processed Artifacts and classifiers here: https://docs.qiime2.org/2020.6/data-resources/

For purposes of this tutorial, we’ll use the current SILVA SSU release (version 138). There are two approaches you can use to import and process the SILVA reference taxonomy and sequences for use in QIIME 2. We’ll start with “Getting SILVA data the easy way”. However, we recommend that you at least read through the “hard-way” steps to understand what RESCRIPt is doing, and how the SILVA taxonomy is parsed).

Getting SILVA data: Hard Mode.

The gritty details

Download SILVA files

First, we’ll need to go to the SILVA v138 archive to obtain:

the following taxonomy files:

  • tax_slv_ssu_138.txt.gz
  • taxmap_slv_ssu_ref_nr_138.txt.gz
  • tax_slv_ssu_138.tre.gz
    the sequence file:
  • SILVA_138_SSURef_Nr99_tax_silva_trunc.fasta.gz

You can download the files through your browser directly. We’ll make use of wget to download the files from the command line, then gunzip these files prior to importing into QIIME 2.

Download the Taxonomy Rank file. This maps the taxonomic rank and taxonomy to the taxid.

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138/Exports/taxonomy/tax_slv_ssu_138.txt.gz

gunzip tax_slv_ssu_138.txt.gz

Download the Taxonomy Map file. This maps the sequence Accessions to the Organism Name and Taxonomy IDs.

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138/Exports/taxonomy/taxmap_slv_ssu_ref_nr_138.txt.gz

gunzip taxmap_slv_ssu_ref_nr_138.txt.gz

Download the Taxonomy Tree file. This file contains the hierarchical relationship of the taxonomy IDs in tree form.

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138/Exports/taxonomy/tax_slv_ssu_138.tre.gz

gunzip tax_slv_ssu_138.tre.gz

Download the SILVA NR99 sequences (non-redundant and unaligned)

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138/Exports/SILVA_138_SSURef_NR99_tax_silva_trunc.fasta.gz

gunzip SILVA_138_SSURef_NR99_tax_silva_trunc.fasta.gz

Import SILVA files into QIIME 2

Import the Taxonomy Rank file:

qiime tools import \
    --type 'FeatureData[SILVATaxonomy]' \
    --input-path tax_slv_ssu_138.txt \
    --output-path taxranks-silva-138-ssu-nr99.qza \

Import the Taxonomy Mapping file

qiime tools import \
    --type 'FeatureData[SILVATaxidMap]' \
    --input-path taxmap_slv_ssu_ref_nr_138.txt \
    --output-path taxmap-silva-138-ssu-nr99.qza \

Import the Taxonomy Hierarchy Tree file:

qiime tools import \
    --type 'Phylogeny[Rooted]' \
    --input-path tax_slv_ssu_138.tre \
    --output-path taxtree-silva-138-nr99.qza \

Import the sequence file:

qiime tools import \
    --type 'FeatureData[RNASequence]' \
    --input-path SILVA_138_SSURef_NR99_tax_silva_trunc.fasta \
    --output-path silva-138-ssu-nr99-seqs.qza

We are now ready to proceed with making our SILVA reference database within QIIME 2. First we’ll need to prepare the silva taxonomy prior to use. We’ll use parse-silva-taxonomy to do this and use the optional flag to include the species labels. But be wary, there are species label annotations that may be spurious! See the caveats about using species-labels later in this document.

qiime rescript parse-silva-taxonomy \
    --i-taxonomy-tree taxtree-silva-138-nr99.qza \
    --i-taxonomy-map taxmap-silva-138-ssu-nr99.qza \
    --i-taxonomy-ranks taxranks-silva-138-ssu-nr99.qza \
    --p-include-species-labels \
    --o-taxonomy silva-138-ssu-nr99-tax.qza

Great work! We now have a properly formatted QIIME 2 compatible SILVA taxonomy and sequence files. We’ll use these for the remaining downstream steps.

:arrow_up: Return to Table of Contents

Getting SILVA data the easy way

We actually have a nice convenience method to perform all the tasks outlined in the Hard Mode section, using a single command!* Which is… :drum: :

qiime rescript get-silva-data \
    --p-version '138' \
    --p-target 'SSURef_NR99' \
    --p-include-species-labels \
    --o-silva-sequences silva-138-ssu-nr99-seqs.qza \
    --o-silva-taxonomy silva-138-ssu-nr99-tax.qza

This pipeline will essentially perform all of the steps we outlined above and return a ready-to-use sequence and taxonomy file for any other curation steps you’d like to perform. We highly recommend that you perform some additional quality-control on these files, as we’ll outline below. Feel free to alter the pipeline that best suits your needs. Also note we are using the --p-include-species-labels flag. For more information please see the hidden details below:

Species-labels: caveat emptor!

The examples below highlight a case where the source / host organism was used as the annotation for the species label, rather than the organism from which the sequence is actually derived from:

d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacteriales; f__Enterobacteriaceae; g__Serratia; s__Oryza_sativa
d__Eukaryota; p__Arthropoda; c__Insecta; o__Hemiptera; f__Hemiptera; g__Hemiptera; s__Oryza_sativa

More specifically, we have an insect and a bacterial sequence which are both annotated at the species-level with the scientific name of the host organism from which they were observed, Oryza sativa (rice). Users should be generally cautious about species labels. As far as we are aware, SILVA does not necessarily curate the species-level taxonomy. This is why we optionally provide access to this information. You have been warned.

Also note, we only return the first two “words” of the scientific / organism name. This is important as you may have (nearly) identical sequences that point to very slightly different species-label annotations, such as:

  • s__Clostridioides_difficile
  • s__Clostridioides_difficile_R20291

So, if your sequence is similar to these, you’d think it should be classified as s__Clostridioides_difficile. This will not be the case, as the species-label strings are different. What the classifier may actually return is the upper-level taxonomy g__Clostridioides. This is not the fault of the classifier per se, but a problem of annotation. Because of this we only return the first two words (i.e. Clostridioides and difficile) of the “species-label” string.

Tip: if you opted to not include species labels in your reference taxonomy (which may be spurious anyway), the generation of your SILVA classifier will be substantially quicker, while also requiring less memory and storage space!

:arrow_up: Return to Table of Contents

“Culling” low-quality sequences with cull-seqs

Here we’ll remove sequences that contain 5 or more ambiguous bases (IUPAC compliant ambiguity bases) and any homopolymers that are 8 or more bases in length. These are the default parameters. See the --help text for more details.

qiime rescript cull-seqs \
    --i-sequences silva-138-ssu-nr99-seqs.qza \
    --o-clean-sequences silva-138-ssu-nr99-seqs-cleaned.qza

:arrow_up: Return to Table of Contents

Filtering sequences by length and taxonomy

Rather than blindly filter all of the reference sequences below a certain length, we’ll differentially filter based on the taxonomy of the reference sequence. The reason: if we decide to remove any sequences below 1000 or 1200 bp, then many of the reference sequences associated with Archaea (and some Bacteria) will be lost. This will potentially increase the retention of shorter and lower-quality Bacterial or Eukaryal sequences. Ultimately causing undue database selection bias. So, we’ll attempt to mitigate these issues by differentially filtering based on length. We will remove rRNA gene sequences that do not meet the following criteria: Archaea (16S) >= 900 bp, Bacteria (16S) >= 1200 bp, and any Eukaryota (18S) >= 1400 bp. See help text for more info.

qiime rescript filter-seqs-length-by-taxon \
    --i-sequences silva-138-ssu-nr99-seqs-cleaned.qza \
    --i-taxonomy silva-138-ssu-nr99-tax.qza \
    --p-labels Archaea Bacteria Eukaryota \
    --p-min-lens 900 1200 1400 \
    --o-filtered-seqs silva-138-ssu-nr99-seqs-filt.qza \
    --o-discarded-seqs silva-138-ssu-nr99-seqs-discard.qza 

:arrow_up: Return to Table of Contents

Dereplication of sequences and taxonomy

Given the notes outlined for the SILVA 138 NR99 release, there may be identical full-length sequences with either identical or different taxonomies. We’ll proceed to dereplicate this data before moving forward. This will help remove redundant sequence data from the database prior to downstream processing. RESCRIPt provide several options for sequence-taxonomy dereplication. See the hidden details below for more information.

Dereplication considerations

Depending on your final use case for this taxonomic information, and the classifier you wish to build, there are some things to consider:

  • Do you plan to leverage tools like clawback?
  • Do you plan to make use of closed-reference clustering approaches?
  • Perhaps you simply would like a database in which there is one consensus taxonomy per unique sequence? Which might be required for other taxonomic classification tools.

The difference between these three considerations, and the way in which you can dereplicate your reference data, is as follows:

  • If you do plan to make use of tools such as q2-clawback, it would be ideal to keep sequences that are identical, despite having differing taxonomy. While any data that match in both their taxonomy and sequence will simply be dereplicated to reduce the database size. For more details about why this is helpful, see Kaehler et al. 2019. Use --p-mode uniq.

  • If you plan to make use of a variety of close-reference-like approaches, and/or simply would like your reference database to simply have a single consensus taxonomy for a given unique sequence… then complete dereplication of your reference sequences and taxonomy is required, i.e. you’ll have to make a consensus taxonomy. There are two options for creating a consensus taxonomy for this use case: lca (lowest common ancestor), and majority (finds the most common taxonomic label associated with a given unique sequence).

If you take a peek at the taxonomy after dereplication, you may notice taxonomy strings that appear like this:

d__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales; f__; g__; s__

In this case the consensus taxonomy, constructed by lca (lowest common ancestor), determined that d__Bacteria; p__Firmicutes; c__Bacilli; o__Bacillales is the LCA taxonomy. The lower-level taxonomy is then filled in with empty place-holders f__; g__; s__. This back-filling of rank placeholders is required feature-classifier fit-classifier-naive-bayes and feature-classifier classify-sklearn, as all rank positions must have a value.

Finally, If you wish to cluster the reference sequences prior to building your classifier, you can do so by adding the --p-perc-identity parameter. Although not always necessary, this is helpful for a couple of practical reasons, e.g. to save on memory and/or storage requirements. In the example below we construct a 97% reference sequence database with a ‘lca’ computed consensus taxonomy. You can do this my modifying the previous command as follows:

qiime rescript dereplicate \
    --i-sequences silva-138-ssu-nr99-seqs-filt.qza \
    --i-taxa silva-138-ssu-nr99-tax.qza \
    --p-perc-identity 0.97 \
    --p-rank-handles 'silva' \
    --p-mode 'lca' \
    --o-dereplicated-sequences silva-138-ssu-c97-seqs-derep-lca.qza \
    --o-dereplicated-taxa silva-138-ssu-c97-tax-derep-lca.qza

Dereplicating in uniq mode

Here we will use the default uniq approach. That is, we’ll retain identical sequence records that have differing taxonomies. We’ll specify the option here for the sake of clarity, but feel free to use any of the --p-mode options that make sense to you.

qiime rescript dereplicate \
    --i-sequences silva-138-ssu-nr99-seqs-filt.qza  \
    --i-taxa silva-138-ssu-nr99-tax.qza \
    --p-rank-handles 'silva' \
    --p-mode 'uniq' \
    --o-dereplicated-sequences silva-138-ssu-nr99-seqs-derep-uniq.qza \
    --o-dereplicated-taxa silva-138-ssu-nr99-tax-derep-uniq.qza

:sparkler: Wonderful! :sparkler:
Now we are ready to make our classifier for use on full-length SSU sequences. :hammer_and_wrench:

qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads  silva-138-ssu-nr99-seqs-derep-uniq.qza \
  --i-reference-taxonomy silva-138-ssu-nr99-tax-derep-uniq.qza \
  --o-classifier silva-138-ssu-nr99-classifier.qza

The RESCRIPt pipeline evaluate-fit-classifier can also be used to substitute fit-classifier-naive-bayes and evaluate the classifier all in one go. See below for more details.

:arrow_up: Return to Table of Contents

Make amplicon-region specific classifier

Here, we’ll go into more detail about how to generate an amplicon-specific classifier. Constructing such a classifier allows for more robust taxonomic classification of your data (Werner et al. 2011, Bokulich et al. 2018).

Here, you’ll use the same primer sequences that you used for your PCR / sequencing, to extract the amplicon region from our reference database. Here, we’ll extract the commonly used V4 EMP Primers, and use our pre-filtered full-length sequences from above. Also, note that we’ll set --p-read-orientation 'forward', as the SILVA database is curated to be in the same “forward” orientation. This will allow us to process the data more quickly w/o having to account for mixed-orientation sequences during our primer search. We’ll continue by making use of the files from previous steps.

qiime feature-classifier extract-reads \
    --i-sequences silva-138-ssu-nr99-seqs-derep-uniq.qza \
    --p-f-primer GTGYCAGCMGCCGCGGTAA \
    --p-r-primer GGACTACNVGGGTWTCTAAT \
    --p-n-jobs 2 \
    --p-read-orientation 'forward' \
    --o-reads silva-138-ssu-nr99-seqs-515f-806r.qza

Note / Tip: Depending on your goals, it may also be reasonable to use the raw imported sequences, or output from either cull-seqs or filter-seqs-length[-by-taxon] as input into the above extract-reads command. Be aware that you may need to run reverse-transcribe prior to other steps outlined in this tutorial.

Dereplicate extracted region

Even though we already dereplicated our full-length sequences, we’ll do so again as the extracted amplicon regions may now be identical over this shorter region.

  • We may have unique sequences that point to quite different taxonomies (e.g. different genera). Thus, we need to decide how we’d like to handle the taxonomy by choosing one of the available --p-mode options.
  • Conversely, we may have had many different full-length sequences with identical taxonomy, and now these records, after extracting the amplicon region, are identical. As a result of this, we can reduce our database size by simply dereplicating them, as we’ve done earlier.
qiime rescript dereplicate \
    --i-sequences silva-138-ssu-nr99-seqs-515f-806r.qza \
    --i-taxa silva-138-ssu-nr99-tax-derep-uniq.qza \
    --p-rank-handles 'silva' \
    --p-mode 'uniq' \
    --o-dereplicated-sequences silva-138-ssu-nr99-seqs-515f-806r-uniq.qza \
    --o-dereplicated-taxa  silva-138-ssu-nr99-tax-515f-806r-derep-uniq.qza

:tada: Great! Now we can build our amplicon-region specific classifier. :hammer_and_wrench:

qiime feature-classifier fit-classifier-naive-bayes \
    --i-reference-reads silva-138-ssu-nr99-seqs-515f-806r-uniq.qza \
    --i-reference-taxonomy silva-138-ssu-nr99-tax-515f-806r-derep-uniq.qza \
    --o-classifier silva-138-ssu-nr99-515f-806r-classifier.qza

Tip: The RESCRIPt pipeline evaluate-fit-classifier can also be used to substitute fit-classifier-naive-bayes and evaluate the classifier all in one go. See below for more details.

:arrow_up: Return to Table of Contents

Database (and sequence) evaluation functions

In addition to methods for generating and curating sequence reference databases, RESCRIPt has several methods for evaluating databases and database quality.

If you wish to follow along download the following GreenGenes 85% reference data files:

85_otus.qza: view | download
ref-taxonomy.qza view | download

:arrow_up: Return to Table of Contents

Sequence database evaluation with the `evaluate-*` family of actions

Note: training and classifying full datasets is a time-consuming task; these are powerful complete benchmark methods, not quick estimates. So be prepared to wait several hours to complete testing of large databases such as SILVA (and note that parallelization options are available for some methods). :alarm_clock:

One way to evaluate database quality is to classify the sequences in the database. RESCRIPt currently has two methods to do this:

Click to read details
  1. evaluate-cross-validate performs k-fold cross-validation to simulate classification tasks where query sequences are not present in the database and may not have an exact match. The database is split into K folds for testing; classifiers are trained K times on the remaining sequences; and the test set is classified using the corresponding classifier. This way each sequence is in the test set exactly K times and in the training set exactly K-1 times. Sequences are stratified so that members of the species are split between each training and test set, but in cases of singleton species RESCRIPt trims the taxonomic labels to reflect the best possible classification (usually, trimming the expected taxonomic classification to genus level if no other members of the species are present in the training set). This method outputs the predicted taxonomic classifications for each sequence, as well as the expected taxonomies (after stratification/trimming). These can be used as input to evaluate-classifications (see below) to estimate classifier accuracy.
  2. evaluate-fit-classifier trains and tests a single classifier on the entire set of sequences. This makes a “perfect classifier” (with intentional data leakage) to simulate the best possible classification accuracy, when all query sequences have a perfect match in the reference database. This will enable two things: (a) it provides an estimate of best possible accuracy using evaluate-classifications (see below) and (b) this outputs a trained classifier that is identical to the one that the q2-feature-classifier action fit-classifier-naive-bayes generates (that method is wrapped by RESCRIPt), and can be used for further classification tasks (instead of using q2-feature-classifier separately). Hence, this method generates a trained classifier AND gives you an estimate of its best possible accuracy.

For example, we can train a classifier using RESCRIPt (and q2-feature-classifier) with the following command as a replacement for the fit-classifier-naive-bayes command given above:

qiime rescript evaluate-fit-classifier \
    --i-sequences 85_otus.qza \
    --i-taxonomy ref-taxonomy.qza  \
    --o-classifier gg85-otu-taxonomy-classifier.qza \
    --o-observed-taxonomy gg85-otu-taxonomy-predicted-taxonomy.qza \
    --o-evaluation gg85-otu-taxonomy-fit-classifier-evaluation.qzv 

Output Visualization
gg85-otu-taxonomy-fit-classifier-evaluation.qzv: view | download (386.9 KB)

Using the Greengenes 85% OTUs is an easy classification task, so the results sure are boring! :sleeping: We see 100% classification accuracy, but this is an unrealistically simple task. If you don’t mind waiting, try it on a more complex dataset like the Greengenes 97% OTUs.

The classifier gg85-otu-taxonomy-classifier.qza output by this command is identical to what would be generated by q2-feature-classifier fit-classifier-naive-bayes (since q2-feature-classifier is used in the pipeline!). The output of evaluate-classifications is generated as part of this pipeline, so you can see the simulated classification accuracy of our GreenGenes classifier in gg85-otu-taxonomy-fit-classifier-evaluation.qzv. The predicted taxonomy is also output in case you want to manually run evaluate-classifications or inspect the results, e.g., with qiime metadata tabulate.

:arrow_up: Return to Table of Contents

Evaluate classification accuracy with evaluate-classifications

This method takes a list of observed taxonomies and a corresponding list of expected taxonomies (which must be input in the same order) and calculates various metrics to evaluate classification accuracy. The command below replicates what was already done by the evaluate-fit-classifier pipeline above, to evaluate classification accuracy of our GreenGenes classifier. Note the labels parameter used below — this is just used to label each classification result corresponding to the inputs.

qiime rescript evaluate-classifications \
	--i-expected-taxonomies ref-taxonomy.qza \
	--i-observed-taxonomies gg85-otu-taxonomy-predicted-taxonomy.qza \
	--p-labels gg85 \
	--o-evaluation gg85-otu-taxonomy-classifier-evaluation.qzv

Output visualization
gg85-otu-taxonomy-classifier-evaluation.qzv: view | download (386.9 KB)
Note: this will be identical to the output of evaluate-fit-classifier. And equally boring, for the reasons explained above…

:arrow_up: Return to Table of Contents

Evaluate taxonomic information with `evaluate-taxonomy

This method provides information about the number, entropy, and other characteristics of taxonomic labels at each rank in an input taxonomy. It is much less time-consuming than the classification methods above, and provides valuable information about the amount of “information” in a taxonomy artifact. This is very useful for reference databases, but also potentially useful for biological results as well (e.g., to evaluate the depth and richness of taxonomic classifications). It can take multiple taxonomies as input, so let’s look at our observed and expected taxonomies from above to calculate various stats:

qiime rescript evaluate-taxonomy \
  --i-taxonomies ref-taxonomy.qza gg85-otu-taxonomy-predicted-taxonomy.qza \
  --p-labels gg85-taxonomy gg85-predicted-taxonomy \
  --o-taxonomy-stats gg85-taxonomy-evaluation.qzv

Output visualization
gg85-taxonomy-evaluation.qzv: view | download (387.7 KB)

:arrow_up: Return to Table of Contents

General sequence data operations

Many of the operations described in the SILVA formatting tutorial above can be applied to other reference databases in general. In addition, RESCRIPt contains several other actions that can be used for managing and manipulating both reference sequences and other sequences (e.g., experimentally derived observations).

:arrow_up: Return to Table of Contents

Filtering sequences by length

The filter-seqs-length method uses VSEARCH under the hood to quickly filter sequences globally by length. The filter-seqs-length-by-taxon method described above can also filter sequences by global min and max length thresholds, but is designed for more specific use cases, e.g., where different length thresholds are desired for different taxonomic groups. Sometimes a quick filter is desired, irrespective of taxonomy, and the filter-seqs-length method provides a quick, and much faster, fix.

This method is useful for both reference sequences as well as experimentally derived sequences — for example, this method could be used to filter out abnormally short or long sequences from a set of “representative sequences” (e.g., following denoising), as length outliers often indicate either non-target sequences or mismatched merging of paired-end reads.

Here is an example using the SILVA sequences from above.

qiime rescript filter-seqs-length \
	--i-sequences 85_otus.qza \
	--p-global-min 900 \
	--p-global-max 1500 \
	--o-filtered-seqs filtered-seqs.qza \
	--o-discarded-seqs discarded-seqs.qza

Sequences within the length thresholds will be saved to filtered-seqs.qza, and those that do not will be saved to discarded-seqs.qza just in case you want to take a look!

:arrow_up: Return to Table of Contents

Orient sequences by alignment to reference

Sometimes sequences wind up in mixed orientations, i.e., a single file can contain reads from both the forward and reverse strands of a DNA sequence. This is most common in query sequences derived from non-directionalized library preparation methods, but is occasionally seen in reference sequence collections that are compiled from different sources and/or have not been aligned. The orient-seqs method can orient these sequences by comparison to a reference sequence set known to be oriented in the correct direction.

qiime rescript orient-seqs \
	--i-sequences query-sequences.qza \
	--i-reference-sequences reference-sequences.qza \
	--o-oriented-seqs oriented-query-sequences.qza \
	--o-unmatched-seqs unmatched-sequences.qza

Any sequences that match are output in the correct(ed) orientation are saved to oriented-query-sequences.qza. Any sequences that do not match any sequences in the reference (and hence cannot be oriented) are saved to unmatched-sequences.qza for inspection. Note that this method has various parameters to control the alignment similarity thresholds (as orientation is performed by using VSEARCH global alignment under the hood).

:arrow_up: Return to Table of Contents

Reverse Transcribe

It is not uncommon to find reference sequences only provided in RNA form, as is the case with the parsing of the SILVA reference data we performed earlier. What actually happened, “behind the scenes”, was that the SILVA sequences were reverse transcribed from RNA to DNA. Did you notice the FeatureData[RNASequence] type when we manually imported the sequences at the beginning? We’ll take that file and reverse transcribe manually here.

qiime rescript reverse-transcribe \
    --i-rna-sequences silva-138-ssu-nr99-seqs.qza  \
    --o-dna-sequences silva-138-ssu-nr99-dna.qza

:arrow_up: Return to Table of Contents

De-gapping alignments

There may be an occasion where the sequence data you are interested in using, only exists in DNA alignment form. Perhaps you misplaced your original unaligned file, but still have the sequence alignment available. We provide a way in which you can ‘unalign’ or ‘degap’ your DNA alignment file. Keep your :eyes: peeled for future improvements, e.g. handling RNA alignments.

qiime rescript degap-seqs \
   --i-aligned-sequences my-seq-alignment.qza
   --o-degapped-sequences -my-degapped-seqs.qza

Tip: If you would like to also remove any sequences that are too short, you can set the --p-min-length option. This parameter is applied after de-gapping and is useful for, e.g., dropping sequences that are mostly gaps!

:arrow_up: Return to Table of Contents

Merging taxonomy results

The merge-taxa method supports merging two or more taxonomy artifacts into a single taxonomy artifact based on different criteria. Currently, the merging modes supported are:

  1. lca, which creates a consensus taxonomy by finding the least common ancestor of any intersecting taxonomy strings in the inputs.
  2. len, which selects the taxonomy with the most elements (e.g., if one taxonomy is assigned to species level and the other is not)
  3. score, which selects the taxonomy with the top score for any intersecting taxonomy strings in the inputs.

This method is useful for creating “ensemble” taxonomic classifications, i.e., you can taxonomically classify your query sequences with two or more methods and then find a consensus taxonomy between them (e.g., using lca mode).

More examples of this method will be added later. For now, see qiime rescript merge-taxa --help to learn more about this method.

6 Likes

This is just perfect. Thank you all!

2 Likes

A couple of super short observations @SoilRotifer @Nicholas_Bokulich as I’m following this tutorial for COI work. The block of code following the following section needs a couple of tiny adjustments:

As it currently is shown:

qiime rescript dereplicate \
    --i-sequences silva-138-ssu-nr99-seqs-filt.qza \
    --i-taxa silva-138-ssu-nr99-tax.qza \
    --i--p-perc-identity 97 \
...

It’s just a type with the last line I’ve shown in the code block above ^^.

Replace --i--p-perc-identity with --p-perc-identity like this:

qiime rescript dereplicate \
    --i-sequences silva-138-ssu-nr99-seqs-filt.qza \
    --i-taxa silva-138-ssu-nr99-tax.qza \
    --p-perc-identity 97 \
...

Along the same lines, the expected value in the --p-perc-identity is now expected to be a value between 0 and 1, instead of an integer from 0 to 100. So in reality, the code I’d need to run would be:

qiime rescript dereplicate \
    --i-sequences silva-138-ssu-nr99-seqs-filt.qza \
    --i-taxa silva-138-ssu-nr99-tax.qza \
    --p-perc-identity 0.97 \
...

Thanks for the great tool - more to come on my end soon with the COI tests.

2 Likes

Nice catch @devonorourke!