Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt

SoilRotifer · June 25, 2020, 4:33pm

Please consider this tutorial a living document, which may change based upon community feedback and ongoing plugin development.

RESCRIPt

RESCRIPt (REference Sequence annotation and CuRatIon Pipeline) is a python package and QIIME 2 plugin for formatting, managing, and manipulating sequence reference databases. This package was designed for compiling, manipulating, and evaluating sequence reference databases from SILVA, NCBI, Greengenes, GTDB, and other sources, and for constructing reference databases for use with QIIME 2 (or other microbiome analysis software and taxonomy classifiers). This tutorial describes several primary pipelines and actions available in RESCRIPt, with a focus on the SILVA 16S rRNA gene database.

Note: This tutorial focuses on use of RESCRIPt with SILVA data; see here for other tutorials using NCBI, COI data, and more!

Citation:

If you use RESCRIPt or any RESCRIPt-processed data in your research, please cite the following:

Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581. doi: 10.1371/journal.pcbi.1009581

If you make use of SILVA (the example database highlighted in this tutorial), please be sure to cite the following in your work too:

Pruesse, Elmar, Christian Quast, Katrin Knittel, Bernhard M. Fuchs, Wolfgang Ludwig, Jörg Peplies, and Frank Oliver Glöckner. 2007. “SILVA: A Comprehensive Online Resource for Quality Checked and Aligned Ribosomal RNA Sequence Data Compatible with ARB.” Nucleic Acids Research 35 (21): 7188–96. doi: 10.1093/nar/gkm864
Quast, Christian, Elmar Pruesse, Pelin Yilmaz, Jan Gerken, Timmy Schweer, Pablo Yarza, Jörg Peplies, and Frank Oliver Glöckner. 2013. “The SILVA Ribosomal RNA Gene Database Project: Improved Data Processing and Web-Based Tools.” Nucleic Acids Research 41: D590–96. doi: 10.1093/nar/gks1219

Provenance tracked sequence database generation. Woohoo!

Preparing the SILVA reference database
a. Getting SILVA data the easy way
b. “Culling” low-quality sequences with cull-seqs
c. Filtering sequences by length and taxonomy
d. Dereplication of sequences and taxonomy
e. Make amplicon-region specific classifier
Database (and sequence) evaluation functions
a. Sequence database evaluation with the evaluate-* family of actions
b. Evaluate classification accuracy with evaluate-classifications
c. Evaluate taxonomic information with evaluate-taxonomy
General sequence and taxonomy data operations
a. Filtering sequences by length
b. Orient sequences by alignment to reference
c. Reverse Transcribe
d. De-gapping alignments
e. Editing taxonomy
Merging taxonomy results
Visualization gallery
Available Workflow Pipelines

Note: The tutorial below uses primarily SILVA data by way of comparison, but note that all of the steps demonstrated beyond step 1a can be applied to any sequence reference data. Feel free to modify any of the steps in this tutorial, or their order, to best suit your needs!

Preparing the SILVA reference database

We'll use RESCRIPt to prepare a QIIME 2 compatible SSU SILVA reference database based on the curated NR99 (version 138.2) database. We chose this version mainly for the reasons outlined here.

SILVA compilation pipeline

Below is a simple example outline of the steps involved for constructing a QIIME 2 compatible reference from SILVA.
Begin by downloading the relevant taxonomy and sequence files from SILVA.
Import these files into QIIME 2.
Prepare a fixed-rank taxonomy file.
Remove sequences with excessive degenerate bases and homopolymers.
Remove sequences that may be too short and/or too long, optionally conditioning the length filtering on taxonomy.
Dereplicate the sequences and taxonomy.
Build our classifier.

NOTE: pre-processed SILVA sequence and taxonomy Artifacts (and taxonomy classifiers) have been generated and released by the QIIME 2 team, following the same steps described below. You can get the pre-processed Artifacts and classifiers here: Data resources — QIIME 2 2024.2.0 documentation

For purposes of this tutorial, we’ll use the current SILVA SSU release (version 138.2). There are two approaches you can use to import and process the SILVA reference taxonomy and sequences for use in QIIME 2. We’ll start with “Getting SILVA data the easy way”. However, we recommend that you at least read through the "hard-way" steps to understand what RESCRIPt is doing and how the SILVA taxonomy is parsed.

Getting SILVA data: Hard Mode.


The gritty details

Download SILVA files

First, we’ll need to go to the SILVA v138.2 archive to obtain:

The following taxonomy files:

tax_slv_ssu_138.2.txt.gz
taxmap_slv_ssu_ref_nr_138.2.txt.gz
tax_slv_ssu_138.2.tre.gz

The sequence file:

SILVA_138.2_SSURef_NR99_tax_silva_trunc.fasta.gz

You can download the files through your browser directly. We’ll make use of wget to download the files from the command line, then gunzip these files prior to importing into QIIME 2.

Download the Taxonomy Rank file. This maps the taxonomic rank and taxonomy to the taxid.

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.2/Exports/taxonomy/tax_slv_ssu_138.2.txt.gz

gunzip tax_slv_ssu_138.2.txt.gz

Download the Taxonomy Map file. This maps the sequence Accessions to the Organism Name and Taxonomy IDs.

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.2/Exports/taxonomy/taxmap_slv_ssu_ref_nr_138.2.txt.gz

gunzip taxmap_slv_ssu_ref_nr_138.2.txt.gz

Download the Taxonomy Tree file. This file contains the hierarchical relationship of the taxonomy IDs in tree form.

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.2/Exports/taxonomy/tax_slv_ssu_138.2.tre.gz

gunzip tax_slv_ssu_138.2.tre.gz

Download the SILVA NR99 sequences (non-redundant and unaligned).

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.2/Exports/SILVA_138.2_SSURef_NR99_tax_silva_trunc.fasta.gz

gunzip SILVA_138.2_SSURef_NR99_tax_silva_trunc.fasta.gz
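
The four download-and-decompress steps above can also be gathered into one small script. This is only a sketch, assuming the same release number and URLs used in the commands above; the `silva_urls` function and the `FETCH` switch are hypothetical names introduced here for illustration. By default it just prints the URLs so you can check them before downloading anything.

```shell
# Sketch: list (and optionally fetch) the four SILVA 138.2 files used in this tutorial.
set -eu

release="138.2"
base="https://www.arb-silva.de/fileadmin/silva_databases/release_${release}/Exports"

# Print one download URL per line (same URLs as the wget commands above).
silva_urls() {
    printf '%s\n' \
        "${base}/taxonomy/tax_slv_ssu_${release}.txt.gz" \
        "${base}/taxonomy/taxmap_slv_ssu_ref_nr_${release}.txt.gz" \
        "${base}/taxonomy/tax_slv_ssu_${release}.tre.gz" \
        "${base}/SILVA_${release}_SSURef_NR99_tax_silva_trunc.fasta.gz"
}

# FETCH is a hypothetical switch for this sketch: unset (or 0) only prints
# the URLs; FETCH=1 downloads each file and then decompresses it.
if [ "${FETCH:-0}" = "1" ]; then
    for url in $(silva_urls); do
        wget "$url"
        gunzip "$(basename "$url")"
    done
else
    silva_urls
fi
```

Saved as a script, it could be run once to review the URLs and again with `FETCH=1` to download. Because of `set -e`, a failed download stops the script before `gunzip` is attempted on a missing file.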

Import SILVA files into QIIME 2

Import the Taxonomy Rank file:

qiime tools import \
    --type 'FeatureData[SILVATaxonomy]' \
    --input-path tax_slv_ssu_138.2.txt \
    --output-path taxranks-silva-138.2-ssu-nr99.qza

Import the Taxonomy Mapping file:

qiime tools import \
    --type 'FeatureData[SILVATaxidMap]' \
    --input-path taxmap_slv_ssu_ref_nr_138.2.txt \
    --output-path taxmap-silva-138.2-ssu-nr99.qza

Import the Taxonomy Hierarchy Tree file:

qiime tools import \
    --type 'Phylogeny[Rooted]' \
    --input-path tax_slv_ssu_138.2.tre \
    --output-path taxtree-silva-138.2-nr99.qza

Import the sequence file:

qiime tools import \
    --type 'FeatureData[RNASequence]' \
    --input-path SILVA_138.2_SSURef_NR99_tax_silva_trunc.fasta \
    --output-path silva-138.2-ssu-nr99-rna-seqs.qza

Note, the data exist within SILVA as RNA sequences, and thus have been imported as FeatureData[RNASequence]. To make sure things run smoothly downstream we'll convert the data to FeatureData[DNASequence] like so:

qiime rescript reverse-transcribe \
    --i-rna-sequences silva-138.2-ssu-nr99-rna-seqs.qza \
    --o-dna-sequences silva-138.2-ssu-nr99-seqs.qza
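
To make the RNA-to-DNA conversion concrete, here is a minimal illustration on a plain FASTA stream. It assumes the conversion amounts to replacing U with T on sequence lines (my understanding of the action, not a description of its internals), and the `rna_to_dna` helper is a hypothetical name for this sketch. It does not operate on QIIME 2 artifacts, so it is not a substitute for the `reverse-transcribe` command above.

```shell
# Illustration only: swap U for T on sequence lines, leaving FASTA header lines untouched.
rna_to_dna() {
    awk '/^>/ { print; next } { gsub(/U/, "T"); gsub(/u/, "t"); print }'
}

# A made-up two-line FASTA record, piped through the helper.
printf '>seq1 Bacteria;Proteobacteria\nGAUCCUGGCUCAG\n' | rna_to_dna
```

The header line passes through unchanged and the sequence line comes back with every U replaced by T.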

We are now ready to proceed with making our SILVA reference database within QIIME 2. First, we’ll need to prepare the SILVA taxonomy with parse-silva-taxonomy. You can optionally include the --p-include-species-labels flag, but be wary: some species label annotations may be spurious! See the caveats about using species labels under the hidden menu labeled "Species-labels: caveat emptor!" below.

qiime rescript parse-silva-taxonomy \
    --i-taxonomy-tree taxtree-silva-138.2-nr99.qza \
    --i-taxonomy-map taxmap-silva-138.2-ssu-nr99.qza \
    --i-taxonomy-ranks taxranks-silva-138.2-ssu-nr99.qza \
    --o-taxonomy silva-138.2-ssu-nr99-tax.qza

Great work! We now have properly formatted, QIIME 2 compatible SILVA taxonomy and sequence files. We’ll use these for the remaining downstream steps.

id	domain (d__)	superkingdom (sk__)	kingdom (k__)	subkingdom (ks__)	superphylum (sp__)	phylum (p__)	subphylum (ps__)	infraphylum (pi__)	superclass (sc__)	class (c__)	subclass (cs__)	infraclass (ci__)	superorder (so__)	order (o__)	suborder (os__)	superfamily (sf__)	family (f__)	subfamily (fs__)	genus (g__)
AB671439.1.2071	d__Eukaryota	sk__Nucletmycea	k__Fungi	ks__Dikarya	sp__	p__Ascomycota	ps__Pezizomycotina	pi__	sc__	c__	cs__	ci__	so__	o__	os__	sf__	f__	fs__	g__
Z27393.1.1722	d__Eukaryota	sk__Nucletmycea	k__Fungi	ks__Dikarya	sp__	p__Ascomycota	ps__Taphrinomycotina	pi__	sc__	c__	cs__	ci__	so__	o__	os__	sf__	f__	fs__	g__
KY886363.1.1791	d__Eukaryota	sk__Holozoa	k__Animalia	ks__	sp__Lophotrochozoa	p__Rotifera	ps__	pi__	sc__	c__Monogononta	cs__	ci__	so__	o__Ploimida	os__	sf__	f__	fs__	g__
AJ544654.1.1712	d__Eukaryota	sk__	k__Stramenopiles	ks__	sp__Ochrophyta	p__Diatomea	ps__Bacillariophytina	pi__	sc__	c__Bacillariophyceae	cs__	ci__	so__	o__	os__	sf__	f__Sellaphoraceae	fs__	g__Sellaphora
A16379.1.1485	d__Bacteria	sk__	k__	ks__	sp__	p__Proteobacteria	ps__	pi__	sc__	c__Gammaproteobacteria	cs__	ci__	so__	o__Pasteurellales	os__	sf__	f__Pasteurellaceae	fs__	g__Haemophilus
AJ888030.1.996	d__Archaea	sk__	k__	ks__	sp__	p__Crenarchaeota	ps__	pi__	sc__	c__Thermoprotei	cs__	ci__	so__	o__Sulfolobales	os__	sf__	f__Sulfolobaceae	fs__	g__Acidianus

Accession	Taxonomy
AB680788.1.1466	d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Salmonella; s__Salmonella_enterica
AB680791.1.1466	d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Salmonella; s__Salmonella_enterica
...	...
CZLR01000032.33487.34870	d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia-Shigella; s__Salmonella_enterica
KM244788.1.1511	d__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; o__Enterobacterales; f__Enterobacteriaceae; g__Escherichia-Shigella; s__uncultured_Salmonella
...	...
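
Some rows in the table above carry a species label that disagrees with the genus (for example, g__Escherichia-Shigella paired with s__Salmonella_enterica). A crude way to surface such rows is sketched below, assuming a two-column, tab-separated file laid out like that table. The `flag_genus_species_mismatch` helper is a hypothetical name, and the check (the genus name must appear somewhere in the species label) is only a heuristic that will also flag some legitimate labels.

```shell
# Sketch: print accession, genus, and species for rows whose species label
# does not contain the genus name. Input: header line, then "id<TAB>taxonomy".
flag_genus_species_mismatch() {
    awk -F '\t' 'NR > 1 {
        genus = ""; species = ""
        n = split($2, ranks, /; */)
        for (i = 1; i <= n; i++) {
            if (ranks[i] ~ /^g__/) genus = substr(ranks[i], 4)
            if (ranks[i] ~ /^s__/) species = substr(ranks[i], 4)
        }
        if (genus != "" && species != "" && index(species, genus) == 0)
            print $1 "\t" genus "\t" species
    }'
}

# Two rows taken from the table above (taxonomy shortened to genus and species).
printf 'Accession\tTaxonomy\nAB680788.1.1466\tg__Salmonella; s__Salmonella_enterica\nKM244788.1.1511\tg__Escherichia-Shigella; s__uncultured_Salmonella\n' \
    | flag_genus_species_mismatch
```

Only the second row is reported, since "uncultured_Salmonella" does not contain "Escherichia-Shigella".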

id	replacements
g__Salmonella;	g__Escherichia-Shigella;
s__uncultured_Salmonella	s__uncultured_Escherichia-Shigella

id	d__	sk__	k__	ks__	sp__	p__	ps__	pi__	sc__	c__	cs__	ci__	so__	o__	os__	sf__	f__	fs__	g__
AB671439.1.2071	d__Eukaryota	sk__Nucletmycea	k__Fungi	ks__Dikarya	sp__Dikarya	p__Ascomycota	ps__Pezizomycotina	pi__Pezizomycotina	sc__Pezizomycotina	c__Pezizomycotina	cs__Pezizomycotina	ci__Pezizomycotina	so__Pezizomycotina	o__Pezizomycotina	os__Pezizomycotina	sf__Pezizomycotina	f__Pezizomycotina	fs__Pezizomycotina	g__Pezizomycotina
Z27393.1.1722	d__Eukaryota	sk__Nucletmycea	k__Fungi	ks__Dikarya	sp__Dikarya	p__Ascomycota	ps__Taphrinomycotina	pi__Taphrinomycotina	sc__Taphrinomycotina	c__Taphrinomycotina	cs__Taphrinomycotina	ci__Taphrinomycotina	so__Taphrinomycotina	o__Taphrinomycotina	os__Taphrinomycotina	sf__Taphrinomycotina	f__Taphrinomycotina	fs__Taphrinomycotina	g__Taphrinomycotina
KY886363.1.1791	d__Eukaryota	sk__Holozoa	k__Animalia	ks__Animalia	sp__Lophotrochozoa	p__Rotifera	ps__Rotifera	pi__Rotifera	sc__Rotifera	c__Monogononta	cs__Monogononta	ci__Monogononta	so__Monogononta	o__Ploimida	os__Ploimida	sf__Ploimida	f__Ploimida	fs__Ploimida	g__Ploimida
AJ544654.1.1712	d__Eukaryota	sk__Eukaryota	k__Stramenopiles	ks__Stramenopiles	sp__Ochrophyta	p__Diatomea	ps__Bacillariophytina	pi__Bacillariophytina	sc__Bacillariophytina	c__Bacillariophyceae	cs__Bacillariophyceae	ci__Bacillariophyceae	so__Bacillariophyceae	o__Bacillariophyceae	os__Bacillariophyceae	sf__Bacillariophyceae	f__Sellaphoraceae	fs__Sellaphoraceae	g__Sellaphora
A16379.1.1485	d__Bacteria	sk__Bacteria	k__Bacteria	ks__Bacteria	sp__Bacteria	p__Proteobacteria	ps__Proteobacteria	pi__Proteobacteria	sc__Proteobacteria	c__Gammaproteobacteria	cs__Gammaproteobacteria	ci__Gammaproteobacteria	so__Gammaproteobacteria	o__Pasteurellales	os__Pasteurellales	sf__Pasteurellales	f__Pasteurellaceae	fs__Pasteurellaceae	g__Haemophilus
AJ888030.1.996	d__Archaea	sk__Archaea	k__Archaea	ks__Archaea	sp__Archaea	p__Crenarchaeota	ps__Crenarchaeota	pi__Crenarchaeota	sc__Crenarchaeota	c__Thermoprotei	cs__Thermoprotei	ci__Thermoprotei	so__Thermoprotei	o__Sulfolobales	os__Sulfolobales	sf__Sulfolobales	f__Sulfolobaceae	fs__Sulfolobaceae	g__Acidianus
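
Comparing the two rank tables above, the second differs from the first only in that each empty rank has been filled with the nearest named rank to its left. That fill-forward step can be sketched on a tab-separated table as follows. This is an illustration of the pattern visible in the tables, not the plugin's own code, and the `propagate_ranks` helper is a hypothetical name.

```shell
# Sketch: fill each empty rank (prefix only, e.g. "sk__") with the last
# named taxon seen to its left on the same row. The header row is passed through.
propagate_ranks() {
    awk 'BEGIN { FS = OFS = "\t" }
    NR == 1 { print; next }
    {
        last = ""
        for (i = 2; i <= NF; i++) {
            sep = index($i, "__")
            if (sep == 0) continue
            prefix = substr($i, 1, sep + 1)   # e.g. "sk__"
            name = substr($i, sep + 2)        # taxon name, possibly empty
            if (name == "") $i = prefix last
            else last = name
        }
        print
    }'
}

# A shortened version of the A16379.1.1485 row from the first table.
printf 'id\td__\tsk__\tk__\tp__\nA16379.1.1485\td__Bacteria\tsk__\tk__\tp__Proteobacteria\n' \
    | propagate_ranks
```

For this row, sk__ and k__ both become "Bacteria", matching the corresponding cells in the second table.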
