Please consider this tutorial a living document, which may change based upon community feedback and ongoing plugin development. Please feel free to ask questions and provide feedback. Happy :qiime2:ing!
Citation:
If you use either RESCRIPt and/or vsearch to process your data, as outlined in this tutorial, please cite the following:
- Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581.; doi: 10.1371/journal.pcbi.1009581
- Torbjørn Rognes, Tomáš Flouri, Ben Nichols, Christopher Quince, and Frédéric Mahé. 2016. “VSEARCH: A Versatile Open Source Tool for Metagenomics.” PeerJ 4 (October): e2584. doi: 10.7717/peerj.2584.
To install:
- Make sure you are using QIIME 2 version 2023.2 or later:
- Install the latest version of RESCRIPt.
- Proceed to the tutorial.
The problem:
Often we wish to obtain reference sequences that spans our sequence-region of interest, from an online database such as GenBank. Often this is done by using PCR primer pairs to search for and extract the appropriate sequence segments from a larger sequence record. However, our primer sequences may not be contained within the sequences we retrieve, many of which merely being sequence segments of full-length gene sequences themselves:
As you can see, our amplicon region of interest may be within the sequence set that we've downloaded (blue, green, purple). However, we'll fail to retrieve some of these sequences (blue, orange, & red) when relying solely on PCR primer pair search / extraction. That is, a reference sequence might not contain any of the primers (blue), only contain one of our primers (orange & red), and some that span most of the region of interest, which may still be useful to keep, despite containing only one of the primers (orange).
Fortunately, there is a simple solution to our problem: use the sequences we obtained from a sequence repository as queries against our curated set of reference-sequence-segments (e.g. reference amplicon sequences). When any portion of our query sufficiently aligns to our reference segments, then that segment of our query, or query-segments, will be written to file.
Here we'll use the --qsegout
(query segments out) command of vsearch to find sequences that have matching segments to our region of interest. Here we'll provide an example of how RESCRIPt, can make the process of making a custimized reference database much less onerous in these cases.
Expanding our reference pool
Another advantage to extracting query-segments, is the ability to widen our pool of reference sequences to construct a robust reference database. That is, if our starting pool of sequence-segments is not very diverse, our ability to extract all relevant query-segments may be sub-optimal. To address this, we can iterate through several query-segment searches to increase the diversity of our reference pool. In brief, we'll use the extracted query-seqments from a prior iteration as our new reference-sequence-segments pool. This new pool will be used to extract a new set of query-segments. We'll provide more details as we proceed through our example below.
Fetch trnL(UAA) gene data
The trnL (Leu) marker gene is commonly used in environmental DNA (eDNA) surveys, e.g. studying the plant diets of feral swine (Robeson et al. 2017), and Bison (Bergmann et al. 2015). This marker gene serves as a great dataset to showcase the new extract-seq-segments
action from the RESCRIPt plugin. We'll be extracting the trnL (Leu) amplicon region using the Taberlet et al. 2007 primers (see below). We'll do this by using the new rescript extract-seq-segments
action.
Primer name | Primer sequence (5' - 3') |
---|---|
trnL-g | GGGCAATCCTGAGCCAA |
trnL-h | CCATTGAGTCTCTGCACCTATC |
We'll use RESCRIPt's get-ncbi-data
to fetch only Streptophyta ("green plants") sequences. More information can be found here.
We'll also be sure to exclude a variety of potentially unhelpful reference sequences. We'll do all of this by using the following entrez search term to download Streptophyta. This is a large set of data, if you'd like to use a smaller data set I recommend using a clade within this group.
txid35493[ORGN] AND (trnL OR tRNA-Leu OR trn-L OR trn L) AND (chloroplast[Filter] OR plastid[Filter]) NOT environmental sample[Filter] NOT environmental samples[Filter] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]
Finally, we'll specify a list of taxonomic ranks that we'd like to extract for each sequence we download from GenBank. To make sure that we have a value for each rank, we'll enable --p-rank-propagation
. See the note on rank propagation within the RESCRIPt SILVA tutorial.
! qiime rescript get-ncbi-data \
--p-query "txid35493[ORGN] AND (trnL OR tRNA-Leu OR trn-L OR trn L) AND (chloroplast[Filter] OR plastid[Filter]) NOT environmental sample[Filter] NOT environmental samples[Filter] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]" \
--p-ranks kingdom phylum class order family genus species \
--p-rank-propagation \
--p-n-jobs 1 \
--o-sequences trnL-ref-seqs.qza \
--o-taxonomy trnL-ref-tax.qza \
--verbose
Dereplicate data
Sequence repositories like GenBank, and others, often contain quite a bit of redundant data. For example, different research groups may generate sequence data for the same set of taxa with similar approaches (i.e. identical primers). We'll dereplicate the sequence data to reduce the size of our database, which will also reduce the time spent on all other downstream database curational steps.
! qiime rescript dereplicate \
--i-sequences trnL-ref-seqs.qza \
--i-taxa trnL-ref-tax.qza \
--p-mode 'uniq' \
--p-threads 8 \
--o-dereplicated-sequences trnL-ref-seqs-derep.qza \
--o-dereplicated-taxa trnL-ref-tax-derep.qza
Optional: Generate an initial reference pool of sequence segments using primer-pair search
If you already have a reference segment sequence pool available, you can skip this primer pair extraction step and proceed to the Expanding our reference segment sequence pool portion of this tutorial:
- For example, if we were working with the 16S rRNA marker gene, we could use the silva-138-99-seqs-515-806.qza file available from the Data resources page, as our reference sequence segments to extract the V4 region from the a set of full-length 16S rRNA sequences.
- If you are aware of any previously accessible data within the GenBank SRA that might serve as a good reference sequence pool, consider using q2-fondue. In regard to this tutorial you may consider downloading the trnL amplicon sequences from BioProject PRJNA415437. Then clean and process them as you would any amplicon data set. Then use these reads as a starting point for constructing your reference sequence pool.
Otherwise, you can generate an initial reference pool by extracting the region of interest via a PCR primer-pair search. Using the PCR primer-pair search approach, we'll initially try to extract as many segments from the trnL (Leu) gene sequences that we've downloaded from GenBank. We'll obviously lose quite a large amount of data, as not all of the sequences will contain the primer sequences, as discussed above. Again, this can be for a few reasons:
- The primers were trimmed / removed from the sequences prior to being deposited in a sequence repository (e.g. GenBank). Thus there will be no "hits" to these sequences using a primer-pair search.
- The sequence spans a region not targeted by either of the primers.
- The sequence only contains a match to one primer sequence (e.g. the sequence may have used a primer pair that is offset from the pair you've selected).
- The sequence is of low quality and finding a match is difficult
! qiime feature-classifier extract-reads \
--i-sequences trnL-ref-seqs-derep.qza \
--p-f-primer GGGCAATCCTGAGCCAA \
--p-r-primer CCATTGAGTCTCTGCACCTATC \
--p-min-length 10 \
--p-max-length 250 \
--p-n-jobs 8 \
--o-reads trnL-gh-seq-segments.qza
Tip: Our expected amplicon length is ~ 143 bases, give or take (see later in the tutorial). But it's often a good idea to set some length limits around this to provide some 'breathing room'. How much is debatable, as the length of this gene can vary substantially depending on the target organisms of interest. For example, some organisms can have a mean trnL sequence length upwards of 700 bp! Adjust accordingly.
Removing low-quality sequences
We'll proceed to dereplicate and clean our sequences from our target extracted amplicon region. It's quite common for sequence repositories like GenBank to contain sequence data with ambiguous IUPAC nucleotides like N
, R
, Y
, M
..., etc. We can remove such sequences with the cull-seqs
action. The extracted trnL amplicon sequences are very short, because of this we'll be a bit more strict and allow for fewer degeneracies. We'll remove any sequences that contain one or more ambiguous bases. Leaving too many ambiguous bases might increase the chances of extracting spurious segments.
In fact, we'll repeatedly remove reads with ambiguous bases after each iteration of the sequence segment extraction steps that we'll discuss below, as there is a chance we'll recruit the same or other low-quality sequence segments. We'll leave the homopolymer setting as is for now. Feel free to change if needed.
We'll also dereplicate after each cycle to reduce the number of redundant searches. That is, extracted segments from larger reads may be identical. Also, this can reduce file size. In most cases, being this diligent at each cycle of preparing the database is not needed. However, as this is a large dataset for a very short amplicon region, we'll attempt to be as thorough as possible here.
! qiime rescript dereplicate \
--i-sequences trnL-gh-seq-segments.qza \
--i-taxa trnL-ref-tax-derep.qza \
--p-mode 'uniq' \
--p-threads 8 \
--o-dereplicated-sequences trnL-gh-seq-segments-derep.qza \
--o-dereplicated-taxa trnL-gh-tax-segments-derep.qza
! qiime rescript cull-seqs \
--i-sequences trnL-gh-seq-segments-derep.qza \
--p-n-jobs 8 \
--p-num-degenerates 1 \
--p-homopolymer-length 8 \
--o-clean-sequences trnL-gh-seq-segments-derep-cull.qza
! qiime feature-table tabulate-seqs \
--i-data trnL-gh-seq-segments-derep-cull.qza \
--o-visualization trnL-gh-seq-segments-derep-cull.qzv
Tip: You can further limit the length of segments you initially extract by adding
--p-min-length
and--p-max-length
to the abovedereplicate
command.
Expanding our reference segment sequence pool
We'll now use these cleaned reference sequence segments as our reference-segment-sequences pool. That is, we'll query our retrieved GenBank sequences (the same sequence set from which we initially extracted our target amplicon region) against this PCR primer extracted reference-segment-sequences pool. Any GenBank query sequence that contains a segment that matches to any of our reference-segment-sequences (within a user defined threshold) will be extracted and written to file. This will create a new, and hopefully larger reference-segment-sequence pool, which we will use for another iteration of query segment search and extraction. Again, this approach obviates the need for PCR primer sequences.
We are essentially following the approach originally developed by Edgar (2010 & 2016) as outlined here, and was applied in Robeson et al. 2017 (which we are emulating here).
First iteration of query sequence segment extraction
We'll use --p-perc-identity 0.7
for our search. Typically, setting 70% - 90% is ideal. In fact, you can change this setting between iterations. For example, starting the first few iterations at 70% and the last few iterations at 90%-95%. The most conservative approach would be to set 90% for each iteration. This will reduce the chances of extracting spurious sequence segments, but will often require more iterations to increase your reference sequence pool.
! qiime rescript extract-seq-segments \
--i-input-sequences trnL-ref-seqs-derep.qza \
--i-reference-segment-sequences trnL-gh-seq-segments-derep-cull.qza \
--p-perc-identity 0.7 \
--p-min-seq-len 10 \
--p-threads 8 \
--o-extracted-sequence-segments trnL-gh-extracted-seq-segments-01.qza \
--o-unmatched-sequences trnL-gh-unmatched-sequences-01.qza \
--verbose
Filter again: We'll perform another set clean-up, as with each iteration we'll likely extract sequence segments that are of poor quality or contain many IUPAC ambiguity codes. If we do not remove these at each iteration, especially for these very short trnL gene sequences, we'll increase our retention of very spurious sequence segments.
! qiime rescript dereplicate \
--i-sequences trnL-gh-extracted-seq-segments-01.qza \
--i-taxa trnL-ref-tax-derep.qza \
--p-mode 'uniq' \
--p-threads 8 \
--o-dereplicated-sequences trnL-gh-extracted-seq-segments-derep-01.qza \
--o-dereplicated-taxa trnL-gh-extracted-tax-segments-derep-01.qza
! qiime rescript cull-seqs \
--i-sequences trnL-gh-extracted-seq-segments-derep-01.qza \
--p-n-jobs 8 \
--p-num-degenerates 1 \
--p-homopolymer-length 8 \
--o-clean-sequences trnL-gh-extracted-seq-segments-derep-cull-01.qza
! qiime feature-table tabulate-seqs \
--i-data trnL-gh-extracted-seq-segments-derep-cull-01.qza \
--o-visualization trnL-gh-extracted-seq-segments-derep-cull-01.qzv
If you compare the initial PCR extracted segments from the feature-classifier extract-reads
command, and with a little cleaning, we end up with trnL-gh-seq-segments-derep-cull.qzv
. Compare the sequence counts of that initial step to the first iteration of rescript extract-seq-segments
, i.e. trnL-gh-extracted-seq-segments-derep-cull-01.qzv
, again after a little cleaning. We've increased our reference pool of unique seqeunces by several times, compared to the primer search alone!
This might be good enough, but lets try one more iteration. We do this by now using the trnL-gh-extracted-seq-segments-derep-cull-01.qza
as our new reference-sequence-segment pool and re-run the same trnL-ref-seqs-derep.qza
against it. As you can see each iteration expands our ability to capture more of our target sequence segments of interest.
The second iteration.
Let's repeat the same sequence of steps using rescript extract-seq-segments
. For this iteration we'll use trnL-gh-extracted-seq-segments-derep-cull-01.qza
as input to --i-reference-segment-sequences
.
! qiime rescript extract-seq-segments \
--i-input-sequences trnL-ref-seqs-derep.qza \
--i-reference-segment-sequences trnL-gh-extracted-seq-segments-derep-cull-01.qza \
--p-perc-identity 0.7 \
--p-min-seq-len 10 \
--p-threads 8 \
--o-extracted-sequence-segments trnL-gh-extracted-seq-segments-02.qza \
--o-unmatched-sequences trnL-gh-unmatched-sequences-02.qza \
--verbose
Clean the output...
! qiime rescript dereplicate \
--i-sequences trnL-gh-extracted-seq-segments-02.qza \
--i-taxa trnL-ref-tax-derep.qza \
--p-mode 'uniq' \
--p-threads 8 \
--o-dereplicated-sequences trnL-gh-extracted-seq-segments-derep-02.qza \
--o-dereplicated-taxa trnL-gh-extracted-tax-segments-derep-02.qza
! qiime rescript cull-seqs \
--i-sequences trnL-gh-extracted-seq-segments-derep-02.qza \
--p-n-jobs 8 \
--p-num-degenerates 1 \
--p-homopolymer-length 8 \
--o-clean-sequences trnL-gh-extracted-seq-segments-derep-cull-02.qza
! qiime feature-table tabulate-seqs \
--i-data trnL-gh-extracted-seq-segments-derep-cull-02.qza \
--o-visualization trnL-gh-extracted-seq-segments-derep-cull-02.qzv
Whoa! Now we have an even greater pool of unique reference sequences. We'll stop here. Note: if you're able to extract sequence segments from ~50-70% of your original query sequences, you likely have enough to stop the segment extraction iterations and move on to other downstream curation tasks. Through each iteration we are retaining sequence segments that are within a 70% match of the reference pool. As we add more references to the pool we increase our ability to match new sequences. Keep in mind you want to balance the acquisition of as many reference sequence segments as you can, against the potential of extracting too many spurious segments that may decrease the effectiveness of your reference database!
Optional curation
Filter by sequence length
The sequences we find on GenBank may be too short (or too long) for our needs. We can remove such sequences, for our purposes here. We'll make use of RESCRIPt's filter-seqs-length
action. Note, if you are interested in length trimming based on taxonomy you can use filter-seqs-length-by-taxon
.
Remember, the primer set from Taberlet et al. 2007, and can produce fragments that vary in length from 10 - 143 bases. But recent data in GenBank suggest up-to ~250 bases. Let's filter around that length, for the sake of example. Again, be warned that some organisms can have a mean trnL sequence length upwards of 700 bp! Adjust length trimming accordingly!
! qiime rescript filter-seqs-length \
--i-sequences trnL-gh-extracted-seq-segments-derep-cull-02.qza \
--p-global-min 10 \
--p-global-max 250 \
--o-filtered-seqs trnL-gh-extracted-seq-segments-derep-cull-keep-02.qza \
--o-discarded-seqs trnL-gh-extracted-seq-segments-derep-cull-keep-discard-02.qza
! qiime rescript filter-taxa \
--i-taxonomy trnL-ref-tax-derep.qza \
--m-ids-to-keep-file trnL-gh-extracted-seq-segments-derep-cull-keep-02.qza \
--o-filtered-taxonomy trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza
Let's take a look at what we've got!
Let's evaluate our trnL reference database, and generate some visualizations.
! qiime metadata tabulate \
--m-input-file trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza \
--o-visualization trnL-gh-extracted-tax-segments-derep-cull-keep-02.qzv
! qiime rescript evaluate-taxonomy \
--i-taxonomies trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza \
--o-taxonomy-stats trnL-gh-extracted-tax-segments-derep-cull-keep-eval-02.qzv
! qiime rescript evaluate-seqs \
--i-sequences trnL-gh-extracted-seq-segments-derep-cull-keep-02.qza \
--p-kmer-lengths 16 8 4 2 \
--o-visualization trnL-gh-extracted-seq-segments-derep-cull-keep-eval-02.qzv
As you can see, after all of our processing we have quite a bit of trnL amplicon reference sequences compared to when we started. We've also generated a few basic descriptors of these references too. Let's take these and build a classifier, and evaluate it.
Build and evaluate our classifier
We are now ready to construct a classifier for our trnL(Leu) reference sequences. We'll use the evaluate-fit-classifier
, as this will not only make our classifier just like qiime feature-classifier fit-classifier-naive-bayes
, but will also provide an evaluation of our "best-case estimate" of accuracy (i.e., when all query sequences have one or more known matches within our reference database. See our other tutorials for more details.
Note this will take quite a while to complete! You can opt to skip the commands below and simply train the classifier with feature-classifier fit-classifier-naive-bayes.
! qiime rescript evaluate-fit-classifier \
--i-sequences trnL-gh-extracted-seq-segments-derep-cull-keep-02.qza \
--i-taxonomy trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza \
--p-n-jobs 2 \
--o-classifier ncbi-trnL-gh-refseqs-classifier.qza \
--o-evaluation ncbi-trnL-gh-refseqs-classifier-evaluation.qzv \
--o-observed-taxonomy ncbi-trnL-gh-refseqs-predicted-taxonomy.qza
! qiime rescript evaluate-taxonomy \
--i-taxonomies trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza ncbi-trnL-gh-refseqs-predicted-taxonomy.qza \
--p-labels ref-taxonomy predicted-taxonomy \
--o-taxonomy-stats trnL-gh-ref-taxonomy-evaluation.qzv
Try other amplicons!
This approach should work well for a wide array of other amplicons, e.g.:
- Small-Sub-Unit rRNA (SSU)
- Large-Sub-Unit rRNA (LSU)
- 12S rRNA (12S)
- Cytochrome b (CytoB)
- Cytochrome oxidase subunit 1 (CO1)
- Internal Transcribed Spacer (ITS)
- RuBisCo (rbcL)
- megakaryocyte-associated tyrosine kinase (matk)
- etc...