Please consider this tutorial a living document, which may change based upon community feedback and ongoing plugin development. Especially, considering that this is an 'alpha' release for both the documentation and the currently developing qsegout code branch, and associated GitHub Issues 136 and 140. Expect this document to change substantially over the next couple of months! Please feel free to ask questions and provide feedback. Happy ing!
If you use RESCRIPt or any RESCRIPt-processed data in your research, please cite the following:
- Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581.; doi: 10.1371/journal.pcbi.1009581
- Install the latest version of vsearch.
Due to current vsearch version dependency conflicts with other plugins within QIIME 2, we'll need to manually download and install the latest version of vsearch elsewhere on our system, not within our QIIME 2 environment. Do not worry, this is in the process of being fixed for a future release of QIIME 2! Choose the version appropriate for your platform and install. We'll come back to this later in the tutorial.
- Install the qsegout code branch:
$ conda activate qiime2-2022.2 $ pip install git+https://github.com/mikerobeson/[email protected] $ qiime dev refresh-cache $ qiime rescript extract-seq-segments --help
Often we wish to obtain reference sequences that spans our sequence-region of interest, from an online database such as GenBank. Often this is done by using PCR primer pairs to search for and extract the appropriate sequence segments from a larger sequence record. However, our primer sequences may not be contained within the sequences we retrieve, many of which merely being sequence segments of full-length gene sequences themselves:
As you can see, our amplicon region of interest may be within the sequence set that we've downloaded (blue, green, purple). However, we'll fail to retrieve some of these sequences (blue, orange, & red) when relying solely on PCR primer pair search / extraction. That is, a reference sequence might not contain any of the primers (blue), only contain one of our primers (orange & red), and some that span most of the region of interest, which may still be useful to keep, despite containing only one of the primers (orange).
Fortunately, there is a simple solution to our problem: use the sequences we obtained from a sequence repository as queries against our curated set of reference-sequence-segments (e.g. reference amplicon sequences). When any portion of our query sufficiently aligns to our reference segments, then that segment of our query, or query-segments, will be written to file.
Here we'll use the
--qsegout (query segments out) command of vsearch to find sequences that have matching segments to our region of interest. Here we'll provide an example of how RESCRIPt, can make the process of making a custimized reference database much less onerous in these cases.
Another advantage to extracting query-segments, is the ability to widen our pool of reference sequences to construct a robust reference database. That is, if our starting pool of sequence-segments is not very diverse, our ability to extract all relevant query-segments may be sub-optimal. To address this, we can iterate through several query-segment searches to increase the diversity of our reference pool. In brief, we'll use the extracted query-seqments from a prior iteration as our new reference-sequence-segments pool. This new pool will be used to extract a new set of query-segments. We'll provide more details as we proceed through our example below.
The trnL (Leu) marker gene is commonly used in environmental DNA (eDNA) surveys, e.g. studying the plant diets of feral swine (Robeson et al. 2017), and Bison (Bergmann et al. 2015). This marker gene serves as a great dataset to showcase the new
extract-seq-segments action from the RESCRIPt plugin. We'll be extracting the trnL (Leu) amplicon region using the Taberlet et al. 2007 primers (see below). We'll do this by using the new
rescript extract-seq-segments action.
|Primer name||Primer sequence (5' - 3')||Reverse Compliment of Primer|
We'll also be sure to exclude a variety of potentially unhelpful reference sequences. We'll do all of this by using the following entrez search term to download Streptophyta. This is a large set of data, if you'd like to use a smaller data set I recommend using a clade within this group.
txid35493[ORGN] AND (trnL OR tRNA-Leu OR trn-L OR trn L) AND (chloroplast[Filter] OR plastid[Filter]) NOT environmental sample[Filter] NOT environmental samples[Filter] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]
Finally, we'll specify a list of taxonomic ranks that we'd like to extract for each sequence we download from GenBank. To make sure that we have a value for each rank, we'll enable
--p-rank-propagation. See the note on rank propagation within the RESCRIPt SILVA tutorial.
! qiime rescript get-ncbi-data \ --p-query "txid35493[ORGN] AND (trnL OR tRNA-Leu OR trn-L OR trn L) AND (chloroplast[Filter] OR plastid[Filter]) NOT environmental sample[Filter] NOT environmental samples[Filter] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]" \ --p-ranks kingdom phylum class order family genus species \ --p-rank-propagation \ --p-n-jobs 1 \ --o-sequences trnL-ref-seqs.qza \ --o-taxonomy trnL-ref-tax.qza \ --verbose
Sequence repositories like GenBank, and others, often contain quite a bit of redundant data. For example, different research groups may generate sequence data for the same set of taxa with similar approaches (i.e. identical primers). We'll dereplicate the sequence data to reduce the size of our database, which will also reduce the time spent on all other downstream database curational steps.
! qiime rescript dereplicate \ --i-sequences trnL-ref-seqs.qza \ --i-taxa trnL-ref-tax.qza \ --p-mode 'uniq' \ --p-threads 8 \ --o-dereplicated-sequences trnL-ref-seqs-derep.qza \ --o-dereplicated-taxa trnL-ref-tax-derep.qza
If you already have a pool of reference sequence segments you can skip the following
feature-classifier extract-reads step. If not, we'll proceed to generate an initial reference pool by extracting the region of interest via a PCR primer-pair search.
Using the PCR primer-pair search approach, we'll initially try to extract as many segments from the trnL (Leu) gene sequences that we've downloaded from GenBank. We'll obviously lose quite a large amount of data, as not all of the sequences will contain the primer sequences, as discussed above. Again, this can be fore a few reasons:
- The primers were trimmed / removed from the sequences prior to being deposited in a sequence repository (e.g. GenBank). Thus there will be no "hits" to these sequences using a primer-pair search.
- The sequence spans a region not targeted by either of the primers.
- The sequence only contains a match to one primer sequence (e.g. the sequence may have used a primer pair that is offset from the pair you've selected).
- The sequence is of low quality and finding a match is difficult
! qiime feature-classifier extract-reads \ --i-sequences trnL-ref-seqs-derep.qza \ --p-f-primer GGGCAATCCTGAGCCAA \ --p-r-primer CCATTGAGTCTCTGCACCTATC \ --p-n-jobs 8 \ --o-reads trnL-gh-seq-segments.qza
Tip: Additionally, if you are aware of any previously accessible data within the GenBank SRA that might serve as a good reference sequence pool, consider using q2-fondue. In regard to this tutorial you may consider downloading the trnL amplicon sequences from BioProject PRJNA415437. Then clean and process them as you would any amplicon data set. Then use these reads as a starting point for constructing your reference sequence pool.
We'll proceed to dereplicate and clean our sequences from our target extracted amplicon region. It's quite common for sequence repositories like GenBank to contain sequence data with ambiguous IUPAC nucleotides like
M ..., etc. We can remove such sequences with the
cull-seqs action. The extracted trnL amplicon sequences are very short, because of this we'll be a bit more strict and allow for fewer degeneracies. We'll remove any sequences that contain one or more ambiguous bases. Leaving too many ambiguous bases might increase the chances of extracting spurious segments.
In fact, we'll repeatedly remove reads with ambiguous bases after each iteration of the sequence segment extraction steps that we'll discuss below, as there is a chance we'll recruit the same or other low-quality sequence segments. We'll leave the homopolymer setting as is for now. Feel free to change if needed.
We'll also dereplicate after each cycle to reduce the number of redundant searches. That is, extracted segments from larger reads may be identical. Also, this can reduce file size. In most cases, being this diligent at each cycle of preparing the database is not needed. However, as this is a large dataset for a very short amplicon region, we'll attempt to be as thorough as possible here.
! qiime rescript dereplicate \ --i-sequences trnL-gh-seq-segments.qza \ --i-taxa trnL-ref-tax-derep.qza \ --p-mode 'uniq' \ --p-threads 8 \ --o-dereplicated-sequences trnL-gh-seq-segments-derep.qza \ --o-dereplicated-taxa trnL-gh-tax-segments-derep.qza ! qiime rescript cull-seqs \ --i-sequences trnL-gh-seq-segments-derep.qza \ --p-n-jobs 8 \ --p-num-degenerates 1 \ --p-homopolymer-length 8 \ --o-clean-sequences trnL-gh-seq-segments-derep-cull.qza ! qiime feature-table tabulate-seqs \ --i-data trnL-gh-seq-segments-derep-cull.qza \ --o-visualization trnL-gh-seq-segments-derep-cull.qzv
We'll now use these cleaned reference sequence segments as our new reference-sequence-segment pool. That is, we'll query our retrieved GenBank sequences (the same sequence set from which we initially extracted our target amplicon region) against this PCR primer extracted reference-sequence-segment pool. Any GenBank query sequence that contains a segment that matches to any of our reference-sequence-segments (within a user defined threshold) will be extracted and written to file. This will create a new, and hopefully larger reference-sequence-segment pool, which we will use for another iteration of query segment search and extraction. Again, this approach obviates the need for PCR primer sequences.
Remember when we installed that latest version of vsearch? Well, this is where we'll be using it! Currently we need to avoid installing the latest version of vsearch within the QIIME 2 environment as it will cause conflicts with other QIIME 2 plugins. So, we'll have to install vsearch elsewhere on our systems. Again, do not worry this is in the process of being fixed for a future release of QIIME 2!
We'll work around this issue with a little hack. We'll simply run an
export PATH=.. command prior to using this specific action. That is, in the terminal (or jupyter notebook cell) you'd type:
/path-to-vsearch-2.21.1/bin/ is the absolute path to the vsearch
Tip: if you intend to use other plugins that depend upon vsearch, you'll likely want to run them in a fresh terminal. If you are running within a jupyter notebook as shown below (with
%%bashmagic), then the
exportstatement should only apply to that single notebook cell, and not interfere with other notebook cells (they should use the version of vsearch installed with QIIME 2). Otherwise, you can run
which vsearchin the terminal to determine which version of vsearch is being used.
--p-perc-identity 0.7 for our search. Typically, setting 70% - 90% is ideal. In fact, you can change this setting between iterations. For example, starting the first few iterations at 70% and the last few iterations at 90%-95%. The most conservative approach would be to set 90% for each iteration. This will reduce the chances of extracting spurious sequence segments, but will often require more iterations to increase your reference sequence pool.
%%bash export PATH='/path-to-vsearch-2.21.1/bin/':$PATH ! qiime rescript extract-seq-segments \ --i-input-sequences trnL-ref-seqs-derep.qza \ --i-reference-segment-sequences trnL-gh-seq-segments-derep-cull.qza \ --p-perc-identity 0.7 \ --p-min-seq-len 10 \ --p-threads 8 \ --o-extracted-sequence-segments trnL-gh-extracted-seq-segments-01.qza \ --o-unmatched-sequences trnL-gh-unmatched-sequences-01.qza \ --verbose
Filter again: We'll perform another set clean-up, as with each iteration we'll likely extract sequence segments that are of poor quality or contain many IUPAC ambiguity codes. If we do not remove these at each iteration, especially for these very short trnL gene sequences, we'll increase our retention of very spurious sequence segments.
! qiime rescript dereplicate \ --i-sequences trnL-gh-extracted-seq-segments-01.qza \ --i-taxa trnL-ref-tax-derep.qza \ --p-mode 'uniq' \ --p-threads 8 \ --o-dereplicated-sequences trnL-gh-extracted-seq-segments-derep-01.qza \ --o-dereplicated-taxa trnL-gh-extracted-tax-segments-derep-01.qza ! qiime rescript cull-seqs \ --i-sequences trnL-gh-extracted-seq-segments-derep-01.qza \ --p-n-jobs 8 \ --p-num-degenerates 1 \ --p-homopolymer-length 8 \ --o-clean-sequences trnL-gh-extracted-seq-segments-derep-cull-01.qza ! qiime feature-table tabulate-seqs \ --i-data trnL-gh-extracted-seq-segments-derep-cull-01.qza \ --o-visualization trnL-gh-extracted-seq-segments-derep-cull-01.qzv
If you compare the initial PCR extracted segments from the
feature-classifier extract-reads command, and with a little cleaning, we end up with
trnL-gh-seq-segments-derep-cull.qzv. Compare the sequence counts of that initial step to the first iteration of
rescript extract-seq-segments, i.e.
trnL-gh-extracted-seq-segments-derep-cull-01.qzv, again after a little cleaning. We've increased our reference pool of unique seqeunces by several times, compared to the primer search alone!
This might be good enough, but lets try one more iteration. We do this by now using the
trnL-gh-extracted-seq-segments-derep-cull-01.qza as our new reference-sequence-segment pool and re-run the same
trnL-ref-seqs-derep.qza against it. As you can see each iteration expands our ability to capture more of our target sequence segments of interest.
Let's repeat the same sequence of steps using
rescript extract-seq-segments. For this iteration we'll use
trnL-gh-extracted-seq-segments-derep-cull-01.qza as input to
%%bash export PATH='/path-to-vsearch-2.21.1/bin/':$PATH qiime rescript extract-seq-segments \ --i-input-sequences trnL-ref-seqs-derep.qza \ --i-reference-segment-sequences trnL-gh-extracted-seq-segments-derep-cull-01.qza \ --p-perc-identity 0.7 \ --p-min-seq-len 10 \ --p-threads 8 \ --o-extracted-sequence-segments trnL-gh-extracted-seq-segments-02.qza \ --o-unmatched-sequences trnL-gh-unmatched-sequences-02.qza \ --verbose
Clean the output...
! qiime rescript dereplicate \ --i-sequences trnL-gh-extracted-seq-segments-02.qza \ --i-taxa trnL-ref-tax-derep.qza \ --p-mode 'uniq' \ --p-threads 8 \ --o-dereplicated-sequences trnL-gh-extracted-seq-segments-derep-02.qza \ --o-dereplicated-taxa trnL-gh-extracted-tax-segments-derep-02.qza ! qiime rescript cull-seqs \ --i-sequences trnL-gh-extracted-seq-segments-derep-02.qza \ --p-n-jobs 8 \ --p-num-degenerates 1 \ --p-homopolymer-length 8 \ --o-clean-sequences trnL-gh-extracted-seq-segments-derep-cull-02.qza ! qiime feature-table tabulate-seqs \ --i-data trnL-gh-extracted-seq-segments-derep-cull-02.qza \ --o-visualization trnL-gh-extracted-seq-segments-derep-cull-02.qzv
Whoa! Now we have an even greater pool of unique reference sequences. We'll stop here. Note: if you're able to extract sequence segments from ~50-70% of your original query sequences, you likely have enough to stop the segment extraction iterations and move on to other downstream curation tasks. Through each iteration we are retaining sequence segments that are within a 70% match of the reference pool. As we add more references to the pool we increase our ability to match new sequences. Keep in mind you want to balance the acquisition of as many reference sequence segments as you can, against the potential of extracting too many spurious segments that may decrease the effectiveness of your reference database!
The sequences we find on GenBank may be too short (or too long) for our needs. We can remove such sequences, for our purposes here. We'll make use of RESCRIPt's
filter-seqs-length action. Note, if you are interested in length trimming based on taxonomy you can use
Remember, the primer set from Taberlet et al. 2007, and can produce fragments that vary in length from 10 - 143 bases. But recent data in GenBank suggest up-to ~250 bases. Let's filter around that length. Actually, I do not think this command will filter anything in this case, but we'll do it anyway just as a demonstration of a step to consider.
! qiime rescript filter-seqs-length \ --i-sequences trnL-gh-extracted-seq-segments-derep-cull-02.qza \ --p-global-min 10 \ --p-global-max 250 \ --o-filtered-seqs trnL-gh-extracted-seq-segments-derep-cull-keep-02.qza \ --o-discarded-seqs trnL-gh-extracted-seq-segments-derep-cull-keep-discard-02.qza ! qiime rescript filter-taxa \ --i-taxonomy trnL-ref-tax-derep.qza \ --m-ids-to-keep-file trnL-gh-extracted-seq-segments-derep-cull-keep-02.qza \ --o-filtered-taxonomy trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza
Let's evaluate our trnL reference database, and generate some visualizations.
! qiime metadata tabulate \ --m-input-file trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza \ --o-visualization trnL-gh-extracted-tax-segments-derep-cull-keep-02.qzv ! qiime rescript evaluate-taxonomy \ --i-taxonomies trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza \ --o-taxonomy-stats trnL-gh-extracted-tax-segments-derep-cull-keep-eval-02.qzv ! qiime rescript evaluate-seqs \ --i-sequences trnL-gh-extracted-seq-segments-derep-cull-keep-02.qza \ --p-kmer-lengths 16 8 4 2 \ --o-visualization trnL-gh-extracted-seq-segments-derep-cull-keep-eval-02.qzv
As you can see, after all of our processing we have quite a bit of trnL amplicon reference sequences compared to when we started. We've also generated a few basic descriptors of these references too. Let's take these and build a classifier, and evaluate it.
We are now ready to construct a classifier for our trnL(Leu) reference sequences. We'll use the
evaluate-fit-classifier, as this will not only make our classifier just like
qiime feature-classifier fit-classifier-naive-bayes, but will also provide an evaluation of our "best-case estimate" of accuracy (i.e., when all query sequences have one or more known matches within our reference database. See our other tutorials for more details.
Note this will take quite a while to complete! You can opt to skip the commands below and simply train the classifier with feature-classifier fit-classifier-naive-bayes.
! qiime rescript evaluate-fit-classifier \ --i-sequences trnL-gh-extracted-seq-segments-derep-cull-keep-02.qza \ --i-taxonomy trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza \ --p-n-jobs 2 \ --o-classifier ncbi-trnL-gh-refseqs-classifier.qza \ --o-evaluation ncbi-trnL-gh-refseqs-classifier-evaluation.qzv \ --o-observed-taxonomy ncbi-trnL-gh-refseqs-predicted-taxonomy.qza ! qiime rescript evaluate-taxonomy \ --i-taxonomies trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza ncbi-trnL-gh-refseqs-predicted-taxonomy.qza \ --p-labels ref-taxonomy predicted-taxonomy \ --o-taxonomy-stats trnL-gh-ref-taxonomy-evaluation.qzv
This approach should work well for a wide array of other amplicons, e.g.:
- Small-Sub-Unit rRNA (SSU)
- Large-Sub-Unit rRNA (LSU)
- 12S rRNA (12S)
- Cytochrome b (CytoB)
- Cytochrome oxidase subunit 1 (CO1)
- Internal Transcribed Spacer (ITS)
- RuBisCo (rbcL)
- megakaryocyte-associated tyrosine kinase (matk)