Using RESCRIPt's 'extract-seq-segments' to extract reference sequences without PCR primer pairs.

:construction: Please consider this tutorial a living document, which may change based upon community feedback and ongoing plugin development. Please feel free to ask questions and provide feedback. Happy :qiime2:ing!


If you use RESCRIPt or any RESCRIPt-processed data in your research, please cite the following:

  • Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581.; doi: 10.1371/journal.pcbi.1009581

To install:

The problem:

Often we wish to obtain reference sequences that spans our sequence-region of interest, from an online database such as GenBank. Often this is done by using PCR primer pairs to search for and extract the appropriate sequence segments from a larger sequence record. However, our primer sequences may not be contained within the sequences we retrieve, many of which merely being sequence segments of full-length gene sequences themselves:

As you can see, our amplicon region of interest may be within the sequence set that we've downloaded (blue, green, purple). However, we'll fail to retrieve some of these sequences (blue, orange, & red) when relying solely on PCR primer pair search / extraction. That is, a reference sequence might not contain any of the primers (blue), only contain one of our primers (orange & red), and some that span most of the region of interest, which may still be useful to keep, despite containing only one of the primers (orange).

Fortunately, there is a simple solution to our problem: use the sequences we obtained from a sequence repository as queries against our curated set of reference-sequence-segments (e.g. reference amplicon sequences). When any portion of our query sufficiently aligns to our reference segments, then that segment of our query, or query-segments, will be written to file.

Here we'll use the --qsegout (query segments out) command of vsearch to find sequences that have matching segments to our region of interest. Here we'll provide an example of how RESCRIPt, can make the process of making a custimized reference database much less onerous in these cases.

Expanding our reference pool :swimming_woman:

Another advantage to extracting query-segments, is the ability to widen our pool of reference sequences to construct a robust reference database. That is, if our starting pool of sequence-segments is not very diverse, our ability to extract all relevant query-segments may be sub-optimal. To address this, we can iterate through several query-segment searches to increase the diversity of our reference pool. In brief, we'll use the extracted query-seqments from a prior iteration as our new reference-sequence-segments pool. This new pool will be used to extract a new set of query-segments. We'll provide more details as we proceed through our example below.

Fetch trnL(UAA) gene data :potted_plant: :inbox_tray:

The trnL (Leu) marker gene is commonly used in environmental DNA (eDNA) surveys, e.g. studying the plant diets of feral swine (Robeson et al. 2017), and Bison (Bergmann et al. 2015). This marker gene serves as a great dataset to showcase the new extract-seq-segments action from the RESCRIPt plugin. We'll be extracting the trnL (Leu) amplicon region using the Taberlet et al. 2007 primers (see below). We'll do this by using the new rescript extract-seq-segments action.

Primer name Primer sequence (5' - 3')

We'll use RESCRIPt's get-ncbi-data to fetch only Streptophyta ("green plants") sequences. More information can be found here.

We'll also be sure to exclude a variety of potentially unhelpful reference sequences. We'll do all of this by using the following entrez search term to download Streptophyta. This is a large set of data, if you'd like to use a smaller data set I recommend using a clade within this group.

txid35493[ORGN] AND (trnL OR tRNA-Leu OR trn-L OR trn L) AND (chloroplast[Filter] OR plastid[Filter]) NOT environmental sample[Filter] NOT environmental samples[Filter] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]

Finally, we'll specify a list of taxonomic ranks that we'd like to extract for each sequence we download from GenBank. To make sure that we have a value for each rank, we'll enable --p-rank-propagation. See the note on rank propagation within the RESCRIPt SILVA tutorial.

! qiime rescript get-ncbi-data \
    --p-query "txid35493[ORGN] AND (trnL OR tRNA-Leu OR trn-L OR trn L) AND (chloroplast[Filter] OR plastid[Filter]) NOT environmental sample[Filter] NOT environmental samples[Filter] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]" \
    --p-ranks  kingdom phylum class order family genus species \
    --p-rank-propagation \
    --p-n-jobs 1 \
    --o-sequences trnL-ref-seqs.qza \
    --o-taxonomy trnL-ref-tax.qza \

Dereplicate data :ten: :arrow_forward: :one:

Sequence repositories like GenBank, and others, often contain quite a bit of redundant data. For example, different research groups may generate sequence data for the same set of taxa with similar approaches (i.e. identical primers). We'll dereplicate the sequence data to reduce the size of our database, which will also reduce the time spent on all other downstream database curational steps.

! qiime rescript dereplicate \
    --i-sequences trnL-ref-seqs.qza \
    --i-taxa trnL-ref-tax.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences trnL-ref-seqs-derep.qza \
    --o-dereplicated-taxa trnL-ref-tax-derep.qza

Optional: Generate an initial reference pool of sequence segments using primer-pair search :sailboat:

If you already have a reference segment sequence pool available, you can skip this primer pair extraction step and proceed to the Expanding our reference segment sequence pool portion of this tutorial:

  • For example, if we were working with the 16S rRNA marker gene, we could use the silva-138-99-seqs-515-806.qza file available from the Data resources page, as our reference sequence segments to extract the V4 region from the a set of full-length 16S rRNA sequences.
  • If you are aware of any previously accessible data within the GenBank SRA that might serve as a good reference sequence pool, consider using q2-fondue. In regard to this tutorial you may consider downloading the trnL amplicon sequences from BioProject PRJNA415437. Then clean and process them as you would any amplicon data set. Then use these reads as a starting point for constructing your reference sequence pool.

Otherwise, you can generate an initial reference pool by extracting the region of interest via a PCR primer-pair search. Using the PCR primer-pair search approach, we'll initially try to extract as many segments from the trnL (Leu) gene sequences that we've downloaded from GenBank. We'll obviously lose quite a large amount of data, as not all of the sequences will contain the primer sequences, as discussed above. Again, this can be fore a few reasons:

  • The primers were trimmed / removed from the sequences prior to being deposited in a sequence repository (e.g. GenBank). Thus there will be no "hits" to these sequences using a primer-pair search.
  • The sequence spans a region not targeted by either of the primers.
  • The sequence only contains a match to one primer sequence (e.g. the sequence may have used a primer pair that is offset from the pair you've selected).
  • The sequence is of low quality and finding a match is difficult
! qiime feature-classifier extract-reads \
   --i-sequences trnL-ref-seqs-derep.qza \
   --p-f-primer GGGCAATCCTGAGCCAA \
   --p-n-jobs 8 \
   --o-reads trnL-gh-seq-segments.qza

Removing low-quality sequences :scissors:

We'll proceed to dereplicate and clean our sequences from our target extracted amplicon region. It's quite common for sequence repositories like GenBank to contain sequence data with ambiguous IUPAC nucleotides like N, R, Y, M ..., etc. We can remove such sequences with the cull-seqs action. The extracted trnL amplicon sequences are very short, because of this we'll be a bit more strict and allow for fewer degeneracies. We'll remove any sequences that contain one or more ambiguous bases. Leaving too many ambiguous bases might increase the chances of extracting spurious segments.

In fact, we'll repeatedly remove reads with ambiguous bases after each iteration of the sequence segment extraction steps that we'll discuss below, as there is a chance we'll recruit the same or other low-quality sequence segments. We'll leave the homopolymer setting as is for now. Feel free to change if needed.

We'll also dereplicate after each cycle to reduce the number of redundant searches. That is, extracted segments from larger reads may be identical. Also, this can reduce file size. In most cases, being this diligent at each cycle of preparing the database is not needed. However, as this is a large dataset for a very short amplicon region, we'll attempt to be as thorough as possible here.

! qiime rescript dereplicate \
    --i-sequences trnL-gh-seq-segments.qza \
    --i-taxa trnL-ref-tax-derep.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences trnL-gh-seq-segments-derep.qza \
    --o-dereplicated-taxa trnL-gh-tax-segments-derep.qza
! qiime rescript cull-seqs \
    --i-sequences trnL-gh-seq-segments-derep.qza  \
    --p-n-jobs 8 \
    --p-num-degenerates 1 \
    --p-homopolymer-length 8 \
    --o-clean-sequences trnL-gh-seq-segments-derep-cull.qza

! qiime feature-table tabulate-seqs \
  --i-data trnL-gh-seq-segments-derep-cull.qza \
  --o-visualization trnL-gh-seq-segments-derep-cull.qzv

Expanding our reference segment sequence pool :chart_with_upwards_trend:

We'll now use these cleaned reference sequence segments as our reference-segment-sequences pool. That is, we'll query our retrieved GenBank sequences (the same sequence set from which we initially extracted our target amplicon region) against this PCR primer extracted reference-segment-sequences pool. Any GenBank query sequence that contains a segment that matches to any of our reference-segment-sequences (within a user defined threshold) will be extracted and written to file. This will create a new, and hopefully larger reference-segment-sequence pool, which we will use for another iteration of query segment search and extraction. Again, this approach obviates the need for PCR primer sequences.

We are essentially following the approach originally developed by Edgar (2010 & 2016) as outlined here, and was applied in Robeson et al. 2017 (which we are emulating here).

First iteration of query sequence segment extraction :curly_loop:

We'll use --p-perc-identity 0.7 for our search. Typically, setting 70% - 90% is ideal. In fact, you can change this setting between iterations. For example, starting the first few iterations at 70% and the last few iterations at 90%-95%. The most conservative approach would be to set 90% for each iteration. This will reduce the chances of extracting spurious sequence segments, but will often require more iterations to increase your reference sequence pool.

! qiime rescript extract-seq-segments \
    --i-input-sequences trnL-ref-seqs-derep.qza \
    --i-reference-segment-sequences trnL-gh-seq-segments-derep-cull.qza \
    --p-perc-identity 0.7 \
    --p-min-seq-len 10 \
    --p-threads 8 \
    --o-extracted-sequence-segments trnL-gh-extracted-seq-segments-01.qza \
    --o-unmatched-sequences trnL-gh-unmatched-sequences-01.qza \

Filter again: We'll perform another set clean-up, as with each iteration we'll likely extract sequence segments that are of poor quality or contain many IUPAC ambiguity codes. If we do not remove these at each iteration, especially for these very short trnL gene sequences, we'll increase our retention of very spurious sequence segments.

! qiime rescript dereplicate \
    --i-sequences trnL-gh-extracted-seq-segments-01.qza \
    --i-taxa trnL-ref-tax-derep.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences trnL-gh-extracted-seq-segments-derep-01.qza \
    --o-dereplicated-taxa trnL-gh-extracted-tax-segments-derep-01.qza

! qiime rescript cull-seqs \
    --i-sequences trnL-gh-extracted-seq-segments-derep-01.qza \
    --p-n-jobs 8 \
    --p-num-degenerates 1 \
    --p-homopolymer-length 8 \
    --o-clean-sequences trnL-gh-extracted-seq-segments-derep-cull-01.qza

! qiime feature-table tabulate-seqs \
  --i-data trnL-gh-extracted-seq-segments-derep-cull-01.qza \
  --o-visualization trnL-gh-extracted-seq-segments-derep-cull-01.qzv

If you compare the initial PCR extracted segments from the feature-classifier extract-reads command, and with a little cleaning, we end up with trnL-gh-seq-segments-derep-cull.qzv. Compare the sequence counts of that initial step to the first iteration of rescript extract-seq-segments, i.e. trnL-gh-extracted-seq-segments-derep-cull-01.qzv, again after a little cleaning. We've increased our reference pool of unique seqeunces by several times, compared to the primer search alone!

This might be good enough, but lets try one more iteration. We do this by now using the trnL-gh-extracted-seq-segments-derep-cull-01.qza as our new reference-sequence-segment pool and re-run the same trnL-ref-seqs-derep.qza against it. As you can see each iteration expands our ability to capture more of our target sequence segments of interest.

The second iteration. :loop:

Let's repeat the same sequence of steps using rescript extract-seq-segments. For this iteration we'll use trnL-gh-extracted-seq-segments-derep-cull-01.qza as input to --i-reference-segment-sequences.

! qiime rescript extract-seq-segments \
    --i-input-sequences trnL-ref-seqs-derep.qza \
    --i-reference-segment-sequences trnL-gh-extracted-seq-segments-derep-cull-01.qza \
    --p-perc-identity 0.7 \
    --p-min-seq-len 10 \
    --p-threads 8 \
    --o-extracted-sequence-segments trnL-gh-extracted-seq-segments-02.qza \
    --o-unmatched-sequences trnL-gh-unmatched-sequences-02.qza \

Clean the output...

! qiime rescript dereplicate \
    --i-sequences trnL-gh-extracted-seq-segments-02.qza \
    --i-taxa trnL-ref-tax-derep.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences trnL-gh-extracted-seq-segments-derep-02.qza \
    --o-dereplicated-taxa trnL-gh-extracted-tax-segments-derep-02.qza

! qiime rescript cull-seqs \
    --i-sequences trnL-gh-extracted-seq-segments-derep-02.qza \
    --p-n-jobs 8 \
    --p-num-degenerates 1 \
    --p-homopolymer-length 8 \
    --o-clean-sequences trnL-gh-extracted-seq-segments-derep-cull-02.qza

! qiime feature-table tabulate-seqs \
  --i-data trnL-gh-extracted-seq-segments-derep-cull-02.qza \
  --o-visualization trnL-gh-extracted-seq-segments-derep-cull-02.qzv

:tada: Whoa! Now we have an even greater pool of unique reference sequences. We'll stop here. Note: if you're able to extract sequence segments from ~50-70% of your original query sequences, you likely have enough to stop the segment extraction iterations and move on to other downstream curation tasks. Through each iteration we are retaining sequence segments that are within a 70% match of the reference pool. As we add more references to the pool we increase our ability to match new sequences. Keep in mind you want to balance the acquisition of as many reference sequence segments as you can, against the potential of extracting too many spurious segments that may decrease the effectiveness of your reference database!

Optional curation :thinking:

Filter by sequence length :straight_ruler:

The sequences we find on GenBank may be too short (or too long) for our needs. We can remove such sequences, for our purposes here. We'll make use of RESCRIPt's filter-seqs-length action. Note, if you are interested in length trimming based on taxonomy you can use filter-seqs-length-by-taxon.

Remember, the primer set from Taberlet et al. 2007, and can produce fragments that vary in length from 10 - 143 bases. But recent data in GenBank suggest up-to ~250 bases. Let's filter around that length. Actually, I do not think this command will filter anything in this case, but we'll do it anyway just as a demonstration of a step to consider.

! qiime rescript filter-seqs-length \
    --i-sequences trnL-gh-extracted-seq-segments-derep-cull-02.qza \
    --p-global-min 10 \
    --p-global-max 250 \
    --o-filtered-seqs trnL-gh-extracted-seq-segments-derep-cull-keep-02.qza \
    --o-discarded-seqs trnL-gh-extracted-seq-segments-derep-cull-keep-discard-02.qza

! qiime rescript filter-taxa \
    --i-taxonomy trnL-ref-tax-derep.qza \
    --m-ids-to-keep-file trnL-gh-extracted-seq-segments-derep-cull-keep-02.qza \
    --o-filtered-taxonomy trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza

Let's take a look at what we've got! :eyes:

Let's evaluate our trnL reference database, and generate some visualizations.

! qiime metadata tabulate \
    --m-input-file trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza \
    --o-visualization trnL-gh-extracted-tax-segments-derep-cull-keep-02.qzv
! qiime rescript evaluate-taxonomy \
    --i-taxonomies trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza \
    --o-taxonomy-stats trnL-gh-extracted-tax-segments-derep-cull-keep-eval-02.qzv

! qiime rescript evaluate-seqs \
    --i-sequences trnL-gh-extracted-seq-segments-derep-cull-keep-02.qza \
    --p-kmer-lengths 16 8 4 2 \
    --o-visualization trnL-gh-extracted-seq-segments-derep-cull-keep-eval-02.qzv

As you can see, after all of our processing we have quite a bit of trnL amplicon reference sequences compared to when we started. We've also generated a few basic descriptors of these references too. Let's take these and build a classifier, and evaluate it.

Build and evaluate our classifier :building_construction: :white_check_mark:

We are now ready to construct a classifier for our trnL(Leu) reference sequences. We'll use the evaluate-fit-classifier, as this will not only make our classifier just like qiime feature-classifier fit-classifier-naive-bayes, but will also provide an evaluation of our "best-case estimate" of accuracy (i.e., when all query sequences have one or more known matches within our reference database. See our other tutorials for more details.

Note this will take quite a while to complete! You can opt to skip the commands below and simply train the classifier with feature-classifier fit-classifier-naive-bayes.

! qiime rescript evaluate-fit-classifier \
    --i-sequences trnL-gh-extracted-seq-segments-derep-cull-keep-02.qza  \
    --i-taxonomy trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza \
    --p-n-jobs 2 \
    --o-classifier ncbi-trnL-gh-refseqs-classifier.qza \
    --o-evaluation ncbi-trnL-gh-refseqs-classifier-evaluation.qzv \
    --o-observed-taxonomy ncbi-trnL-gh-refseqs-predicted-taxonomy.qza

! qiime rescript evaluate-taxonomy \
  --i-taxonomies trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza ncbi-trnL-gh-refseqs-predicted-taxonomy.qza \
  --p-labels ref-taxonomy predicted-taxonomy \
  --o-taxonomy-stats trnL-gh-ref-taxonomy-evaluation.qzv

Try other amplicons! :taco:

This approach should work well for a wide array of other amplicons, e.g.:

  • Small-Sub-Unit rRNA (SSU)
  • Large-Sub-Unit rRNA (LSU)
  • 12S rRNA (12S)
  • Cytochrome b (CytoB)
  • Cytochrome oxidase subunit 1 (CO1)
  • Internal Transcribed Spacer (ITS)
  • RuBisCo (rbcL)
  • megakaryocyte-associated tyrosine kinase (matk)
  • etc...