Using RESCRIPt's 'extract-seq-segments' to extract reference sequences without PCR primer pairs.

SoilRotifer · July 13, 2022, 4:18pm

Please consider this tutorial a living document, which may change based upon community feedback and ongoing plugin development. Please feel free to ask questions and provide feedback. Happy :qiime2:ing!

Citation:

If you use either RESCRIPt and/or vsearch to process your data, as outlined in this tutorial, please cite the following:

Michael S Robeson II, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich. 2021. "RESCRIPt: Reproducible sequence taxonomy reference database management". PLoS Computational Biology 17 (11): e1009581.; doi: 10.1371/journal.pcbi.1009581
Torbjørn Rognes, Tomáš Flouri, Ben Nichols, Christopher Quince, and Frédéric Mahé. 2016. “VSEARCH: A Versatile Open Source Tool for Metagenomics.” PeerJ 4 (October): e2584. doi: 10.7717/peerj.2584.

To install:

Make sure you are using QIIME 2 version 2023.2 or later:
- Install the latest version of RESCRIPt.
- Proceed to the tutorial.

The problem:

Often we wish to obtain reference sequences that spans our sequence-region of interest, from an online database such as GenBank. Often this is done by using PCR primer pairs to search for and extract the appropriate sequence segments from a larger sequence record. However, our primer sequences may not be contained within the sequences we retrieve, many of which merely being sequence segments of full-length gene sequences themselves:

As you can see, our amplicon region of interest may be within the sequence set that we've downloaded (blue, green, purple). However, we'll fail to retrieve some of these sequences (blue, orange, & red) when relying solely on PCR primer pair search / extraction. That is, a reference sequence might not contain any of the primers (blue), only contain one of our primers (orange & red), and some that span most of the region of interest, which may still be useful to keep, despite containing only one of the primers (orange).

Fortunately, there is a simple solution to our problem: use the sequences we obtained from a sequence repository as queries against our curated set of reference-sequence-segments (e.g. reference amplicon sequences). When any portion of our query sufficiently aligns to our reference segments, then that segment of our query, or query-segments, will be written to file.

Here we'll use the --qsegout (query segments out) command of vsearch to find sequences that have matching segments to our region of interest. Here we'll provide an example of how RESCRIPt, can make the process of making a custimized reference database much less onerous in these cases.

Expanding our reference pool

Another advantage to extracting query-segments, is the ability to widen our pool of reference sequences to construct a robust reference database. That is, if our starting pool of sequence-segments is not very diverse, our ability to extract all relevant query-segments may be sub-optimal. To address this, we can iterate through several query-segment searches to increase the diversity of our reference pool. In brief, we'll use the extracted query-seqments from a prior iteration as our new reference-sequence-segments pool. This new pool will be used to extract a new set of query-segments. We'll provide more details as we proceed through our example below.

Fetch trnL(UAA) gene data

The trnL (Leu) marker gene is commonly used in environmental DNA (eDNA) surveys, e.g. studying the plant diets of feral swine (Robeson et al. 2017), and Bison (Bergmann et al. 2015). This marker gene serves as a great dataset to showcase the new extract-seq-segments action from the RESCRIPt plugin. We'll be extracting the trnL (Leu) amplicon region using the Taberlet et al. 2007 primers (see below). We'll do this by using the new rescript extract-seq-segments action.

Primer name	Primer sequence (5' - 3')
trnL-g	GGGCAATCCTGAGCCAA
trnL-h	CCATTGAGTCTCTGCACCTATC

We'll use RESCRIPt's get-ncbi-data to fetch only Streptophyta ("green plants") sequences. More information can be found here.

We'll also be sure to exclude a variety of potentially unhelpful reference sequences. We'll do all of this by using the following entrez search term to download Streptophyta. This is a large set of data, if you'd like to use a smaller data set I recommend using a clade within this group.

txid35493[ORGN] AND (trnL OR tRNA-Leu OR trn-L OR trn L) AND (chloroplast[Filter] OR plastid[Filter]) NOT environmental sample[Filter] NOT environmental samples[Filter] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]

Finally, we'll specify a list of taxonomic ranks that we'd like to extract for each sequence we download from GenBank. To make sure that we have a value for each rank, we'll enable --p-rank-propagation. See the note on rank propagation within the RESCRIPt SILVA tutorial.

! qiime rescript get-ncbi-data \
    --p-query "txid35493[ORGN] AND (trnL OR tRNA-Leu OR trn-L OR trn L) AND (chloroplast[Filter] OR plastid[Filter]) NOT environmental sample[Filter] NOT environmental samples[Filter] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]" \
    --p-ranks  kingdom phylum class order family genus species \
    --p-rank-propagation \
    --p-n-jobs 1 \
    --o-sequences trnL-ref-seqs.qza \
    --o-taxonomy trnL-ref-tax.qza \
    --verbose

Dereplicate data

Sequence repositories like GenBank, and others, often contain quite a bit of redundant data. For example, different research groups may generate sequence data for the same set of taxa with similar approaches (i.e. identical primers). We'll dereplicate the sequence data to reduce the size of our database, which will also reduce the time spent on all other downstream database curational steps.

! qiime rescript dereplicate \
    --i-sequences trnL-ref-seqs.qza \
    --i-taxa trnL-ref-tax.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences trnL-ref-seqs-derep.qza \
    --o-dereplicated-taxa trnL-ref-tax-derep.qza

Optional: Generate an initial reference pool of sequence segments using primer-pair search

If you already have a reference segment sequence pool available, you can skip this primer pair extraction step and proceed to the Expanding our reference segment sequence pool portion of this tutorial:

For example, if we were working with the 16S rRNA marker gene, we could use the silva-138-99-seqs-515-806.qza file available from the Data resources page, as our reference sequence segments to extract the V4 region from the a set of full-length 16S rRNA sequences.
If you are aware of any previously accessible data within the GenBank SRA that might serve as a good reference sequence pool, consider using q2-fondue. In regard to this tutorial you may consider downloading the trnL amplicon sequences from BioProject PRJNA415437. Then clean and process them as you would any amplicon data set. Then use these reads as a starting point for constructing your reference sequence pool.

Otherwise, you can generate an initial reference pool by extracting the region of interest via a PCR primer-pair search. Using the PCR primer-pair search approach, we'll initially try to extract as many segments from the trnL (Leu) gene sequences that we've downloaded from GenBank. We'll obviously lose quite a large amount of data, as not all of the sequences will contain the primer sequences, as discussed above. Again, this can be for a few reasons:

The primers were trimmed / removed from the sequences prior to being deposited in a sequence repository (e.g. GenBank). Thus there will be no "hits" to these sequences using a primer-pair search.
The sequence spans a region not targeted by either of the primers.
The sequence only contains a match to one primer sequence (e.g. the sequence may have used a primer pair that is offset from the pair you've selected).
The sequence is of low quality and finding a match is difficult

! qiime feature-classifier extract-reads \
   --i-sequences trnL-ref-seqs-derep.qza \
   --p-f-primer GGGCAATCCTGAGCCAA \
   --p-r-primer CCATTGAGTCTCTGCACCTATC \
   --p-min-length 10 \
   --p-max-length 250 \
   --p-n-jobs 8 \
   --o-reads trnL-gh-seq-segments.qza

Tip: Our expected amplicon length is ~ 143 bases, give or take (see later in the tutorial). But it's often a good idea to set some length limits around this to provide some 'breathing room'. How much is debatable, as the length of this gene can vary substantially depending on the target organisms of interest. For example, some organisms can have a mean trnL sequence length upwards of 700 bp! Adjust accordingly.

Removing low-quality sequences

We'll proceed to dereplicate and clean our sequences from our target extracted amplicon region. It's quite common for sequence repositories like GenBank to contain sequence data with ambiguous IUPAC nucleotides like N, R, Y, M ..., etc. We can remove such sequences with the cull-seqs action. The extracted trnL amplicon sequences are very short, because of this we'll be a bit more strict and allow for fewer degeneracies. We'll remove any sequences that contain one or more ambiguous bases. Leaving too many ambiguous bases might increase the chances of extracting spurious segments.

In fact, we'll repeatedly remove reads with ambiguous bases after each iteration of the sequence segment extraction steps that we'll discuss below, as there is a chance we'll recruit the same or other low-quality sequence segments. We'll leave the homopolymer setting as is for now. Feel free to change if needed.

We'll also dereplicate after each cycle to reduce the number of redundant searches. That is, extracted segments from larger reads may be identical. Also, this can reduce file size. In most cases, being this diligent at each cycle of preparing the database is not needed. However, as this is a large dataset for a very short amplicon region, we'll attempt to be as thorough as possible here.

! qiime rescript dereplicate \
    --i-sequences trnL-gh-seq-segments.qza \
    --i-taxa trnL-ref-tax-derep.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences trnL-gh-seq-segments-derep.qza \
    --o-dereplicated-taxa trnL-gh-tax-segments-derep.qza
    
! qiime rescript cull-seqs \
    --i-sequences trnL-gh-seq-segments-derep.qza  \
    --p-n-jobs 8 \
    --p-num-degenerates 1 \
    --p-homopolymer-length 8 \
    --o-clean-sequences trnL-gh-seq-segments-derep-cull.qza

! qiime feature-table tabulate-seqs \
  --i-data trnL-gh-seq-segments-derep-cull.qza \
  --o-visualization trnL-gh-seq-segments-derep-cull.qzv

Tip: You can further limit the length of segments you initially extract by adding --p-min-length and --p-max-length to the above dereplicate command.

Expanding our reference segment sequence pool

We'll now use these cleaned reference sequence segments as our reference-segment-sequences pool. That is, we'll query our retrieved GenBank sequences (the same sequence set from which we initially extracted our target amplicon region) against this PCR primer extracted reference-segment-sequences pool. Any GenBank query sequence that contains a segment that matches to any of our reference-segment-sequences (within a user defined threshold) will be extracted and written to file. This will create a new, and hopefully larger reference-segment-sequence pool, which we will use for another iteration of query segment search and extraction. Again, this approach obviates the need for PCR primer sequences.

We are essentially following the approach originally developed by Edgar (2010 & 2016) as outlined here, and was applied in Robeson et al. 2017 (which we are emulating here).

First iteration of query sequence segment extraction

We'll use --p-perc-identity 0.7 for our search. Typically, setting 70% - 90% is ideal. In fact, you can change this setting between iterations. For example, starting the first few iterations at 70% and the last few iterations at 90%-95%. The most conservative approach would be to set 90% for each iteration. This will reduce the chances of extracting spurious sequence segments, but will often require more iterations to increase your reference sequence pool.

! qiime rescript extract-seq-segments \
    --i-input-sequences trnL-ref-seqs-derep.qza \
    --i-reference-segment-sequences trnL-gh-seq-segments-derep-cull.qza \
    --p-perc-identity 0.7 \
    --p-min-seq-len 10 \
    --p-threads 8 \
    --o-extracted-sequence-segments trnL-gh-extracted-seq-segments-01.qza \
    --o-unmatched-sequences trnL-gh-unmatched-sequences-01.qza \
    --verbose

Filter again: We'll perform another set clean-up, as with each iteration we'll likely extract sequence segments that are of poor quality or contain many IUPAC ambiguity codes. If we do not remove these at each iteration, especially for these very short trnL gene sequences, we'll increase our retention of very spurious sequence segments.

! qiime rescript dereplicate \
    --i-sequences trnL-gh-extracted-seq-segments-01.qza \
    --i-taxa trnL-ref-tax-derep.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences trnL-gh-extracted-seq-segments-derep-01.qza \
    --o-dereplicated-taxa trnL-gh-extracted-tax-segments-derep-01.qza

! qiime rescript cull-seqs \
    --i-sequences trnL-gh-extracted-seq-segments-derep-01.qza \
    --p-n-jobs 8 \
    --p-num-degenerates 1 \
    --p-homopolymer-length 8 \
    --o-clean-sequences trnL-gh-extracted-seq-segments-derep-cull-01.qza

! qiime feature-table tabulate-seqs \
  --i-data trnL-gh-extracted-seq-segments-derep-cull-01.qza \
  --o-visualization trnL-gh-extracted-seq-segments-derep-cull-01.qzv

If you compare the initial PCR extracted segments from the feature-classifier extract-reads command, and with a little cleaning, we end up with trnL-gh-seq-segments-derep-cull.qzv. Compare the sequence counts of that initial step to the first iteration of rescript extract-seq-segments, i.e. trnL-gh-extracted-seq-segments-derep-cull-01.qzv, again after a little cleaning. We've increased our reference pool of unique seqeunces by several times, compared to the primer search alone!

This might be good enough, but lets try one more iteration. We do this by now using the trnL-gh-extracted-seq-segments-derep-cull-01.qza as our new reference-sequence-segment pool and re-run the same trnL-ref-seqs-derep.qza against it. As you can see each iteration expands our ability to capture more of our target sequence segments of interest.

The second iteration.

Let's repeat the same sequence of steps using rescript extract-seq-segments. For this iteration we'll use trnL-gh-extracted-seq-segments-derep-cull-01.qza as input to --i-reference-segment-sequences.

! qiime rescript extract-seq-segments \
    --i-input-sequences trnL-ref-seqs-derep.qza \
    --i-reference-segment-sequences trnL-gh-extracted-seq-segments-derep-cull-01.qza \
    --p-perc-identity 0.7 \
    --p-min-seq-len 10 \
    --p-threads 8 \
    --o-extracted-sequence-segments trnL-gh-extracted-seq-segments-02.qza \
    --o-unmatched-sequences trnL-gh-unmatched-sequences-02.qza \
    --verbose

Clean the output...

! qiime rescript dereplicate \
    --i-sequences trnL-gh-extracted-seq-segments-02.qza \
    --i-taxa trnL-ref-tax-derep.qza \
    --p-mode 'uniq' \
    --p-threads 8 \
    --o-dereplicated-sequences trnL-gh-extracted-seq-segments-derep-02.qza \
    --o-dereplicated-taxa trnL-gh-extracted-tax-segments-derep-02.qza

! qiime rescript cull-seqs \
    --i-sequences trnL-gh-extracted-seq-segments-derep-02.qza \
    --p-n-jobs 8 \
    --p-num-degenerates 1 \
    --p-homopolymer-length 8 \
    --o-clean-sequences trnL-gh-extracted-seq-segments-derep-cull-02.qza

! qiime feature-table tabulate-seqs \
  --i-data trnL-gh-extracted-seq-segments-derep-cull-02.qza \
  --o-visualization trnL-gh-extracted-seq-segments-derep-cull-02.qzv

Whoa! Now we have an even greater pool of unique reference sequences. We'll stop here. Note: if you're able to extract sequence segments from ~50-70% of your original query sequences, you likely have enough to stop the segment extraction iterations and move on to other downstream curation tasks. Through each iteration we are retaining sequence segments that are within a 70% match of the reference pool. As we add more references to the pool we increase our ability to match new sequences. Keep in mind you want to balance the acquisition of as many reference sequence segments as you can, against the potential of extracting too many spurious segments that may decrease the effectiveness of your reference database!

Optional curation

Filter by sequence length

The sequences we find on GenBank may be too short (or too long) for our needs. We can remove such sequences, for our purposes here. We'll make use of RESCRIPt's filter-seqs-length action. Note, if you are interested in length trimming based on taxonomy you can use filter-seqs-length-by-taxon.

Remember, the primer set from Taberlet et al. 2007, and can produce fragments that vary in length from 10 - 143 bases. But recent data in GenBank suggest up-to ~250 bases. Let's filter around that length, for the sake of example. Again, be warned that some organisms can have a mean trnL sequence length upwards of 700 bp! Adjust length trimming accordingly!

! qiime rescript filter-seqs-length \
    --i-sequences trnL-gh-extracted-seq-segments-derep-cull-02.qza \
    --p-global-min 10 \
    --p-global-max 250 \
    --o-filtered-seqs trnL-gh-extracted-seq-segments-derep-cull-keep-02.qza \
    --o-discarded-seqs trnL-gh-extracted-seq-segments-derep-cull-keep-discard-02.qza

! qiime rescript filter-taxa \
    --i-taxonomy trnL-ref-tax-derep.qza \
    --m-ids-to-keep-file trnL-gh-extracted-seq-segments-derep-cull-keep-02.qza \
    --o-filtered-taxonomy trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza

Let's take a look at what we've got!

Let's evaluate our trnL reference database, and generate some visualizations.

! qiime metadata tabulate \
    --m-input-file trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza \
    --o-visualization trnL-gh-extracted-tax-segments-derep-cull-keep-02.qzv
    
! qiime rescript evaluate-taxonomy \
    --i-taxonomies trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza \
    --o-taxonomy-stats trnL-gh-extracted-tax-segments-derep-cull-keep-eval-02.qzv

! qiime rescript evaluate-seqs \
    --i-sequences trnL-gh-extracted-seq-segments-derep-cull-keep-02.qza \
    --p-kmer-lengths 16 8 4 2 \
    --o-visualization trnL-gh-extracted-seq-segments-derep-cull-keep-eval-02.qzv

As you can see, after all of our processing we have quite a bit of trnL amplicon reference sequences compared to when we started. We've also generated a few basic descriptors of these references too. Let's take these and build a classifier, and evaluate it.

Build and evaluate our classifier

We are now ready to construct a classifier for our trnL(Leu) reference sequences. We'll use the evaluate-fit-classifier, as this will not only make our classifier just like qiime feature-classifier fit-classifier-naive-bayes, but will also provide an evaluation of our "best-case estimate" of accuracy (i.e., when all query sequences have one or more known matches within our reference database. See our other tutorials for more details.

Note this will take quite a while to complete! You can opt to skip the commands below and simply train the classifier with feature-classifier fit-classifier-naive-bayes.

! qiime rescript evaluate-fit-classifier \
    --i-sequences trnL-gh-extracted-seq-segments-derep-cull-keep-02.qza  \
    --i-taxonomy trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza \
    --p-n-jobs 2 \
    --o-classifier ncbi-trnL-gh-refseqs-classifier.qza \
    --o-evaluation ncbi-trnL-gh-refseqs-classifier-evaluation.qzv \
    --o-observed-taxonomy ncbi-trnL-gh-refseqs-predicted-taxonomy.qza

! qiime rescript evaluate-taxonomy \
  --i-taxonomies trnL-gh-extracted-tax-segments-derep-cull-keep-02.qza ncbi-trnL-gh-refseqs-predicted-taxonomy.qza \
  --p-labels ref-taxonomy predicted-taxonomy \
  --o-taxonomy-stats trnL-gh-ref-taxonomy-evaluation.qzv

Try other amplicons!

This approach should work well for a wide array of other amplicons, e.g.:

Small-Sub-Unit rRNA (SSU)
Large-Sub-Unit rRNA (LSU)
12S rRNA (12S)
Cytochrome b (CytoB)
Cytochrome oxidase subunit 1 (CO1)
Internal Transcribed Spacer (ITS)
RuBisCo (rbcL)
megakaryocyte-associated tyrosine kinase (matk)
etc...

Nicholas_Bokulich · November 9, 2022, 1:14pm

An off-topic reply has been split into a new topic: RESCRIPt get-ncbi-data: is there a way to specify the users NCBI key?

Please keep replies on-topic in the future.

Nicholas_Bokulich · February 8, 2023, 3:55pm

2 off-topic replies have been split into a new topic: error while running RESCRIPt

Please keep replies on-topic in the future.

John · October 26, 2023, 4:10pm

Howdy! I am curious if you have any thoughts about the following outcome after running this script to build a few different DB's (rbcL, ITS2, CO1) and if you think what I'm getting is expected. For each of these databases, I extracted the region of interest using the primers and then proceeded to the pool expansion steps. I of course end up with a pretty small pool during the initial extraction with the primers. I then move to the expansion step and am starting conservative (perc identity of 0.90). I'm finding that by the third iteration of pool expansion the average length of the reference sequences goes from roughly the size I would expect for the amplicon (181 in the case of our anml primers) to upwards of a 1000 by the time I reach iteration three using 0.9 for each step. Moreover, it isn't just a few sequences that are now 1000; it's 75% or more of the database. This region does not vary significantly, so it seems worrisome. I'm curious if that is what you would expect? Do you think if I filter seqs based on what I would anticipate the max length to be in this database that I would be OK? I'm just worried at how large the sequences have gotten. Perhaps I don't clearly understand how it's working and that I shouldn't worry about these results. In fact, that's what I'm hoping!

SoilRotifer · October 26, 2023, 4:30pm

Great question! You are right to be concerned about this. This is exactly why I state in the tutorial:

This can become more of a problem if you are not cleaning the output after each iteration. The more ambiguous IUPAC bases you have the more spurious things can become. I've built databases for these amplicons (rbcL, ITS2, CO1) too, and have had to filter based on length etc.. Thus, I'd recommend checking a few of these spuriously long sequences via online BLAST, etc.

That being said...

It is okay if there is variation, as some of the data in GenBank might not be of high quality. Also, it could be, after these iterations, that you are at the point where you are extracting the fuller-length sequences of the marker genes you are searching for. If this is the case, then it might not be an issue.

Again, as you've noted, the initial PCR primer pair extracts only the amplicon region, the later iterations could expand your reference pool to sequences that contain a longer portion of your marker gene, in which the primer sequences are not a great match (which is why they were not extracted initially), as expected given primer biases etc...

Finally, you can consider starting with a 90% cutoff, then increase by 2-5 % after each iteration, e.g. 90, 95, 97, ...

Hopefully, this helps.

John · November 14, 2023, 3:45pm

Hi Mike,

This is indeed helpful and I appreciate your prompt feedback. I am curious what you use as a length cutoff for your rbcL database (if you have that info or recall)? The first iteration at 70% yields predominantly the size that we would expect from our primer set (334), but on the second iteration at 80% we start to see some rather large fragments (near 1k). They seem to be "real" when I blast, but not sure if you would advise cutting them off given that they are substantially larger than expected.

Cheers

SoilRotifer · November 14, 2023, 6:40pm

No problem.

If I remember correctly the rbcL gene is ~ 1400 bp. At least according to this.

Given my explanation above, you are likely starting to extract more full-length rbcL sequences as you iterate. Basically, you are getting leaky length extractions as you iterate. Which might be fine, especially if your manual BLAST confirms the gene and taxonomy assignments.

But if you are hoping to strictly stay within the bounds of your expected amplicon length, then Id simply do more iterations, at a much higher similarity, say 90% then go up.

One thought, you could follow up with a primer-pair search on the last extract sequence segment output. You might be able to extract more of your target amplicons when searching against this smaller pool of ~ 1kb fragments. Perhaps at this stage modify your primer pairs with more IUPAC ambiguity codes at this step?

Alternatively, you can align your extracted sequences with mafft using qiime alignment mafft ..., then add your primers to that alignment using qiime alignment mafft-add --p-addfragments .... Then you can export the resulting FASTA file into an alignment viewer of your choice. You'll use the alignment to see which alignment positions your primers are located. Once you make note of these positions, you can use qiime rescript trim-alignment --p-position-start XX --p-position-end YY ... to extract that section of the alignment and write it out to file. Then you can use qiime rescript degap-seqs ... to remove all the gaps, essentially unaligning the sequences for use in training a classifier.

Here is an old prototype pipeline I put together prior to RESCRIPt, which uses this alignment column extraction philosophy. I think it'll help clarify the approach I just outlined.

John · November 14, 2023, 7:22pm

I'm trying a more stringent build to see if/how it differs per the rec in the latest and previous reply. I like the idea of trying an additional primer-pair search on the final output. Seems worth a try.

It does seem that the "old skool" approach per your link to the old prototype yields results that are more in line with what I would expect. I think I might have used Devon's pipeline that he posted in qiime2, but it's pretty much the same thing. That said, I was hoping that if I continue to tweak the new methodology that I'd hit a sweet spot. Mostly because that older pipeline was a bit tedious