Using RESCRIPt's 'extract-seq-segments' to extract reference sequences without PCR primer pairs.

Great question! You are right to be concerned about this. This is exactly why I state in the tutorial:

This can become more of a problem if you are not cleaning the output after each iteration. The more ambiguous IUPAC bases you have the more spurious things can become. I've built databases for these amplicons (rbcL, ITS2, CO1) too, and have had to filter based on length etc.. Thus, I'd recommend checking a few of these spuriously long sequences via online BLAST, etc.

That being said...

It is okay if there is variation, as some of the data in GenBank might not be of high quality. Also, it could be, after these iterations, that you are at the point where you are extracting the fuller-length sequences of the marker genes you are searching for. If this is the case, then it might not be an issue.

Again, as you've noted, the initial PCR primer pair extracts only the amplicon region, the later iterations could expand your reference pool to sequences that contain a longer portion of your marker gene, in which the primer sequences are not a great match (which is why they were not extracted initially), as expected given primer biases etc...

Finally, you can consider starting with a 90% cutoff, then increase by 2-5 % after each iteration, e.g. 90, 95, 97, ...

Hopefully, this helps. :slight_smile:

3 Likes