Howdy! I'm curious whether you have any thoughts on the following outcome after running this script to build a few different DBs (rbcL, ITS2, CO1), and whether what I'm getting is expected.

For each of these databases, I extracted the region of interest using the primers and then proceeded to the pool-expansion steps. As expected, I end up with a pretty small pool from the initial primer-based extraction. I then move to the expansion step, starting conservatively (perc-identity of 0.90). I'm finding that by the third iteration of pool expansion, using 0.90 at each step, the average length of the reference sequences goes from roughly the expected amplicon size (181 bp in the case of our ANML primers) to upwards of 1,000 bp. Moreover, it isn't just a few sequences that are now ~1,000 bp; it's 75% or more of the database. This region does not vary much in length, so the result seems worrisome.

Is that what you would expect? Do you think I would be OK if I filtered sequences by the maximum length I'd anticipate for this database? I'm just worried about how long the sequences have gotten. Perhaps I don't clearly understand how the expansion works and shouldn't be concerned by these results; in fact, that's what I'm hoping!
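For context, each expansion iteration looks roughly like the following (a sketch based on the RESCRIPt `extract-seq-segments` tutorial, not my verbatim script; the file names are placeholders):

```bash
# Iteration 2: feed the segments recovered in iteration 1 back in as the
# reference pool and search the full source database again at 90% identity.
qiime rescript extract-seq-segments \
    --i-input-sequences source-db-seqs.qza \
    --i-reference-segment-sequences seq-segments-iter1.qza \
    --p-perc-identity 0.90 \
    --o-extracted-sequence-segments seq-segments-iter2.qza \
    --o-unmatched-sequences unmatched-seqs-iter2.qza \
    --verbose
```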
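And for the length-based cleanup I'm proposing, I had something like this in mind (again just a sketch, assuming `filter-seqs-length` behaves as in the RESCRIPt docs; the 250 bp ceiling is only an example for a ~181 bp amplicon, not a validated cutoff):

```bash
# Drop anything outside the plausible length range for this amplicon,
# keeping the discarded set so I can inspect what gets removed.
qiime rescript filter-seqs-length \
    --i-sequences seq-segments-iter3.qza \
    --p-global-min 100 \
    --p-global-max 250 \
    --o-filtered-seqs seq-segments-iter3-length-filtered.qza \
    --o-discarded-seqs seq-segments-iter3-too-long.qza
```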