I've been working on building COI databases lately - shameless plug here! One of the (optional) steps involved trimming all the sequences to a particular primer region. In this case, the trimmed sequences were expected to be about 180 base pairs long (see Step 5b of the post linked above).
When I looked at the distribution of those primer-trimmed sequences with my initial dataset, I was relieved to see that the vast majority of the sequences remaining fit neatly into that expected length:
For the last week, I've been exploring another RESCRIPt tool that can be used to build an alternative COI database using sequences from NCBI GenBank (as opposed to my initial method, which pulled data from BOLD) - thanks!? for the extra work @BenKaehler and @Nicholas_Bokulich.
When I pulled data from NCBI using the same filtering methods applied to my earlier BOLD dataset, the distribution looked similar insofar as the largest peak is around 180 bp - nice. However... there seem to be a lot of sequences longer than expected, and all at a particular length (around 250 bp):
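For anyone who wants to check the same thing on their own data, here is a minimal sketch of how the length distribution could be tallied from the primer-trimmed sequences once they are exported to a plain FASTA file. The function names (`read_fasta`, `length_counts`) and the toy sequences are just illustrative assumptions, not part of RESCRIPt or QIIME 2.

```python
from collections import Counter
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def length_counts(handle):
    """Count how many sequences occur at each length."""
    return Counter(len(seq) for _, seq in read_fasta(handle))


# Toy input standing in for an exported FASTA file (hypothetical data).
demo = io.StringIO(">a\nACGT\n>b\nACGTAC\nGT\n>c\nACGT\n")
counts = length_counts(demo)
for length, n in sorted(counts.items()):
    print(length, n)
```

With a real file, `open("trimmed.fasta")` (a hypothetical filename) would replace the `io.StringIO` object, and a second peak such as the one around 250 bp would show up as a cluster of large counts well away from 180.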
@SoilRotifer - have you ever seen something like this where a primer-trimmed region has this kind of behavior? Recall that these sequences were initially dereplicated prior to generating the alignment used to identify the primer coordinates - in other words, this isn't the result of some quirk of one sequence being repeatedly uploaded to GenBank.
In fact, after a bit of digging into the taxonomic classifications of these 100,000 sequences, it seems like these longer sequences exist across the majority of COI taxa. They're in chordates, arthropods, molluscs; they're in fungi and other microbe sequences. This makes me think that it isn't an artifact, but I'm not really sure what the next step would be to determine whether it is an artifact at all.
The next step I was considering was to take about 1000 sequences from the group that are longer than 200 bp, and 1000 sequences from the group that are about 180 bp, align them to each other, and see if there is a consistent location where the longer sequences get their added length from.
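A rough sketch of that plan, in case it helps frame the question: sample from the two length groups, align the combined set with an external aligner (MAFFT would be one option; the alignment step itself is not shown), then compare per-column gap occupancy between the groups. The 170-190 bp window for "about 180 bp", the 0.5 threshold, and all function names here are my own assumptions for illustration.

```python
import random


def sample_by_length(records, short_range=(170, 190), long_min=200, n=1000, seed=0):
    """Split (header, seq) records into two length groups; sample up to n from each."""
    short = [r for r in records if short_range[0] <= len(r[1]) <= short_range[1]]
    long_ = [r for r in records if len(r[1]) > long_min]
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return (rng.sample(short, min(n, len(short))),
            rng.sample(long_, min(n, len(long_))))


def column_occupancy(aligned_seqs):
    """Fraction of non-gap characters in each column of an alignment."""
    if not aligned_seqs:
        return []
    width = len(aligned_seqs[0])
    if any(len(s) != width for s in aligned_seqs):
        raise ValueError("aligned sequences must all have the same length")
    return [sum(s[i] != "-" for s in aligned_seqs) / len(aligned_seqs)
            for i in range(width)]


def extra_columns(short_aln, long_aln, threshold=0.5):
    """Columns mostly filled in the long group but mostly gaps in the short group."""
    short_occ = column_occupancy(short_aln)
    long_occ = column_occupancy(long_aln)
    return [i for i, (s, l) in enumerate(zip(short_occ, long_occ))
            if l >= threshold and s < threshold]


# Toy records and a toy alignment (hypothetical data, not real COI sequences).
records = [("s1", "A" * 180), ("s2", "A" * 181), ("l1", "A" * 250), ("x", "A" * 195)]
short_sample, long_sample = sample_by_length(records, n=1)

short_aln = ["ACG---TT", "ACG---TA"]
long_aln = ["ACGGGATT", "ACGGCATT"]
print(extra_columns(short_aln, long_aln))
```

If the columns returned by `extra_columns` form one contiguous block, that would point to a single insertion-like region; if they sit at one end of the alignment, the extra length more likely comes from trimming at a different position than expected.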
Appreciate any other insights and critiques. Thanks!