COI growth spurt

I've been working on building COI databases lately - shameless plug here! One of the (optional) steps involved trimming all the sequences within a particular primer region. In this case, the expected length would be trimmed to about 180 base pairs (see Step 5b of the post linked above).

When I looked at the distribution of those primer-trimmed sequences with my initial dataset, I was relieved to see that the vast majority of the sequences remaining fit neatly into that expected length:

For the last week, I've been exploring another RESCRIPt tool that can be used to build an alternative COI database using sequences from NCBI GenBank (as opposed to my initial method which pulled data from BOLD) - thanks!? for the extra work @BenKaehler and @Nicholas_Bokulich :face_with_raised_eyebrow: :smile:

When I pulled data from NCBI using the same filtering methods applied to my earlier BOLD dataset, the distribution looks similar insofar that the largest peak is around 180 bp - nice. However... there seems to be a lot of sequences larger than expected, and all at a particular length (around 250 bp):

@SoilRotifer - have you ever seen something like this where a primer-trimmed region has this kind of behavior? Recall that these sequences were initially dereplicated prior to generating the alignment used to identify the primer coordinates - in other words, this isn't the result of some quirk of one sequence being repeatedly uploaded to GenBank.

In fact, after a bit of digging into the taxonomic classifications of these 100,000 sequences, it seems like these longer sequences exist across the majority of COI taxa. It's in chordates, arthropods, molluscs; it's in fungi, and other microbe sequences. This makes me think that it isn't an artifact, but I'm not really sure what the next step would be to identify if it was an artifact at all.

The next step I was was considering was to take about 1000 sequences from the group that are longer than 200 bp, and 1000 sequences from the group that are about 180 bp, align them to each other, and see if there is a consistent location where the longer sequences get their added length from.

Appreciate any other insights and critiques. Thanks!

1 Like

Hi @devonorourke, this is great detective work!

I’ve seen this several times for a variety of marker genes. My first instinct would be to make sure these are not mis annotated gene sequences. I’ve had cases where I’ve captured related or paralogous sequences as part of my downloaded data. I would:

  • Spot-check the gene annotations.
  • Spot-check associated sequence taxonomy
  • Make the alignment you spoke of, and then build a phylogeny (w/o masking the alignment). To make sure these are not paralogous gene sequences. In other words, make sure the annotated taxonomies and phylogeny are mostly consistent with each other.
    • Regarding paralogs: likely numts (nuclear mitochondrial DNA), i.e. CO1 gene copies that have migrated to the nucleus of eukaryotes. These are commonly encountered in many of environmental sequence surveys, and often not properly spotted / vetted prior to being uploaded and correctly annotated as such in a database.

I’d be happy to help with this, particularly when it comes to building the COI phylogenies:


Thanks for great ideas @SoilRotifer ,

I was hoping you could provide an example of what you mean in terms of:

  • spot checking gene annotations
  • spot checking sequence taxonomy

Because these data were obtained using get-ncbi-data, I know the parameters of the search terms, but I don’t have any metadata records on hand beyond their accession numbers. Is that what you meant by gene annotation?

In terms of checking on sequence taxonomy, I can confirm these 100,000 sequences come from a range of taxa. The proportion of these longer fragments generally fall into the same proportions of the entire data set: mostly are arthropods :honeybee: , lots of molluscs :squid: and chordates :bat: , and a grab bag of other taxa :microbe: :mushroom:

I building a tree, the tutorial you linked seems very clear. I’ll get working on that right away, but was curious about:

  • if there are any particular parameters and/or programs you’d recommend
  • what I should be looking for in the rooted (or unrooted) output

My guess is that the numts should form their own clade, correct? The reason I’m suspicious that it’s numts is that there are thousands and thousands of these nearly identically long sequences coming from a huge range of taxa. If they were really numts, wouldn’t you expect a broader range of lengths than this one single peak at ~ 250 bp?


I was just thinking that you might have to manually look-up a few GenBank records, and see what is listed as the gene and taxonomy annotations, if any. Though this may not be needed if you used this as part of your search query. Perhaps even run blastx on a few of these to see what coding sequence is returned?

Also, by checking the taxonomy, I mean: How often do the taxonomies of these longer sequences match the exact taxonomy from the sequences that fit your expected sequence lengths? If many of these sequences have the same exact taxonomies, then that points to something interesting. These would be good targets to construct a tree from. Perhaps write a script that will extract a subset sequences from both of these groups, i.e. 180 bp and the 250 bp, that have exact matching taxonomies (or at least to genus) and make a tree?

What did you use as your GenBank queries? Depending on how well structured, or not, these are, there is a small chance that non-COI sequence data are being returned. Though I am sure it’s fine, just asking as more of a sanity-check. :slight_smile:

Maximum likelihood phylogenies are often returned as unrooted trees. You’ll have to run qiime phylogeny midpoint-root... to root them. For de novo phylogenetic inference, I currently prefer iqtree. It will try and find the best substitution model for your data prior to building the tree. I’d use default values and use the --p-fast and --p-n-cores 'auto' options. But, if you want to ignore the model testing part (which can take a while), using the --p-substitution-model 'GTR+G' option should be sufficient.

No, not necessarily… numts can potentially be spread throughout the tree. That is the numt of one organism can potentially look like the actual mitochondrial sequence of another divergent taxa. Hence the problem with numts. I encountered many of them just from my soil rotifer sequencing surveys. There is other literature out there of numts in fungi, etc…

It is indeed interesting that there is a consistent 250 bp length for these sequences. They could indeed be real and not numts! I’d look up the publications from which these sequences are associated and see if any of the authors make note of the length distribution of their PCR products, etc… However, I think the taxonomic sanity check procedure I outlined above may help shed light on this.

-Best wishes!

Thanks again for the extensive response @SoilRotifer,

After looking up a few of the GenBank accession IDs, the records tend to contain some mention of either "COI" or "cytochrome oxidase I" or something like that. See examples here, here, and here.

I can confirm that in those three examples above, all return BLAST searchers for COI sequences. So they're certainly the right(ish) gene (that is, they're clearly not a misannotation of a fragment unrelated to COI like cytB or something).

Unfortunately, the papers in these examples all generated COI fragments much longer than the sequences I'm comparing - these were near Sanger-sequenced COI fragments, usually about 700-800 bp. Thus, when papers report varying lengths (if observed), it's tricky to know if that relates to my particular region. I wonder, however, if the fact that these were longer fragments generated by Sanger in the first place is a meaningful safeguard against numts in the first place? This was my understanding from a section of the paper you highlighted at the beginning of this thread:

Semi-nested/nested PCR amplification on long-range PCR product . Numts tend to be short (with a mean length approaching 100 to 300 bp (Pamilo et al., 2007, Richly and Leister, 2004)) while native mtDNA is a ca. 16 kbp molecule in metazoans. Hence, PCR of standard length ( e.g. ca. 650 bp for the primer set LCO1490/HCO2198 (Folmer et al., 1994)) might benefit from being performed on longer, multi kbp amplicons. Maybe these really are just interesting, but particular, differences in COI?

@BenKaehler might chime in here with a correction, but to the best of my knowledge, the NCBI data in question was obtained with the following command:

qiime rescript get-ncbi-data \
--p-query '(cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title]) NOT "BARCODE"[KYWD])' \
--output-dir NCBIdata_notBOLD

This search term was used to collect only those COI sequences that didn't also have a cross-reference keyword for BOLD.

As an aside, I also have conducted the same analysis using those that included that keyword (that is, the NCBI sequences that do contain the BOLD cross reference keyword). And guess what that sequence composition looks like? ...

:drum: roll ....

Well, it looks just like the stuff I pull from BOLD directly - no evidence of substantial 250bp insertions, despite coming from NCBI too!:

I'm going to try building a phylogeny now and see what I can come up with. The mystery continues.
:alien: :dna:?

1 Like

‘(cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title]) NOT “BARCODE”[KYWD])’

Okay cool, quite similar to what I’ve run in the past:

(cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title]) AND mitochondrion[Filter] NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]"

Note that I added a few more terms limit other spurious data.

Hmm good point. I think that my indeed be the case.

Well, hot dang! that is interesting!


Appreciate all the additional filters - I’ll be sure to pass along these to @BenKaehler and @Nicholas_Bokulich who were the ones assisting me with generating the search terms to begin with.

More to follow in a few. thanks!

No need, this was from my search query that I used, and provided, when we were testing the get-ncbi-data code. :slight_smile:

Any recommendations @SoilRotifer on how to examine these 270,000 labels?
It’s a big tree. :deciduous_tree:

I started by filtering sequences into two bins: those with degapped sequence lengths between 175-185 bp, and those with 250-260 bp. From those two bins, I identified the taxonomy labels from Kingdom to Genus that were shared among these sequences and then built a tree with iqtree .

Having spent only a little time with FigTree in the past building phylogenies out of a few dozen samples, viewing this is… a bit more challenging? Thanks for any tips on how to visualize this, or perhaps filter it without visualizing to still understand whether the shorter/longer sequences tend to fall into similar clades

Hi @devonorourke,

Ohh, that is a big tree! I’d recommend using Dendroscope, it scales really well to large trees. Otherwise you may have to play around with more automated ways of coloring / annotating the tree with ggtree, dendropy, ETE, …