COI Workflow Parameter Considerations


I have been reading through the forum resources related to CO1, in particular the resources and information posted for generating CO1 reference databases.

I have some data to run for the Elbrecht et al. macroinvert primer set, but this is my first time running CO1 data. I haven't found anything that discusses differences I should consider when using this locus as compared to other regions (i.e., mitochondrial loci). I know that CO1 has degenerate bases that could be problematic during the DADA2 step, in which sequences with degenerate bases could be identified as chimeras and thrown out. Does the general QIIME 2 workflow stay the same for processing data for this locus? Any important step modifications I should consider would be very much appreciated!



Mainly replying as I am also interested in the answer! I am currently helping on a project looking at groundwater macroinvertebrates and these primers (and some others) are currently being tested in the lab - happy to share databases compiled using that tutorial as a starter.


Hi @lastewart ,

I can't help point you to papers that compare different mitochondrial loci, but certainly folks in this forum like @SoilRotifer could tell you more about the different challenges with different primer sets.

I would offer up the general notion that no matter what DNA barcode you plan on using, be it mitochondrial or nuclear, if the goal is to cast a wide macroinvert net, I'm not sure you're going to beat the BOLD database for sheer breadth of taxonomic representation.

If I could rewind my graduate experience, I would have told that younger self to find a spot in a lab with access to a bevy of vouchered invertebrate specimens that I knew were going to be representative of my own study system. I would also have told that self not to study a system that spanned a dozen insect orders, plus spiders :exploding_head: ... as the number of samples needed to achieve something that constitutes "representative" at that point gets untenable... which is why you kind of always end up resorting to NCBI or BOLD.

So, to get back on track to your first question: I can't help you directly in finding a paper, but even if you did find one, I think it is worth asking whether these other regions come with a database of the taxonomic labels you'll need in your analyses, and if they don't have that database, then what?

To your second question about degenerate bases in DADA2... I'd never considered my sequences as having degenerate bases, because it's not like the Illumina reads were themselves degenerate, nor were the primers, truly. They were just a mixed bag of something like attccAcctta and attccGcctta, where the A/G position was written out in its degenerate form as attccRcctta when ordering primers. @benjjneb can certainly speak to this and correct my thinking, but I can't think of a case where a sufficient number of sequence reads that the "degenerate" primer set amplified would be removed by denoising. Maybe if there was a particular combination of these degenerate positions that (1) amplified much less, and (2) somehow resulted in lower sequence quality? Maybe then DADA2 would throw them out?
Otherwise, DADA2 should just keep each of these as unique sequence variants.
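To make the "mixed bag" point concrete, here is a minimal shell sketch that expands a degenerate primer into the concrete oligos it stands for. The helper name `expand_primer` is hypothetical, and it only handles the uppercase two-base IUPAC codes R, Y, W and S; it is meant as an illustration of the idea, not as part of any QIIME 2 or DADA2 workflow.

```shell
#!/bin/sh
# expand_primer (hypothetical helper): print every concrete oligo encoded by
# a primer that uses uppercase IUPAC codes R (A/G), Y (C/T), W (A/T), S (C/G).
# Lowercase letters and A/C/G/T are passed through unchanged.
expand_primer() {
    case "$1" in
        # ${1%%X*} is the text before the first X, ${1#*X} the text after it.
        *R*) expand_primer "${1%%R*}A${1#*R}"; expand_primer "${1%%R*}G${1#*R}" ;;
        *Y*) expand_primer "${1%%Y*}C${1#*Y}"; expand_primer "${1%%Y*}T${1#*Y}" ;;
        *W*) expand_primer "${1%%W*}A${1#*W}"; expand_primer "${1%%W*}T${1#*W}" ;;
        *S*) expand_primer "${1%%S*}C${1#*S}"; expand_primer "${1%%S*}G${1#*S}" ;;
        # No degenerate code left: this is one concrete oligo.
        *)   printf '%s\n' "$1" ;;
    esac
}

# The example from the post: one degenerate position, so two oligos.
expand_primer attccRcctta
```

Each line of output is an ordinary, non-degenerate sequence, which is what actually ends up in the tube and in the reads.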

One piece I'm unclear about is the concern that sequences with degenerate bases could be identified as chimeras and thrown out.

Where does this happen in DADA2? Curious to hear more about the concern and maybe realize I've been overlooking an important consideration for years :cry: !?


Hi @Micro_Biologist,

I'm so glad to hear that you are also working on this primer set. I haven't been able to find a lot in the way of analysis publications for this region yet. But, I am definitely happy to share whatever resources I generate as well. I am currently working through the COI BOLD steps. Have you been able to compile a database yet? I would be very interested in comparing our outcomes to see how they differ.

More than happy to share. I kept as much detail as I could in there, but if you need any more let me know. Our lab still hasn't actually run them yet, so I've no idea how they will work in reality. We're comparing BF1/BR2 and BF2/BR2 from that paper, the 'standard' Leray primers, and 2 sets from a follow-up paper in 2020 from Leese et al.

I should have some preliminary results soon - I believe 1 of the primer sets from the 2020 paper just didn't work although this may be due to time constraints.


Hi @lastewart & @Micro_Biologist,

You can modify the approach outlined here, by replacing the initial step with:

! qiime rescript get-ncbi-data \
    --p-query "txid33208[ORGN] AND (cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title]) AND mitochondrion[Filter] NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]" \
    --p-ranks kingdom phylum class order family genus species \
    --p-rank-propagation \
    --p-n-jobs 1 \
    --o-sequences COI-ref-seqs.qza \
    --o-taxonomy COI-ref-tax.qza

Note: this example downloads metazoa. As there is a lot of data, I'd recommend that you download in chunks by taxonomic group. For example, you can replace the txid number with that of each taxonomic group you think is relevant for your reference database, e.g.:

  • fungi: txid4751[ORGN]
  • rhodophyta: txid2763[ORGN]
  • alveolata: txid33630[ORGN]
  • viridiplantae: txid33090[ORGN]
  • stramenopiles: txid33634[ORGN]
  • rhizaria: txid543769[ORGN]
  • etc...

Then use qiime feature-table merge-seqs ... and qiime feature-table merge-taxa ... on the outputs. Then proceed with the rest of the tutorial linked above adjusting all the downstream commands as appropriate.
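As a rough sketch of how the chunked download and merge could be scripted, the shell snippet below loops over a few of the taxonomy IDs listed above, then merges the per-group outputs. The helper names `run` and `build_query`, the `DRY_RUN` switch, and the per-group file names are all assumptions made for illustration. By default it only prints the commands (without their shell quoting); set `DRY_RUN=0` to actually execute them in an environment where QIIME 2 and RESCRIPt are installed.

```shell
#!/bin/sh
# run (hypothetical helper): print the command when DRY_RUN=1 (the default),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# COI title and filter terms, copied from the query in the post.
COI_CLAUSE='(cytochrome c oxidase subunit 1[Title] OR cytochrome c oxidase subunit I[Title] OR cytochrome oxidase subunit 1[Title] OR cytochrome oxidase subunit I[Title] OR COX1[Title] OR CO1[Title] OR COI[Title]) AND mitochondrion[Filter] NOT environmental sample[Title] NOT environmental samples[Title] NOT environmental[Title] NOT uncultured[Title] NOT unclassified[Title] NOT unidentified[Title] NOT unverified[Title]'

# build_query (hypothetical helper): $1 is an NCBI taxonomy ID.
build_query() {
    printf 'txid%s[ORGN] AND %s' "$1" "$COI_CLAUSE"
}

seq_inputs=""
tax_inputs=""
# name:txid pairs taken from the post; output file names are placeholders.
for entry in metazoa:33208 fungi:4751 rhodophyta:2763; do
    name=${entry%%:*}
    txid=${entry#*:}
    run qiime rescript get-ncbi-data \
        --p-query "$(build_query "$txid")" \
        --p-ranks kingdom phylum class order family genus species \
        --p-rank-propagation \
        --p-n-jobs 1 \
        --o-sequences "$name-seqs.qza" \
        --o-taxonomy "$name-tax.qza"
    # Collect one --i-data argument per downloaded group for the merge step.
    seq_inputs="$seq_inputs --i-data $name-seqs.qza"
    tax_inputs="$tax_inputs --i-data $name-tax.qza"
done

# Unquoted on purpose: the variables must split into separate arguments
# (safe here because the file names contain no spaces).
run qiime feature-table merge-seqs $seq_inputs --o-merged-data COI-ref-seqs.qza
run qiime feature-table merge-taxa $tax_inputs --o-merged-data COI-ref-tax.qza
```

Downloading one group per call keeps each NCBI request smaller, and a failed group can be re-run on its own before merging.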

