COI Workflow Parameter Considerations


I have been reading through the forum resources related to CO1, in particular the resources and information posted for generating CO1 reference databases.

I have some data to run for the Elbrecht et al. macroinvert primer set, but this is my first time running CO1 data. I haven't found anything that discusses differences I should consider when using this locus as compared to other regions (i.e., mitochondrial loci). I know that CO1 has degenerate bases that could potentially be problematic during the DADA2 step, in which sequences with degenerate bases could be identified as chimeras and thrown out. Does the general QIIME 2 workflow stay the same for processing data for this locus? Any important step modifications I should consider would be very much appreciated!


Mainly replying as I am also interested in the answer! I am currently helping on a project looking at groundwater macroinvertebrates and these primers (and some others) are currently being tested in the lab - happy to share databases compiled using that tutorial as a starter.

Hi @lastewart ,

I can't point you to papers that compare different mitochondrial loci, but folks in this forum like @SoilRotifer could certainly tell you more about the different challenges that come with different primer sets.

I would offer up the general notion that no matter what DNA barcode you plan on using, be it mitochondrial or nuclear, if the goal is to cast a wide macroinvert net, I'm not sure you're going to beat the BOLD database for sheer breadth of taxonomic representation.

If I could rewind my graduate experience, I would have told that younger self to find a spot in a lab with access to a bevy of vouchered invertebrate specimens that I knew were going to be representative of my own study system. I would also have told that self not to study a system that spanned a dozen insect orders, plus spiders :exploding_head: ... as the number of samples needed to achieve something that constitutes "representative" at that point gets untenable... which is why you kind of always end up resorting to NCBI or BOLD.

So, to get back on track to your first question: I can't help you directly in finding a paper, but even if you did find one, I think it is worth asking whether these other regions come with a database of the taxonomic labels you'll need in your analyses. And if they don't have that database, then what?

To your second question about degenerate bases in DADA2... I'd never considered my sequences as having degenerate bases, because it's not like the Illumina reads were themselves degenerate, nor, truly, were the primers: they were just a mixed bag of something like attccAcctta and attccGcctta, where the A/G position was written out in its degenerate form as attccRcctta when ordering primers. @benjjneb can certainly speak to this and correct my thinking, but I can't think of a case where a sufficient number of the sequence reads that the "degenerate" primer set amplified would be removed by denoising. Maybe if there was a particular combination of these degenerate positions that (1) amplified much less, and (2) somehow resulted in lower sequence quality? Maybe then DADA2 would throw them out?
Otherwise, DADA2 should just keep each of these as unique sequence variants.
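
To make the "mixed bag" point concrete, here is a minimal Python sketch that expands a primer written with IUPAC degenerate codes into the concrete oligos it stands for. The function name `expand_degenerate` is just something I made up for illustration (it isn't part of QIIME 2 or DADA2), and the primer is the toy attccRcctta example from above, not a real CO1 primer.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand_degenerate(primer):
    """Return every concrete sequence encoded by a degenerate primer.

    Hypothetical helper for illustration only.
    """
    # One string of allowed bases per primer position.
    choices = [IUPAC[base] for base in primer.upper()]
    # Cartesian product over positions gives each non-degenerate oligo.
    return ["".join(combo) for combo in product(*choices)]

# Toy primer from the post above: R stands for A or G.
variants = expand_degenerate("attccRcctta")
print(variants)  # ['ATTCCACCTTA', 'ATTCCGCCTTA']
```

Each string in that list is an ordinary, fully specified sequence, which is what actually ends up in the tube and in the reads. The number of oligos multiplies with every degenerate position, so a primer with several such positions is a much bigger mix.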

One piece I'm unclear about:

Where does this happen in DADA2? Curious to hear more about the concern and maybe realize I've been overlooking an important consideration for years :cry: !?


Hi @Micro_Biologist,

I'm so glad to hear that you are also working on this primer set. I haven't been able to find a lot in the way of analysis publications for this region yet. But, I am definitely happy to share whatever resources I generate as well. I am currently working through the COI BOLD steps. Have you been able to compile a database yet? I would be very interested in comparing our outcomes to see how they differ.