I'm using the commonly adopted COI primer pair mlCOIintF–jgHCO2198, which is designed to amplify a ~313 bp fragment. In my first-round PCR (including Illumina overhang adapters), I observed a band at around 465 bp on the TapeStation, which matches the expected size when overhangs are included.
However, after running qiime dada2 denoise-paired and checking the representative sequences, the final sequence lengths are consistently 313 bp, as originally expected.
After the second PCR (indexing), my library size increases further to ~500–520 bp. As far as I understand, the full Illumina adapter + index combination should add roughly 120 bp, so this size seems larger than expected.
I'd appreciate any clarification on the following:
Why does the PCR product appear as ~465 bp (with overhangs), but becomes exactly 313 bp in QIIME2 after denoising?
Why might my final library size (after indexing) reach ~520 bp when I expect ~313 + 120 = ~433 bp?
It would be helpful to see your entire workflow. Can you provide the commands you used?
The primers that you are using should produce a fragment of ~365 bp (the targeted 313 bases plus the two 26 bp primers). See this wonderful paper by Elbrecht et al. 2019.
Are you sure the 465 bp fragment length you are reporting is not actually 365 bp?
Assuming you are using a standard sequencing protocol (where the primer sequence is contained within the reads), you should remove the PCR primers from your sequences with cutadapt before denoising. That said, I can't tell whether this was already done before denoising, or whether your sequencing protocol does not read through the primer (that is, the primer sequence is not contained within your reads). One of these two conditions must hold for you to be able to reconstruct a 313 bp fragment.
It's been a while since I've dealt with sequencing libraries directly, so I'll leave it to someone else to address your question about library length. It would also help to know which library construction protocol you are using.
Thank you for your thoughtful response and for sharing the Elbrecht et al. (2019) paper—it’s indeed a great reference.
To clarify my workflow:
I performed initial quality control using cutadapt with parameters -q 20 -m 200.
During the denoising step, I trimmed 26 bp from both forward and reverse reads to remove the primer sequences (--p-trim-left-f 26, --p-trim-left-r 26).
The sequencing was done using Illumina MiSeq 2 × 300 bp paired-end reads.
Based on your comment and the cited paper, I agree that the expected amplicon length after primer removal should be around 313 bp, which aligns with my observed results in QIIME2.
However, what still puzzles me is the ~30–40 bp discrepancy between the expected PCR product length (313 bp insert + ~52 bp primer + ~67 bp overhang ≈ ~430 bp) and the ~460–470 bp size I observed in TapeStation analysis. I’m unsure what accounts for this difference and would appreciate any insight or suggestions from the community.
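Written out as a quick Python sketch, the arithmetic looks like this. It uses only the approximate figures quoted in this thread; the variable names are illustrative, and treating the ~120 bp adapter + index figure as the total added on top of the primer-containing amplicon is an assumption. Note that the 313 + 120 = 433 bp estimate in the original question leaves out the 52 bp of primer sequence.

```python
# Expected fragment sizes, using only the approximate figures quoted in this thread.
INSERT_BP = 313          # target region between the primers
PRIMER_BP = 26 + 26      # forward + reverse primer, 26 bp each
OVERHANG_BP = 67         # total overhang length quoted above (approximate)
ADAPTER_INDEX_BP = 120   # full adapter + index estimate from the original question
                         # (assumption: total added to the primer-containing amplicon)

amplicon_with_primers = INSERT_BP + PRIMER_BP                # 365
first_pcr_expected = amplicon_with_primers + OVERHANG_BP     # 432
final_expected = amplicon_with_primers + ADAPTER_INDEX_BP    # 485


def gap(observed_range, expected):
    """Return (low, high) difference between an observed size range and an expected size."""
    low, high = observed_range
    return (low - expected, high - expected)


first_pcr_gap = gap((460, 470), first_pcr_expected)   # TapeStation, first PCR
final_gap = gap((500, 520), final_expected)           # TapeStation, after indexing

print(f"amplicon incl. primers: {amplicon_with_primers} bp")
print(f"first PCR expected:     {first_pcr_expected} bp, gap {first_pcr_gap[0]}-{first_pcr_gap[1]} bp")
print(f"final library expected: {final_expected} bp, gap {final_gap[0]}-{final_gap[1]} bp")
```

Under these assumptions the first-round product comes out at 432 bp, so the unexplained difference is roughly 28 to 38 bp, and the final library is about 15 to 35 bp larger than the 485 bp estimate.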
I'd suggest running cutadapt (qiime cutadapt trim-paired) with the --p-discard-untrimmed flag to remove the primers explicitly, instead of relying on the DADA2 trim-left parameters. I've often seen indels around the primer region, and these can inflate ASV counts. Also, some sequencing protocols read a few extra bases before reaching the primer sequence. If you trim a fixed number of bases, those extra bases at the 5' end mean that part of the primer may be left in your read. Cutadapt, by contrast, removes the primer and any bases preceding it at the 5' end.
Cutadapt also allows some mismatches, so it removes the primer region more consistently. Remember, a sequence that differs by even a single indel will likely be treated as a new ASV.
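To make the difference concrete, here is a toy Python sketch comparing fixed-length trimming with primer-anchored trimming. It is not how cutadapt works internally (cutadapt uses error-tolerant alignment; this sketch only does exact matching against IUPAC codes), the reads and the insert are made up, and the forward primer sequence is the commonly published mlCOIintF sequence, which you should verify against your own oligo order.

```python
import re

# IUPAC degenerate bases expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a degenerate primer into a regex (exact matching only)."""
    return re.compile("".join(IUPAC[base] for base in primer))


def trim_fixed(read, n):
    """Drop the first n bases, regardless of where the primer actually sits."""
    return read[n:]


def trim_by_primer(read, primer):
    """Drop the primer and everything 5' of it; return None if the primer is absent."""
    match = primer_regex(primer).search(read)
    return read[match.end():] if match else None


PRIMER = "GGWACWGGWTGAACWGTWTAYCCYCC"   # mlCOIintF, 26 bp (verify against your order)
INSERT = "TCTATCTGCAAATCTTGCA"          # made-up insert, for illustration only

read_clean = "GGAACAGGATGAACAGTATATCCTCC" + INSERT   # primer starts at base 1
read_shifted = "AC" + read_clean                     # two extra bases before the primer
read_offtarget = "ACGTACGTACGTACGTACGTACGTACGTACGT"  # no primer present

print(trim_fixed(read_clean, 26) == INSERT)            # fixed trim works here
print(trim_fixed(read_shifted, 26) == "CC" + INSERT)   # two primer bases left behind
print(trim_by_primer(read_shifted, PRIMER) == INSERT)  # primer search removes them
print(trim_by_primer(read_offtarget, PRIMER))          # None: read would be discarded
```

The shifted read is the case to watch: a fixed 26 bp trim leaves the last two primer bases in the read, while searching for the primer removes them along with the extra leading bases.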
I also use cutadapt as a form of quality control. If cutadapt cannot find the primer in a read, the rest of the sequence is likely poor quality, or a spurious off-target read, anyway. So when cutadapt cannot find the primer, it discards the read pair.
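As a starting point, the call could be assembled along these lines. This is only a sketch: the file names demux.qza and demux-trimmed.qza are hypothetical, the primer sequences are the commonly published ones and should be checked against your own order, and replacing the inosines in jgHCO2198 with N is my assumption about how to express them, since "I" is not an IUPAC code. The Python below only builds and prints the command; it does not run QIIME 2.

```python
import shlex

# Commonly published primer sequences; verify against your own oligo order.
FWD_PRIMER = "GGWACWGGWTGAACWGTWTAYCCYCC"   # mlCOIintF
REV_PRIMER = "TAIACYTCIGGRTGICCRAARAAYCA"   # jgHCO2198 (I = inosine)


def cutadapt_trim_paired_cmd(demux_qza, out_qza, fwd_primer, rev_primer):
    """Build a `qiime cutadapt trim-paired` argument list (not executed here)."""
    # Assumption: inosine is written as N so the primer uses IUPAC codes only.
    fwd = fwd_primer.upper().replace("I", "N")
    rev = rev_primer.upper().replace("I", "N")
    return [
        "qiime", "cutadapt", "trim-paired",
        "--i-demultiplexed-sequences", demux_qza,
        "--p-front-f", fwd,            # 5' primer expected in the forward reads
        "--p-front-r", rev,            # 5' primer expected in the reverse reads
        "--p-discard-untrimmed",       # drop read pairs where no primer was found
        "--o-trimmed-sequences", out_qza,
    ]


# Hypothetical artifact names, for illustration only.
cmd = cutadapt_trim_paired_cmd("demux.qza", "demux-trimmed.qza", FWD_PRIMER, REV_PRIMER)
print(shlex.join(cmd))
```

With the primers already removed this way, the --p-trim-left-f 26 and --p-trim-left-r 26 settings in DADA2 would no longer be needed for primer removal.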
My only guess is that some information is missing from your library prep protocol. Have you reached out to the company or sequencing provider? They might have more information for you. On the other hand, if you are getting the data you expect, you should be fine, though I totally understand that it would be helpful to know. I am sure someone smarter than me will provide more concrete answers.
Thank you so much for the clear explanation — it helped me realize an important detail I had previously overlooked. As you mentioned, since the results are as expected, I believe things should be fine. I really appreciate your kind and thoughtful support!