Illumina read primer mispriming?

jsogin574 · June 29, 2021, 9:09pm

I am currently analyzing oldish Illumina data of ITS2 sequences prepared via a two-step protocol with ITS3 and ITS4 primers. I've noticed a peculiar case of 'missing primers in sequencing data' and am wondering if anyone has come across a similar issue. I'm posting this in General Discussion because I'm not having any trouble using QIIME or understanding its outputs... the data is just troubling.

Here is the situation:

My sequencing facility does not mix forward and reverse fragments when sequencing; i.e., forward fragments are in R1, reverse in R2.
The 2 x 250 bp MiSeq run seemed completely fine from a sequencing quality standpoint. I ran ITS samples with complementary 16S samples.
I've run q2-cutadapt a few times playing mostly with --p-error-rate and --p-overlap. The critical parameter to this situation is --p-discard-untrimmed. Long story short, removing primers from my 16S sequences proceeded normally and I retained most of my reads for all my samples. For my ITS sequences I retained anywhere from 10-90% of sequences independent of cutadapt settings, i.e. there was a huge range of sequences retained regardless of any parameters.
Upon further investigation, I found that several sequences in my forward ITS samples did not contain primers; interestingly enough, the reverse sequences did. Just for clarification, there were no partial matches, substitutions, insertions, or deletions suggesting that the bases at the beginning of the affected reads were even remotely related to the primer.
The plot thickens. The primerless sequences contained the reverse complement of my reverse primer, were a portion of the ITS region I was sequencing, and corresponded (98-100% identity) to expected taxa. Thus, I have no reason to believe these sequences are contaminants.
Below is an alignment of the situation:
[Image: alignment screenshot]

Essentially, the primerless fragments I am getting begin downstream of my actual primer, and as previously mentioned, end at my reverse primer. I will also note that 'my example sequence' is 250 bp, as expected; if I had longer reads, I would expect the fragments to contain readthrough of my reverse primer, indices, and adapter sequences.
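
One way to put a number on how many R1 reads are "primerless" is to test the first bases of each read against the forward primer, expanding any IUPAC degenerate codes. The following is a minimal Python sketch under stated assumptions: the function names are hypothetical, the ITS3 sequence is the commonly cited one and should be checked against your own primer order, and the reads are invented toy data. It only accepts exact matches at the read start, so it is stricter than cutadapt, which tolerates errors.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def primer_regex(primer):
    """Compile a regex for the primer; use .match() to anchor at the read start."""
    return re.compile("".join(IUPAC[base] for base in primer.upper()))


def fraction_with_primer(reads, primer):
    """Return the fraction of reads whose first bases exactly match the primer."""
    pattern = primer_regex(primer)
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(1 for read in reads if pattern.match(read.upper()))
    return hits / len(reads)


# Assumed ITS3 sequence (verify against your own order sheet).
ITS3 = "GCATCGATGAAGAACGCAGC"

# Toy reads invented for illustration only.
toy_reads = [
    ITS3 + "GAAATGCGATACGTAATG",     # starts with the primer
    "GAAATGCGATACGTAATGTGAATTGCAG",  # starts downstream of the primer
]

print(fraction_with_primer(toy_reads, ITS3))
```

Running this per sample on the raw R1 reads would show whether the proportion of primerless reads tracks the proportion that cutadapt discards.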

Why??

Maybe PCR artifacts, but there is no portion of the forward primer and the distribution of my fragment lengths was as expected - no large proportion of sequences too large or too small.
They may be chimeras, but the sequence corresponds to an organism I expect to be there and is extremely similar to the reference sequence.
They may be an artifact of bridge amplification gone wrong, but the overall quality of the run was fine.
They may be true fragments but the read 1 sequencing primer misprimed? Seems extremely unlikely but plausible. Is this even possible? Has this happened to anyone else?


Any thoughts on this are more than welcome.

colinbrislawn · July 1, 2021, 4:51pm

Hello Jonathan,

This is a good mystery! I'm not sure I have a perfect explanation, but I'm happy to start the discussion before the trail goes cold.

Good!

Also good!

Not good!

So, for some reason, your R1 files contain the reverse complement of reads you would expect to see in R2? (Am I summarizing that correctly? Let me know if I'm missing an important detail!)

When you demultiplex with barcodes, do you notice that this happens more on some samples than others? I'm wondering if the wrong primers / adapters were added to the wrong wells, or something...

Can you tell me more about this protocol? Do you mean the regions are amplified first, then Illumina adapters are added?

Thanks!
Colin

jsogin574 · July 7, 2021, 2:05am

@colinbrislawn thank you for the investigative assistance

@colinbrislawn
“So, for some reason, your R1 files contain the reverse complement of reads you would expect to see in R2? (Am I summarizing that correctly? Let me know if I'm missing an important detail!)”

No. The R1 reads are in the correct orientation. The read starts about 20-30 bp downstream of where my forward primer is supposed to hybridize. The length of those reads is still 250 bp. They conveniently end at the end of my target, so they also contain a portion of the reverse complement of my reverse primer. At first I thought my primer could have misprimed downstream of the actual target, but the reads in question are primerless (at least the forward primer).
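
The check described above, that a forward read ends in the reverse complement of the reverse primer, can be sketched in Python as follows. This is an illustration under assumptions, not the method actually used in this thread: the function names and `min_overlap` parameter are hypothetical, the ITS4 sequence is the commonly cited one, and matching is by exact string comparison, so a degenerate primer would need a pattern-based search instead.

```python
# Translation table for complementing DNA, including IUPAC degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def locate_reverse_primer(read, reverse_primer, min_overlap=8):
    """Find where the reverse complement of the reverse primer starts in a read.

    Returns the 0-based start index of a full match, or of a partial match
    that runs off the 3' end of the read (at least min_overlap bases),
    or -1 if neither is found.
    """
    target = reverse_complement(reverse_primer)
    read = read.upper()
    pos = read.find(target)
    if pos != -1:
        return pos
    # The read may end partway through the primer, so try shorter prefixes.
    for k in range(len(target) - 1, min_overlap - 1, -1):
        if read.endswith(target[:k]):
            return len(read) - k
    return -1


# Assumed ITS4 sequence (verify against your own order sheet).
ITS4 = "TCCTCCGCTTATTGATATGC"

# Toy read invented for illustration: 10 bases of insert, then the first
# 12 bases of the reverse-complemented primer before the read runs out.
toy_read = "ACGTACGTAC" + reverse_complement(ITS4)[:12]
print(locate_reverse_primer(toy_read, ITS4))
```

A read whose match starts well before position 250 would indicate an amplicon shorter than the read length, which is the readthrough case mentioned earlier in the thread.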

@colinbrislawn
“When you demultiplex with barcodes, do you notice that this happens more on some samples than others? I'm wondering if the wrong primers / adapters were added to the wrong wells, or something...”

My sequencing facility demultiplexes the reads, so I can’t say for sure, but I don’t think so. Prior to adapter trimming and after demultiplexing by my sequencing facility, I have 150-300k reads per sample, which was expected based on the number of samples I was running. Thus, I don’t really think the indices are the cause of this. After running cutadapt (where I only look for my primer sequence and not the indices) with --p-discard-untrimmed, several of these samples drop to 10-40k reads, which is what initially prompted this question. Some samples lose more than others and it is not due to other cutadapt parameters as far as I can tell.
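
To see which samples are hit hardest, the per-sample read counts before and after primer trimming can be turned into retention fractions. A small Python sketch, where the function names, sample names, counts, and the 0.5 threshold are all hypothetical values chosen for illustration:

```python
def retention(before, after):
    """Return the fraction of reads retained per sample after trimming."""
    return {
        sample: after.get(sample, 0) / count
        for sample, count in before.items()
        if count > 0
    }


def low_retention(before, after, threshold=0.5):
    """Return sample names retaining less than the threshold, sorted by name."""
    fractions = retention(before, after)
    return sorted(s for s, f in fractions.items() if f < threshold)


# Hypothetical counts, in the same range as those described above.
reads_before = {"sample_1": 200_000, "sample_2": 300_000, "sample_3": 150_000}
reads_after = {"sample_1": 20_000, "sample_2": 270_000, "sample_3": 30_000}

for sample, fraction in retention(reads_before, reads_after).items():
    print(f"{sample}: {fraction:.0%} retained")

print(low_retention(reads_before, reads_after))
```

If retention varies widely between samples while the cutadapt settings stay fixed, the cause is more likely to lie in the primers or the library prep than in the trimming parameters.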

@colinbrislawn
“Can you tell me more about this protocol? Do you mean the regions are amplified first, then Illumina adapters are added?”

I do the 1st PCR with target specific primers that contain overhangs so that indices can be added in a 2nd PCR. I give the unbarcoded samples to my sequencing facility and they do the 2nd PCR.

Another possibility I’ve toyed with is that maybe barcoded mystery primers were introduced to my samples at the sequencing facility or on the flow cell, but I don’t want to throw my sequencing facility under the bus and such a thing has not happened before.

Still stumped…

colinbrislawn · July 7, 2021, 6:08pm

Hello again,

Thank you for the clarification! This is an important clue that overlaps with what I've seen with 16S primers from the Earth Microbiome Project and the special way they can be sequenced on the Illumina platform.

Here's another clue that could point to this special sequencing method.

When you review the EMP primers and sequencing protocol, you will notice both the primers used to make your amplicons by PCR and an extra set of sequencing primers used on the flow cell during sequencing.

Note the difference!

On a typical Illumina run, the forward sequencing primer anneals to the Illumina adapter and the read starts with the region you primed and continues into the hypervariable region.
But in the EMP protocol, the V4 16S primer is used again, and the read starts directly in your hypervariable region so that your reads never include the forward primer.

If the sequencing core is using the EMP protocol or something like it, you would not expect to see the primers inside of your reads.

Could this explain the missing forward primers? Did your sequencing core provide information about the sequencing primers they used?

Colin

jsogin574 · July 7, 2021, 9:21pm

@colinbrislawn

mystery solved... the primers that were used were ITS7ngs and ITS4ngs...

I reran cutadapt with the correct primers and am now retaining at least 90% of reads for most of my samples even with strict parameters.
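
When the primers actually used are in doubt, one quick sanity check is to score a few candidate primers against the starts of the raw reads and see which one matches most often. A rough Python sketch under stated assumptions: the function names are hypothetical, `primer_B` is a made-up placeholder sequence (not the real ITS7ngs), the reads are toy data, and only substitutions are counted, whereas cutadapt also allows indels by default.

```python
# Bases accepted by each IUPAC code.
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def mismatches(primer, read_start):
    """Count positions where the read base is not allowed by the primer code."""
    return sum(1 for p, b in zip(primer, read_start) if b not in IUPAC_SETS[p])


def rank_primers(reads, candidates, max_error_rate=0.1):
    """Rank candidate primers by the fraction of reads that start with them."""
    reads = [read.upper() for read in reads]
    scores = {}
    for name, primer in candidates.items():
        primer = primer.upper()
        allowed = int(max_error_rate * len(primer))
        hits = sum(
            1
            for read in reads
            if len(read) >= len(primer)
            and mismatches(primer, read[: len(primer)]) <= allowed
        )
        scores[name] = hits / len(reads) if reads else 0.0
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# primer_A is the commonly cited ITS3 sequence; primer_B is a placeholder.
candidates = {
    "primer_A": "GCATCGATGAAGAACGCAGC",
    "primer_B": "GTGARTCATCGAATCTTTG",
}

# Toy reads invented for illustration only.
toy_reads = [
    "GTGAATCATCGAATCTTTGAACGCACATTG",
    "GTGAGTCATCGAATCTTTGAACGCACATTG",
    "GTGAATCATCGAATCTTTGAACGCATATTG",
    "GCATCGATGAAGAACGCAGCGAAATGCG",
]

for name, fraction in rank_primers(toy_reads, candidates):
    print(f"{name}: {fraction:.0%} of reads")
```

A candidate that matches the large majority of reads in every sample is a good sign that it is the primer to pass to cutadapt.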

Thanks for your help!