Primers in reverse reads only?

Hello,

I'm working on an analysis and found this post detailing how to look for primers in the raw reads to determine if cutadapt should be run. I ran these commands and got the resulting primer counts:

#Count primers in forward reads
zgrep "^CCTACGGGNGGC[AT]GCAG" fastqs/Day-4-CCL28-529L-1*R1*.gz | wc -l
0
#Count primers in reverse reads
zgrep "^GACTAC[ACT][ACG]GGGTATCTAATCC" fastqs/Day-4-CCL28-529L-1*R2*.gz | wc -l
206283
#Count total reverse reads
zgrep "@M" fastqs/Day-4-CCL28-529L-1*R2*.gz | wc -l
214168

I've never experienced finding primers in only one of the read pairs. And this is 96% of the reverse reads having primers (much, much more than the 25% cutoff mentioned in the previous post), but none of the forward reads. I was wondering if this was something I should be concerned about and if not, whether using cutadapt to remove the reverse primer only would cause issues.

Thanks!
Samantha

Hello Samantha,

Yeah, that is a little surprising, but not impossible. However, I'm guessing the issue is with the grep search.

For the forward reads, only exact matches (with an A or T in the middle) would be returned, so a small change would throw it off.

Matched: CCTACGGGNGGCTGCAGAAAAAAAAAAAAAA
Missed:  CCTACGGGNGGCTGCACAAAAAAAAAAAAAA
Missed:   CTACGGGNGGCTGCAGAAAAAAAAAAAAAA
Missed: CCCTACGGGNGGCTGCAGAAAAAAAAAAAAAA

Try loosening up that zgrep search so it can capture small deviations in the expected primer:

zgrep "CCTACGGGNGGC[AT]GCAG"
zgrep "CCTACGGGNGGC[AT]"
zgrep "GGGNGGC[AT]GCAG"

Let us know what you find,
Colin

P.S. I've moved this to Other Bioinformatics Tools as it's related to grep / zgrep :+1:

Hi Colin,

Thanks for the reply. I tried the zgrep commands you recommend for the forward reads and all are still zero:

zgrep "CCTACGGGNGGC[AT]GCAG" fastqs/Day-4-CCL28-529L-1*R1*.gz | wc -l
0
zgrep "CCTACGGGNGGC[AT]" fastqs/Day-4-CCL28-529L-1*R1*.gz | wc -l
0
zgrep "GGGNGGC[AT]GCAG" fastqs/Day-4-CCL28-529L-1*R1*.gz | wc -l
0

I added another zgrep that replaces the N with any character and now the number of primers in reverse reads is basically the same as reverse (which doesn't have any N's).

zgrep "CCTACGGG.GGC[AT]GCAG" fastqs/Day-4-CCL28-529L-1*R1*.gz | wc -l
208350

Thanks for your help, it appears that zgrep was breaking because it couldn't find an "N" exactly.

P.S. I've moved this to Other Bioinformatics Tools as it's related to grep / zgrep :+1:

Sorry about that, I was thinking in terms of the reads themselves, not the tool!

Thanks,
Samantha

OK, I'm glad we figured it out!

For future users, here's what happened:

In the vocabulary of primer design, N is any Nucleotide, but grep does not know this and searched for a literal N. Once we translate this to the vocab of regular expressions (. for any character) it works as intended!

1 Like

Hi Colin,

I had a quick follow-up question about this. When I first ran the zgrep commands I had a ^ at the beginning to tell the regex to only look at the beginning of reads. Your examples did not include this, and I was wondering if using the ^ would be more accurate to search for primers (since they would be at the beginning of the read)? I do get less counts when I run with the caret (as expected). Is it acceptable to limit the search to the beginning of the read or does the caret throw something off/skew something?

Thanks,
Samantha

It's totally acceptable! You can choose how to select the primers you want.

Because matching from the start of the read is more specific / restrictive, it will miss results with extra characters before it, just like you observed. When I made my suggestions, I was hoping to find more results, so I went with the most general / least restrictive regular expression by dropping the ^carrot.


I guess this opens a bigger questions about specific vs accurate primers:

Would the more specific regex search be more accurate, or do we expect some reads to have extra basepairs before the primer leading to this more specific regex being less accurate because it misses some reads?

:thinking: