I'm working on an analysis and found this post detailing how to look for primers in the raw reads to determine if cutadapt should be run. I ran these commands and got the resulting primer counts:
I've never experienced finding primers in only one of the read pairs. And this is 96% of the reverse reads having primers (much, much more than the 25% cutoff mentioned in the previous post), but none of the forward reads. I was wondering if this was something I should be concerned about and if not, whether using cutadapt to remove the reverse primer only would cause issues.
I added another zgrep that replaces the N with any character, and now the number of primers in the forward reads is basically the same as in the reverse reads (whose primer doesn't have any N's).
In the vocabulary of primer design, N is any Nucleotide, but grep does not know this and searched for a literal N. Once we translate this to the vocab of regular expressions (. for any character) it works as intended!
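To make the translation concrete, here is a minimal Python sketch of the same idea. The primer and reads are made up for illustration (they are not the ones from this thread), and `primer_to_regex` is a hypothetical helper name. It maps N to `[ACGTN]` rather than `.`, which is slightly stricter than the zgrep version but still matches any base call.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGTN]",  # any base, including an uncalled N in the read
}

def primer_to_regex(primer):
    """Translate a primer written with IUPAC codes into a regex string."""
    return "".join(IUPAC[base] for base in primer.upper())

# Hypothetical primer and reads, for illustration only.
primer = "ACGTNNGGT"
reads = ["ACGTAAGGTTTT", "ACGTCGGGTAAA", "TTTTTTTTTTTT"]

# Searching for the primer as typed looks for a literal "NN" in the read.
literal_hits = sum(1 for r in reads if re.search(primer, r))

# Searching with the translated pattern lets N stand for any base.
pattern = primer_to_regex(primer)
wildcard_hits = sum(1 for r in reads if re.search(pattern, r))

print(pattern)        # ACGT[ACGTN][ACGTN]GGT
print(literal_hits)   # 0
print(wildcard_hits)  # 2
```

The literal search finds nothing because real reads almost never contain the letter N at exactly those positions, which is the same reason the forward primer count came out as zero.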
I had a quick follow-up question about this. When I first ran the zgrep commands I had a ^ at the beginning to tell the regex to only look at the beginning of reads. Your examples did not include this, and I was wondering if using the ^ would be more accurate to search for primers (since they would be at the beginning of the read)? I do get fewer hits when I run with the caret (as expected). Is it acceptable to limit the search to the beginning of the read, or does the caret throw something off/skew something?
It's totally acceptable! You can choose how to select the primers you want.
Because matching from the start of the read is more specific / restrictive, it will miss reads that have extra characters before the primer, just like you observed. When I made my suggestions, I was hoping to find more results, so I went with the most general / least restrictive regular expression by dropping the ^ (caret).
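As a rough illustration of that trade-off, the sketch below counts matches with and without the caret. The pattern and reads are invented for the example, and `count_hits` is a hypothetical helper, not part of any tool mentioned here. With a real FASTQ file you would want to apply this to the sequence lines only.

```python
import re

def count_hits(reads, pattern, anchored=False):
    """Count reads containing the pattern, optionally only at the read start."""
    regex = re.compile(("^" if anchored else "") + pattern)
    return sum(1 for read in reads if regex.search(read))

# Hypothetical primer pattern (N already translated to ".") and reads.
pattern = "ACGT..GGT"
reads = [
    "ACGTAAGGTCCCC",  # primer right at the start of the read
    "GACGTAAGGTCCC",  # one extra base before the primer
    "CCCCCCCCCCCCC",  # no primer at all
]

anywhere = count_hits(reads, pattern)                 # unanchored search
at_start = count_hits(reads, pattern, anchored=True)  # search with ^

print(anywhere)  # 2
print(at_start)  # 1
```

The gap between the two counts is the number of reads where the primer is present but shifted, so comparing them on your own data shows how much the caret is actually excluding.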
I guess this opens a bigger question about specific vs. accurate primer searches:
Would the more specific regex search be more accurate, or do we expect some reads to have extra base pairs before the primer, making the more specific regex less accurate because it misses those reads?