I have a primer, GGACTACHVGGGTWTCTAAT. I searched my sequences for the primer itself and for its reverse (generated with rev in the Unix terminal), and all of my fastq files returned 0 matches for both. The searches were as follows:
Searching for the primer itself:
grep -c "GGACTAC[ACT][ACG]GGGT[AT]TCTAAT" *.fastq
Searching for the reverse (from the Unix terminal):
echo "GGACTACHVGGGTWTCTAAT" | rev
output: TAATCTWTGGGVHCATCAGG
grep -c "TAATCT[AT]TGGG[ACG][ACT]CATCAGG" *.fastq
But when I use https://reverse-complement.com/, which was recommended to me in a previous post, it returned this sequence:
When I search with this last one (the reverse complement from the website), some files returned 0 matches, some returned a few, and some returned a lot. Do you think I should run cutadapt to remove them, or, since the original sequence returned zero matches for ALL fastq files, should I not worry about them? They are single-end sequences.
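For reference, `rev` only reverses the string; it does not complement the bases, which is why its output differs from what a reverse-complement tool returns. Here is a minimal sketch of doing both steps in the terminal, assuming `rev`, `tr`, and `sed` are available. The function names `revcomp` and `iupac_to_regex` are made up for this illustration.

```shell
# Reverse-complement a primer. The two tr sets pair each IUPAC code with its
# complement (e.g. H <-> D, V <-> B, while W and S stay the same).
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTRYSWKMBDHVN' 'TGCAYRSWMKVHDBN'
}

# Expand IUPAC ambiguity codes into bracket expressions that grep understands.
iupac_to_regex() {
  printf '%s\n' "$1" | sed \
    -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
    -e 's/K/[GT]/g' -e 's/M/[AC]/g' -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
    -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

primer="GGACTACHVGGGTWTCTAAT"
rc=$(revcomp "$primer")
pattern=$(iupac_to_regex "$rc")

echo "$rc"        # ATTAGAWACCCBDGTAGTCC
echo "$pattern"   # ATTAGA[AT]ACCC[CGT][AGT]GTAGTCC

# The pattern can then be counted per file, as in the searches above:
# grep -c "$pattern" *.fastq
```

Note that `grep -c` counts matching lines, so each read containing the pattern is counted once.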
Cool. The reverse complement of the reverse primer would appear at the 3' end of the read, if at all... If these reads are between 250 and 300 bases long, then there is a small chance these primers could be detected in a few reads, though that is rare for this amplicon region.
If so, you can run cutadapt without the --discard-untrimmed flag, and use the --p-adapter flag, as this will explicitly search the 3' end of the read. This will only remove the reverse primer from the 3' end when it is detected; otherwise it leaves the read as it is. Then you can truncate the forward reads to a fixed length afterwards, when denoising, etc.
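As a rough sketch of that suggestion with standalone cutadapt: its `-a` option is the 3' adapter option, and `--p-adapter` is the name the QIIME 2 cutadapt plugin gives the same setting. The file names below are placeholders, not files from this thread, and the primer is the reverse complement of GGACTACHVGGGTWTCTAAT.

```shell
# Placeholder file names; replace them with your own.
in_fastq="sample.fastq"
out_fastq="sample.trimmed.fastq"

# Reverse complement of the reverse primer GGACTACHVGGGTWTCTAAT.
# cutadapt treats IUPAC codes in the adapter as wildcards by default.
primer_rc="ATTAGAWACCCBDGTAGTCC"

# -a trims a 3' adapter: the match and everything after it are removed.
# --discard-untrimmed is left out, so reads without a match are kept unchanged.
if command -v cutadapt >/dev/null 2>&1 && [ -f "$in_fastq" ]; then
  cutadapt -a "$primer_rc" -o "$out_fastq" "$in_fastq"
else
  echo "skipped: cutadapt or $in_fastq not found"
fi

# Roughly equivalent QIIME 2 call (artifact names are placeholders):
# qiime cutadapt trim-single \
#   --i-demultiplexed-sequences demux.qza \
#   --p-adapter "$primer_rc" \
#   --o-trimmed-sequences trimmed.qza
```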
The reads are between 61-151 bases. How does it affect the detection of primers?
So, if the reverse complement of the reverse primer appears, does that mean it is probably at the 3' end of the read? And should I use the --p-adapter option every time this situation occurs? I ask because I'm facing the same situation in another dataset, with paired-end reads.
Because the V4 amplicon is ~254 bases, a single read of 151 bases is not long enough to reach the other end, where the reverse primer is located. Thus, in this case it should never be detected. If it is detected, then it is likely a spurious sequence that will probably be removed by the other QA/QC steps you perform downstream anyway.
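The arithmetic behind this can be sketched as below. It assumes, for illustration only, that the reverse primer site begins about 254 bases from the start of the read; the exact offset depends on the protocol and on whether the forward primer is part of the read. The helper name `primer_bases_covered` is made up for this example.

```shell
# How many bases of the reverse primer site a read of a given length can cover.
# Arguments: read length, offset where the primer site starts, primer length.
primer_bases_covered() {
  covered=$(( $1 - $2 ))
  if [ "$covered" -lt 0 ]; then covered=0; fi
  if [ "$covered" -gt "$3" ]; then covered=$3; fi
  echo "$covered"
}

primer="GGACTACHVGGGTWTCTAAT"
primer_len=${#primer}   # 20 bases
primer_start=254        # assumed offset of the reverse primer site (see above)

primer_bases_covered 151 "$primer_start" "$primer_len"   # prints 0
primer_bases_covered 260 "$primer_start" "$primer_len"   # prints 6
primer_bases_covered 300 "$primer_start" "$primer_len"   # prints 20
```

Under these assumptions a 151-base read covers none of the primer site, while reads of 250 to 300 bases can cover part or all of it.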
Last few questions: does this also apply when primers are detected only in the forward reads and the sequence is 242 base pairs (also not long enough)?
Are these sizes universal, or do they vary according to platform/distributor? For example, as this website says: 16S Reference | Omega Bioservices?
Depending on the sequencing protocol used, you either sequence through the primer or you do not. For example, the EMP protocol does not sequence through the primer, so unless the Illumina adapters and sequencing primers were not properly removed before you received the data, you will not have any primers in your sequences.
With protocols that do sequence through the primer, the resulting data will include the primers as part of your sequences.
There is some minor length variation, but the lengths are usually in the same ballpark. This is especially true for Illumina sequence data, due to how the sequencing process works on that platform. Usually the length is reported as the total PCR amplicon length, that is, including the primer sequences. When you remove the primers (e.g. with cutadapt), the amplicons will be ~30-50 bases shorter.