Question about primers/cutadapt (reverse-complement)

Liviacmg · September 26, 2024, 3:04pm

Hey guys,

I have a primer, GGACTACHVGGGTWTCTAAT. I searched my sequences for the primer itself and for its reversed version (generated with rev in the Unix terminal), and every fastq file returned 0 matches for both. The searches were as follows:

Searching for the primer itself:
grep -c "GGACTAC[ACT][ACG]GGGT[AT]TCTAAT" *.fastq

Searching for the reverse (from the Unix terminal):
echo "GGACTACHVGGGTWTCTAAT" | rev
output: TAATCTWTGGGVHCATCAGG

grep -c "TAATCT[AT]TGGG[ACG][ACT]CATCAGG" *.fastq

But when I use this site, https://reverse-complement.com/, which was recommended to me in a previous post, it returns this sequence:

ATTAGAWACCCBDGTAGTCC

Substituting the wildcards (using the IUPAC codes):

grep -c "ATTAGA[AT]ACCC[CGT][AGT]GTAGTCC" *.fastq

When I search with this last one (the reverse complement from the website), some files return 0, some return a few matches, and some return a lot. Do you think I should run cutadapt to remove them, or, since the original primer sequence returned zero for ALL fastq files, should I not worry about them? They are single-end sequences.

Thank you in advance.

SoilRotifer · September 26, 2024, 4:59pm

Hi @Liviacmg,

Cool. The reverse complement of the reverse primer would appear at the 3' end of the read, if at all... If these reads are 250-300 bases long, then there is a small chance the primer could be detected in a few reads, though that is rare for this amplicon region.

If so, you can run cutadapt without the --p-discard-untrimmed flag, and use the --p-adapter flag, as this will explicitly search the 3' end of the read. This removes the reverse primer from the 3' end when it is detected and otherwise leaves the read as it is. You can then truncate the forward reads to a fixed length afterwards when denoising, etc.

# qiime cutadapt works on QIIME 2 artifacts (.qza), so the FASTQ data must be imported first
qiime cutadapt trim-single \
    --i-demultiplexed-sequences sample1_R1.qza \
    --p-adapter ATTAGAWACCCBDGTAGTCC \
    --o-trimmed-sequences sample1_R1_trim_rev_primer.qza
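
A side note on the two searches in the question: `rev` only reverses the string, whereas a reverse complement also swaps each base for its complement, which is why the website's sequence differs from the `rev` output. Below is a minimal sketch using standard Unix tools (`rev`, `tr`, `sed`). The function names `revcomp` and `iupac_to_regex` are made up for illustration and are not part of cutadapt or QIIME 2.

```shell
#!/usr/bin/env bash

# Reverse-complement a primer that may contain IUPAC ambiguity codes.
# `rev` reverses the string; `tr` then swaps each base for its complement
# (A<->T, C<->G, R<->Y, K<->M, B<->V, D<->H; S, W and N map to themselves).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTRYKMBDHVSWN' 'TGCAYRMKVHDBSWN'
}

# Expand IUPAC ambiguity codes into bracket expressions that grep understands.
# The inserted brackets only contain A, C, G and T, so later substitutions
# never touch text produced by earlier ones.
iupac_to_regex() {
    printf '%s\n' "$1" | sed \
        -e 's/R/[AG]/g'  -e 's/Y/[CT]/g'  -e 's/S/[CG]/g'  -e 's/W/[AT]/g' \
        -e 's/K/[GT]/g'  -e 's/M/[AC]/g'  -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
        -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

rc="$(revcomp GGACTACHVGGGTWTCTAAT)"
echo "$rc"                    # ATTAGAWACCCBDGTAGTCC
iupac_to_regex "$rc"          # ATTAGA[AT]ACCC[CGT][AGT]GTAGTCC
```

The resulting pattern can be passed straight to the same kind of search used above, e.g. `grep -c "$(iupac_to_regex "$rc")" *.fastq`.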

Liviacmg · September 26, 2024, 5:34pm

> Cool. The reverse complement of the reverse primer would appear at the 3' end of the read, if at all... If these reads are 250-300 bases long, then there is a small chance the primer could be detected in a few reads, though that is rare for this amplicon region.

The reads are between 61 and 151 bases long. How does that affect the detection of primers?

> If so, you can run cutadapt without the --p-discard-untrimmed flag, and use the --p-adapter flag, as this will explicitly search the 3' end of the read. This removes the reverse primer from the 3' end when it is detected and otherwise leaves the read as it is. You can then truncate the forward reads to a fixed length afterwards when denoising, etc.

So, if the reverse complement of the reverse primer appears, it means that it is probably at the 3' end of the read? And should I use the --p-adapter option every time this situation occurs? I ask because I'm facing the same situation in another dataset, with paired-end reads.

Thank you so much for the quick reply!

SoilRotifer · September 26, 2024, 6:08pm

Because the V4 amplicon is ~254 bases, a single read of 151 bases is not long enough to reach the other end, where the reverse primer is located. Thus, in this case it should never be detected. If it is detected, then it is likely a spurious sequence that will be removed by the other QA/QC steps you perform downstream anyway.

Correct.
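
One way to check this on real data is to look at where in each read the pattern matches, relative to the read length. The sketch below assumes plain, uncompressed 4-line FASTQ records. The function name `primer_positions` is hypothetical and the example file name is only a placeholder.

```shell
#!/usr/bin/env bash

# For every read that contains the pattern, print three columns:
#   read_length  match_start  bases_after_match
# Assumes plain 4-line FASTQ records (the sequence is line 2 of each record).
primer_positions() {
    local fastq="$1" pattern="$2"
    awk -v pat="$pattern" '
        NR % 4 == 2 && match($0, pat) {
            # RSTART and RLENGTH are set by match(); both are 1-based counts
            print length($0), RSTART, length($0) - (RSTART + RLENGTH - 1)
        }
    ' "$fastq"
}

# Example call (placeholder file name):
# primer_positions sample1_R1.fastq 'ATTAGA[AT]ACCC[CGT][AGT]GTAGTCC' | sort -n | uniq -c
```

If the third column is small, the match sits at or near the 3' end of the read. With reads of at most 151 bases from a ~254-base amplicon, a genuine read-through should not occur, so any hits are worth a closer look before deciding whether they matter.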

Liviacmg · September 26, 2024, 6:49pm

> Because the V4 amplicon is ~254 bases, a single read of 151 bases is not long enough to reach the other end, where the reverse primer is located. Thus, in this case it should never be detected. If it is detected, then it is likely a spurious sequence that will be removed by the other QA/QC steps you perform downstream anyway.

Oohh, I didn't know that! Thank you SO much!

Liviacmg · September 26, 2024, 10:56pm

@SoilRotifer ,

Last few questions: does this also apply when primers are detected only in the forward reads, but the sequence is 242 base pairs (also not long enough)?

Are these sizes universal, or do they vary by platform/provider? For example, as this website says: 16S Reference | Omega Bioservices?

SoilRotifer · September 27, 2024, 12:56am

Depending on the sequencing protocol used, you either sequence through the primer or you do not. For example, the EMP protocol does not sequence through the primer. In that case you will not have any primers in your sequences, unless the Illumina adapters and sequencing primers were not properly removed before you received the data.

With protocols that do sequence through the primer, the resulting data will include the primers as part of your sequences.

There is some minor length variation, but the sizes are usually in the same ballpark. This is especially true for Illumina sequence data due to how the sequencing process works on that platform. Usually the length is reported as the total PCR amplicon length, that is, including the primer sequences. When you remove the primers (e.g. with cutadapt), the amplicons will be ~30-50 bases shorter.
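
As a rough illustration of that arithmetic, the sketch below subtracts both primer lengths from a total amplicon length. Only the reverse primer comes from this thread. The forward primer and the total length of 293 are assumptions chosen for the example, and the function name `trimmed_length` is hypothetical.

```shell
#!/usr/bin/env bash

# Expected amplicon length once both primers have been trimmed off.
trimmed_length() {
    local total="$1" fwd_p="$2" rev_p="$3"
    # ${#var} is the string length, i.e. the number of bases in the primer
    echo $(( total - ${#fwd_p} - ${#rev_p} ))
}

rev_primer="GGACTACHVGGGTWTCTAAT"   # primer from this thread (20 bases)
fwd_primer="GTGYCAGCMGCCGCGGTAA"    # ASSUMPTION: a 515F-style forward primer (19 bases)
total_len=293                       # ASSUMPTION: illustrative length including both primers

trimmed_length "$total_len" "$fwd_primer" "$rev_primer"   # prints 254
```

Here the two primers account for 39 bases, which falls inside the ~30-50 base range mentioned above. Substitute the primers and reported length for your own protocol.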
