How do I check adapter and primer presence in sequences?

Cele_Blua · July 10, 2023, 9:18am

Hi everyone! First of all, I'd like to thank you for this wonderful work. I'm a new QIIME 2 user and this forum is awesome.
I have raw amplicon sequence data for the V3-V4 region of the 16S rRNA gene (primers 341F and 805R), obtained on an Illumina MiSeq. I need to know whether these sequences still contain the primers and adapters. Is there any way to at least infer it? I have no contact with the sequencing service provider, but I know the primer and adapter sequences.
Primers:
341F (Illumina, V3-V4 forward): 5' CCTACGGGNGGCWGCAG 3'
805R (Illumina, V3-V4 reverse): 5' GACTACHVGGGTATCTAATCC 3'

I've explored some options:
1- Check with FastQC: I got some overrepresented sequences, but I don't know how to interpret the result.
2- BLAST the primers and adapters against one or two samples, but I don't know how to do this.

Thank you so much for your help!!

SoilRotifer · July 10, 2023, 1:59pm

Hi @Cele_Blua,

It depends on the sequencing strategy used by the provider. If you are unable to contact the provider, then I'd suggest running the cutadapt plugin with the --verbose option to see if it finds your primers.

To start I'd recommend the following command:

qiime cutadapt trim-paired \
    --i-demultiplexed-sequences demux-pe.qza \
    --p-front-f CCTACGGGNGGCWGCAG \
    --p-front-r GACTACHVGGGTATCTAATCC \
    --p-discard-untrimmed \
    --o-trimmed-sequences demux-pe-trimmed.qza \
    --verbose
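
If you'd like a quick sanity check alongside cutadapt, here is a minimal sketch that counts how many reads in a plain FASTQ stream begin with a given primer. It is not part of QIIME 2 or cutadapt: the function names `iupac_to_regex` and `count_primer_starts` are made up for illustration, and it assumes four-line FASTQ records and exact matching (cutadapt, by contrast, tolerates some mismatches by default).

```shell
#!/usr/bin/env bash

# Expand IUPAC degenerate bases in a primer into regex character classes.
# (Hypothetical helper name; the replacements only introduce A/C/G/T and
# brackets, so applying them one after another is safe.)
iupac_to_regex() {
    printf '%s\n' "$1" | sed \
        -e 's/R/[AG]/g'  -e 's/Y/[CT]/g'  -e 's/S/[CG]/g' \
        -e 's/W/[AT]/g'  -e 's/K/[GT]/g'  -e 's/M/[AC]/g' \
        -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' -e 's/H/[ACT]/g' \
        -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Read FASTQ on stdin and print how many reads START with the primer.
# Assumes 4 lines per record, so sequence lines are lines 2, 6, 10, ...
count_primer_starts() {
    local primer_regex
    primer_regex="$(iupac_to_regex "$1")"
    # grep -c exits non-zero when the count is 0; "|| true" keeps that quiet.
    awk 'NR % 4 == 2' | grep -c -E "^${primer_regex}" || true
}
```

For example, `gzip -dc sample_R1.fastq.gz | count_primer_starts CCTACGGGNGGCWGCAG` (the file name is hypothetical) prints the number of R1 reads that begin with 341F. Because the match is anchored to the first base, a barcode or spacer in front of the primer would give a count near zero even though the primer is present; dropping the `^` finds it anywhere in the read.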

Cele_Blua · July 12, 2023, 2:13pm

Hi! It's me again. I used this command to trim the primers (I used trim-single because I've decided to work with forward sequences only):
qiime cutadapt trim-single \
--i-demultiplexed-sequences ./demux_seqs.qza \
--p-front CCTACGGGNGGCWGCAG \
--p-discard-untrimmed \
--o-trimmed-sequences ./demux_seqs-trimmed.qza \
--verbose

And received this message:
== Read fate breakdown ==
Reads that were too short: 0 (0.0%)
Reads discarded as untrimmed: 50,774 (56.9%)
Reads written (passing filters): 38,497 (43.1%)

Discarding more than 50% of the reads seems like too much to me.
Does that mean my sequences do not actually contain the primers?
I think this is probably related to the --p-front parameter; maybe I'm trimming sequences that are not really primers.
For this analysis, I think I will opt for using trunc length in the next trimming step, to remove the initial (let's say) 10 bp that showed low quality scores. For the moment!

What could be happening? Thank you so much.

SoilRotifer · July 12, 2023, 6:19pm

Potentially.

If you look at the verbose output, you'll see a list of numbers. These indicate where in the sequence many of the primers were trimmed. If they are not regularly being trimmed from the 5' end, then that is a good indication that the primers are already trimmed.
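
As a rough cross-check outside cutadapt, the sketch below tallies where a primer first occurs in each read. It is only an illustration: `primer_offset_histogram` is a made-up name, the primer is passed as a regular expression with the IUPAC codes expanded by hand (N as `[ACGT]`, W as `[AT]`), matching is exact, and the input is assumed to be four-line FASTQ on stdin.

```shell
#!/usr/bin/env bash

# Print "offset<TAB>count" for the first occurrence of a primer regex in
# each read (0 = primer starts at the first base; "none" = not found).
# Hypothetical helper; assumes 4-line FASTQ records on stdin.
primer_offset_histogram() {
    awk -v re="$1" '
        NR % 4 == 2 {
            # match() sets RSTART to the 1-based start of the first match.
            key = match($0, re) ? RSTART - 1 : "none"
            count[key]++
        }
        END { for (k in count) printf "%s\t%d\n", k, count[k] }
    ' | LC_ALL=C sort
}
```

A call might look like `gzip -dc sample_R1.fastq.gz | primer_offset_histogram 'CCTACGGG[ACGT]GGC[AT]GCAG'` (hypothetical file name). Most reads at offset 0 would suggest the primer sits at the very start of the read; a single dominant non-zero offset would suggest a fixed-length barcode or spacer in front of it; mostly "none" would be consistent with primers that are absent or already trimmed.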

Alternatively, your sequencing company may be sequencing with a mixed-orientation protocol. That is, not all the reads in R1 start with the forward primer; some start with the reverse primer instead. You can test this by re-running with both primers, like so:

--p-front CCTACGGGNGGCWGCAG GACTACHVGGGTATCTAATCC

If your trimming summary improves, then your data is in mixed orientation, meaning the forward and reverse reads are mixed between your R1 and R2 files. You'll then need to find a way to get these reads into the same orientation.

If nothing improves, then it is likely that the sequencing protocol used does not read through the primer. Your sequencing company should be able to provide you with the details on this.
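
The mixed-orientation test can also be sketched directly on a FASTQ file, independent of cutadapt. This is an illustrative sketch only: `tally_read_starts` is a made-up name, the primers are given as regular expressions with IUPAC codes expanded by hand, matching is exact, and the 40-base search window is an arbitrary assumption meant to allow for a short barcode in front of the primer.

```shell
#!/usr/bin/env bash

# Classify each read by which primer appears near its 5' end.
# Arguments: forward-primer regex, reverse-primer regex, optional window size.
# Hypothetical helper; assumes 4-line FASTQ records on stdin.
tally_read_starts() {
    local fwd_re="$1" rev_re="$2" window="${3:-40}"
    awk -v f="$fwd_re" -v r="$rev_re" -v w="$window" '
        NR % 4 == 2 {
            head = substr($0, 1, w)      # only look near the start of the read
            if (head ~ f)      n["fwd"]++
            else if (head ~ r) n["rev"]++
            else               n["none"]++
        }
        END { printf "fwd=%d rev=%d none=%d\n", n["fwd"], n["rev"], n["none"] }
    '
}
```

For instance, `gzip -dc sample_R1.fastq.gz | tally_read_starts 'CCTACGGG[ACGT]GGC[AT]GCAG' 'GACTAC[ACT][ACG]GGGTATCTAATCC'` (hypothetical file name). Roughly equal `fwd` and `rev` counts in R1 would point to mixed orientation, while almost all `fwd` would point to a conventional layout.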

Cele_Blua · July 16, 2023, 8:43pm

I've tried this and found:
1- Using the forward primer only [--p-front CCTACGGGNGGCWGCAG], the verbose output was:
=== Summary ===
Total reads processed: 89,271
Reads with adapters: 38,497 (43.1%)

== Read fate breakdown ==
Reads that were too short: 0 (0.0%)
Reads discarded as untrimmed: 50,774 (56.9%)
Reads written (passing filters): 38,497 (43.1%)

Total basepairs processed: 26,752,843 bp
Quality-trimmed: 0 bp (0.0%)
Total written (filtered): 10,608,601 bp (39.7%)

=== Adapter 1 ===
Sequence: CCTACGGGNGGCWGCAG; Type: regular 5'; Length: 17; Trimmed: 38497 times
Minimum overlap: 3

2- Using the reverse and forward primers [--p-front CCTACGGGNGGCWGCAG GACTACHVGGGTATCTAATCC], the verbose output was:
=== Summary ===

Total reads processed: 89,271
Reads with adapters: 74,704 (83.7%)

== Read fate breakdown ==
Reads that were too short: 0 (0.0%)
Reads discarded as untrimmed: 14,567 (16.3%)
Reads written (passing filters): 74,704 (83.7%)

Total basepairs processed: 26,752,843 bp
Quality-trimmed: 0 bp (0.0%)
Total written (filtered): 20,758,379 bp (77.6%)

=== Adapter 1 ===
Sequence: CCTACGGGNGGCWGCAG; Type: regular 5'; Length: 17; Trimmed: 38470 times
[...]
=== Adapter 2 ===
Sequence: GACTACHVGGGTATCTAATCC; Type: regular 5'; Length: 21; Trimmed: 36234 times

Based on this, would it be correct to think that the company used a mixed sequencing protocol? I understand that this is the case, so here I would start a new adventure. Hope not!! haha
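
As a quick arithmetic check on the counts reported above, the two per-adapter trim counts can be compared against the total. The helper name `primer_share` is made up for illustration; the three numbers passed to it are taken from the cutadapt summary in this post.

```shell
#!/usr/bin/env bash

# Print each primer's share of the total reads and their combined share.
# Arguments: total reads, reads trimmed by forward primer, by reverse primer.
primer_share() {
    awk -v t="$1" -v f="$2" -v r="$3" 'BEGIN {
        printf "fwd=%.1f%% rev=%.1f%% combined=%d (%.1f%%)\n",
               100 * f / t, 100 * r / t, f + r, 100 * (f + r) / t
    }'
}

primer_share 89271 38470 36234
# prints: fwd=43.1% rev=40.6% combined=74704 (83.7%)
```

The two counts add up exactly to the 74,704 reads with adapters, and the split between them is close to even, which is the pattern one would expect if R1 holds a mix of forward-primed and reverse-primed reads.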

Cele_Blua · July 16, 2023, 9:08pm

I now think that Mr DNA was the company that sequenced these amplicons. I've found this mapping file in a folder: mapping.txt (198 Bytes)

Moreover, I've analyzed the raw sequences with FastQC and got these results (for the forward reads only).

So I guess I've got a short barcode attached to the forward primer and mixed-orientation sequences in my sample... is that right? That might explain some problems I ran into later.

Update: I've tried my best to follow the suggestions in this discussion.

I've now re-imported the data as demultiplexed paired-end sequences. Then I applied cutadapt, hopefully without trouble, by running this command:

qiime cutadapt trim-paired \
--i-demultiplexed-sequences demux_seqs.qza \
--p-front-f CAGTTCATCCTACGGGNGGCWGCAG GACTACHVGGGTATCTAATCC \
--p-front-r GACTACHVGGGTATCTAATCC CAGTTCATCCTACGGGNGGCWGCAG \
--p-match-read-wildcards \
--p-match-adapter-wildcards \
--p-discard-untrimmed \
--o-trimmed-sequences demux_seqs-trimmed.qza \
--verbose > cutadapt-log-2.txt

Then, I don't understand which parameters I should enter in the next step to orient the output from cutadapt:

qiime rescript orient-seqs \
--i-sequences demux_seqs-trimmed.qza \
--i-reference-sequences reference-sequences.qza \
--o-oriented-seqs oriented-query-sequences.qza \
--o-unmatched-seqs unmatched-sequences.qza

My questions about each parameter:
--i-sequences: the input will be my demultiplexed sequences trimmed by cutadapt.
--i-reference-sequences: which should my reference be?
--o-oriented-seqs: should I find forward or reverse reads here?
--o-unmatched-seqs: what should I find in this output?

Thanks a lot for your help!!

By the way, I don't understand why I have to reorient my sequences. I'm working on the assumption that forward and reverse reads do not overlap (I've got a ~460 bp fragment sequenced on an Illumina MiSeq). Thus, I guess that forward reads cannot contain the reverse primer (even in the opposite orientation) because the polymerase couldn't go that far. Is that right?

Update 2: I've tried my best with cutadapt by following this pipeline:
https://cutadapt.readthedocs.io/en/stable/guide.html#demultiplexing-paired-end-reads-in-mixed-orientation
So I guess my forward reads correspond to the 'round1_R1.fastq' and 'round2_R2.fastq' files. My next step will be to import these files into QIIME 2 as multiplexed data, so that they are merged into the same file. That should finally give me my (really) forward reads file. Then I would follow the single-end pipeline I've been trying to apply since I started my analysis haha.
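
One simple way to get a single forward-oriented file before import might be plain concatenation, sketched below. This is only a sketch under assumptions: the file names are the ones mentioned above, the files are uncompressed four-line FASTQ, and `combine_oriented` is a made-up helper name rather than a cutadapt or QIIME 2 command.

```shell
#!/usr/bin/env bash

# Concatenate two FASTQ files into one and print the number of records
# written. Arguments: first input, second input, output path.
# Hypothetical helper; assumes uncompressed 4-line FASTQ records.
combine_oriented() {
    local first="$1" second="$2" out="$3"
    cat "$first" "$second" > "$out"
    # Four lines per record, so line count / 4 = number of reads.
    awk 'END { print NR / 4 }' "$out"
}
```

For example, `combine_oriented round1_R1.fastq round2_R2.fastq forward_oriented.fastq` (the output name is arbitrary). If the reverse reads are needed later, the matching call would presumably combine round1_R2.fastq with round2_R1.fastq in the same order, so that read pairs stay in sync.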

Am I heading in the right direction?

Sorry about the number of replies (I'm trying to work on this, not just asking). I hope this will help someone with the same issues!!

SoilRotifer · July 17, 2023, 12:46pm

Hi @Cele_Blua,

Good find! Yep, this will get you there.

Looks like you are on the right track to me.
-Mike
