Huge difference of DADA2 output with and without removing primers

Hi Community,

I would like to gain some advices on the use of cutadapt and DADA2. I read several similar posts on the forum but my situation seems a little bit different from the previous discussion. (Qiime version: amplicon-2025.7)

I accidentally process raw sequence of a metabarcoding dataset (COI region, amplicon length ~200bp; sequencing done on Illumina pair-end PE150) without removing primers. The subsequent analysis are smooth, until I realize I haven’t remove the primers; then I repeat the process from the beginning.

After removing primers using cutadapt, nearly all of the sequences were filtered in DADA2. Subsequent analysis is not possible.

Here is the script used in cutadapt:

qiime cutadapt trim-paired \
--i-demultiplexed-sequences subset1.qza \
--p-front-f GGWACWGGATGAACWRTHTAYCCHCC \. # Forward primer sequence from 5’-3’
--p-front-r GTRATDGCWCCDGCWARWACWGG \. # Reverse primer sequence from 5’-’3
--p-cores 10 \
--p-match-adapter-wildcards \
--o-trimmed-sequences subset1_trimmed.qza

I compare the summary demultiplexed sequence counts, forward/reverse counts of .qza with and without removing primers, they look identical. Hence, I think the problem isn’t associated with cutadapt.

Here attach the summary (.qzv) as reference, where primers were trimmed in “subset1_trimmed.qzv"

subset1_trimmed.qzv (348.2 KB)

subset1.qzv (332.2 KB)

Since the quality of bases look good, I haven’t trimmed any bases in DADA2, here is the script:

qiime dada2 denoise-paired \
--i-demultiplexed-seqs subset1.qza \
--p-trim-left-f 0 \
--p-trim-left-r 0 \
--p-trunc-len-f 150 \
--p-trunc-len-r 150 \
--p-n-threads 10 \
--o-representative-sequences rep-seq_subset1.qza \
--o-table feature-table_subset1.qza \
--o-denoising-stats stats_subset1.qza \

And here is the output of DADA2 with and without removing primers:

stats_subset1.qzv (1.2 MB)

stats_subset1_trimmed.qzv (1.2 MB)

Although in stat_subset1.qzv, most of the sequences were filtered, the remain sequences look having high quality, subsequent analysis were done smoothly with reasonable outcome in taxonomy analysis. However, in stat_subset1_trimmed.qzv, almost no sequences were left.

I tired adjusting the parameters in --p-trunc-len-f and --p-trunc-len-r; but seems it wasn’t the source of problem. May I have some advice on the step? Would trimming primers after DADA2 an feasible option for accurate taxonomy assignment?

Any help are highly appreciated. Thanks.

1 Like

Hello,

Thank you for detailed description!

I think your issue is because you sequenced your library with 150 x 2

And then in dada2 you provide

After primer removal, your reads are shorter than 150, and they just get discarded at filtering step

Consider changing the truncation parameters to account for the shorter length of reads without primers by providing lower values or disabling it, as well as minimum overlap to make sure that reads still merge (decrease if needed).

Hope that helps.

Best,

3 Likes

Thank you for your prompt reply. Glad that it is a parameter issue.

I tried to ajust the –p-trunc-len parameter a bit, I try somewhere around 125, but yield the same outcome.

I am a little bit confused on some concepts:

  1. Since the length of forward primer is 26; and the reverse is 23. Does it mean that the theoretical maximum length I can try is:
    --p-trunc-len-f 124 \ (150-26 = 124)
    --p-trunc-len-r 127 \ (150-23 = 127)

  2. To account for losing some of the bases for whatever reasons, an example of “safer choice” will be:
    --p-trunc-len-f 118
    --p-trunc-len-r 118
    In this case, assuming the amplicon is ~200bp, the number of bases overlapping will be:
    118+118-200 = 36bp, which looks like a safe length for overlapping

  3. You suggested "Consider changing the truncation parameters to account for the shorter length of reads without primers by providing lower values or disabling it”. What do you mean by disable?

    Am I on the correct path? Thanks!!!

Hello,

Yes, you are getting it right.

You need either to try with lower values (110?), or set it to 0 to disable truncation.

In theory yes, but there is always some variability in practice. Also, not sure how variable is the gene / region you are targeting across different organisms.

Yes, and if you go for lower values you can always decrease minimum overlap from 12 (default) to lower value.

If I remember the documentation correctly, setting it to 0 will disable it, so all the reads will pass without truncation. Truncation is desirable when the quality is declining to the end of the reads. Please check it in the plugin help message.

Best,

3 Likes

Thanks! The recommendation works perfectly.