Separating Amplicons by Primer and Denoising

I am working with paired-end FASTQ sequence data containing amplicons generated using several primers. I have tried two different methods to separate the amplicons by primer and denoise them, but I am getting poor results.

For the first approach I used cutadapt to select the ZBJ amplicons and to cut off the ZBJ primers. The adapters were found in around 90% of the reads and 87% passed filtering.

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences data.qza \
  --p-front-f AGATATTGGAACWTTATATTTTATTTTTGG \
  --p-adapter-f GGAGGATTTGGWAATTGATTAGTW \
  --p-front-r WACTAATCAATTWCCAAATCCTCC \
  --p-adapter-r CCAAAAATAAAATATAAWGTTCCAATATC \
  --p-match-read-wildcards \
  --p-match-adapter-wildcards \
  --p-discard-untrimmed \
  --o-trimmed-sequences ZBJ.qza

However, I don't think this worked correctly, as the ZBJ amplicon should be around 157 bp in length and the plot shows my sequences are up to around 230 bp.

When I denoise these sequences with DADA2, a high proportion of reads is removed during the initial filtering step. I have tried various DADA2 settings, including no trimming, trimming to 230 bp, and truncating at lengths from 130 to 220 bp. Here is an example:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs ZBJ.qza \
  --p-trunc-len-f 150 \
  --p-trunc-len-r 150 \
  --p-trim-left-f 23 \
  --p-trim-left-r 23 \
  --p-n-threads 40 \
  --o-table ZBJfeaturetable.qza \
  --o-representative-sequences ZBJrepseqs.qza \
  --o-denoising-stats ZBJstats.qza

As this gave poor results, I decided to denoise the entire dataset.

From this plot of all the sequences, I used the following DADA2 settings and got good results when truncating to 150 bp, as seen in the stats table below (and poor results when truncating to 160 bp).

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs dataimported.qza \
  --p-trunc-len-f 150 \
  --p-trunc-len-r 150 \
  --p-trim-left-f 23 \
  --p-trim-left-r 23 \
  --p-n-threads 20 \
  --o-table allfeaturetable.qza \
  --o-representative-sequences allrepseqs.qza \
  --o-denoising-stats allstats.qza

As these results were good, I thought I would try separating the amplicons after DADA2 using qiime feature-classifier extract-reads, as suggested here:

qiime feature-classifier extract-reads \
  --i-sequences allrepseqs.qza \
  --p-f-primer AGATATTGGAACWTTATATTTTATTTTTGG \
  --p-r-primer WACTAATCAATTWCCAAATCCTCC \
  --p-n-jobs 20 \
  --o-reads ZBJrepseqextracted.qza

However, this only gave a sequence count of 97, so I don't think the ZBJ amplicons were extracted correctly. Could this be due to the primers being degenerate? I lowered the --p-identity parameter to 0.7, but it had little impact.

So my questions are:

  1. What method should I use to separate my amplicons by primer?
  2. Why am I getting longer than expected sequence reads after using cutadapt?
  3. Why are so many sequences being filtered out during denoising after using cutadapt and what criteria does the first filtering step actually use?
  4. Can qiime feature-classifier extract-reads be used with degenerate primers?

Any clarification on these topics is greatly appreciated!

Hi there @Rach3l!

I think this might be a case for using what cutadapt calls "linked adapters":

https://cutadapt.readthedocs.io/en/stable/guide.html#linked-adapters

If your sequence of interest is surrounded by a 5’ and a 3’ adapter, and you want to remove both adapters, then you can use a linked adapter. A linked adapter combines a 5’ and a 3’ adapter. By default, the adapters are not anchored, but in many cases, you should anchor the 5’ adapter by prefixing it with ^.

Without specifying linked adapters you would need to run cutadapt twice: once for the 5' adapters and once for the 3' adapters (because by default cutadapt only removes the first adapter it finds in each read). Using the linked-adapter syntax should let you accomplish this trimming in one pass. You'll need to experiment a bit with this; you might need to anchor your adapters, as mentioned above. You might also be able to do this without the linked-adapter approach and instead use --p-times (I assume you would set that to 2, but I'm not 100% sure of that).
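
For example, here is a rough, untested sketch of the linked-adapter approach built from your primer sequences. It assumes q2-cutadapt passes these strings through to cutadapt unchanged, uses --p-front-f/--p-front-r so the 5' ends are not anchored by default (matching the docs quoted above), and the output filename is just a placeholder:

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences data.qza \
  --p-front-f 'AGATATTGGAACWTTATATTTTATTTTTGG...GGAGGATTTGGWAATTGATTAGTW' \
  --p-front-r 'WACTAATCAATTWCCAAATCCTCC...CCAAAAATAAAATATAAWGTTCCAATATC' \
  --p-match-read-wildcards \
  --p-match-adapter-wildcards \
  --p-discard-untrimmed \
  --o-trimmed-sequences ZBJ-linked.qza

If too little gets trimmed that way, anchoring the 5' ends (prefixing them with ^) or falling back to running cutadapt twice would be the next things to try.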

To answer your numbered questions:

  1. Use linked adapters, or run q2-cutadapt twice, or try using --p-times.
  2. You're getting longer reads than expected because you aren't trimming all of the adapter sequence.
  3. So many reads are being filtered out because there is artificial/non-biological sequence in your reads, which is an issue for DADA2 (the snippet after this list shows a quick way to check where reads are being lost).
  4. We try to keep this forum to one thread per discussion topic, can you please post this as an independent question?
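
To see exactly which step is dropping your reads, you can also view the denoising stats artifact you already generated; for example, something like this (the visualization filename is just a suggestion):

qiime metadata tabulate \
  --m-input-file ZBJstats.qza \
  --o-visualization ZBJstats.qzv

That will show, per sample, how many reads survive the filtering, denoising, merging, and chimera-removal steps.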

Keep us posted! :qiime2:

PS: if you run q2-cutadapt with the --verbose flag you'll get a really comprehensive filtering report from cutadapt that'll help steer your decision-making; it'll tell you which adapters were found when filtering.
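
For example, rerunning the linked-adapter sketch from above with the report saved to a file (capturing stderr as well, since some of the output may go there; the report filename is arbitrary):

qiime cutadapt trim-paired \
  --i-demultiplexed-sequences data.qza \
  --p-front-f 'AGATATTGGAACWTTATATTTTATTTTTGG...GGAGGATTTGGWAATTGATTAGTW' \
  --p-front-r 'WACTAATCAATTWCCAAATCCTCC...CCAAAAATAAAATATAAWGTTCCAATATC' \
  --p-match-read-wildcards \
  --p-match-adapter-wildcards \
  --p-discard-untrimmed \
  --o-trimmed-sequences ZBJ-linked.qza \
  --verbose 2>&1 | tee cutadapt-report.txt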


Hi Matthew,

Thank you for your detailed response.

I think this might be a case for using what cutadapt calls "linked adapters":

I think you are right; I checked the fastq files and it looks like the sequences contain both the 5' and the 3' primer. I tried the linked-adapter approach and also using --p-times, but I couldn't get these methods to remove the primers properly. So I ran cutadapt twice, once for the 5' primers and a second time for the 3' primers, which gave me the 157 bp amplicons I was expecting and much better DADA2 stats.
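
In case it helps anyone else, here is a minimal sketch of that two-pass approach, assuming the same primers as in my first command (the intermediate and output filenames are just placeholders):

# pass 1: trim the 5' primers and keep only reads in which they were found
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences data.qza \
  --p-front-f AGATATTGGAACWTTATATTTTATTTTTGG \
  --p-front-r WACTAATCAATTWCCAAATCCTCC \
  --p-match-read-wildcards \
  --p-match-adapter-wildcards \
  --p-discard-untrimmed \
  --o-trimmed-sequences ZBJ-5prime-trimmed.qza

# pass 2: trim the read-through 3' primers from the output of pass 1
# (no --p-discard-untrimmed here, since shorter reads may not reach the 3' primer)
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences ZBJ-5prime-trimmed.qza \
  --p-adapter-f GGAGGATTTGGWAATTGATTAGTW \
  --p-adapter-r CCAAAAATAAAATATAAWGTTCCAATATC \
  --p-match-read-wildcards \
  --p-match-adapter-wildcards \
  --o-trimmed-sequences ZBJ-trimmed.qza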

We try to keep this forum to one thread per discussion topic, can you please post this as an independent question?

Yes, will do :slight_smile:

Thank you again for your help.

