To remove primers and spacers or not? - Zero usable reads after DADA2

Hello Forum!

I have been running into this issue and have yet to find a good solution. Here is an outline:

Seq Specs: MiSeq, V4-V5 region (515F-806R primers) PE 250 bp.
QIIME2 Specs: conda installation, 2019.4, using cluster computing

Our sequencing core demultiplexes the data before returning the reads. They also employ the use of heterogeneity spacers. Thus the reads we receive back still have the spacers and the primers in the sequences. Thus the amplicon read looks roughly like this:

R1: 5' Spacer - 5' Primer Sequence (bacterial 16S sequence) - sequence of interest

R2: 3' Spacer - 3' Primer Sequence (bacterial 16S sequence) - sequence of interest

Following the advice of the forum, using the cutadapt plugin, I removed the primers (and thus the spacers too), but when the DADA2 step comes, I essentially get 0 reads (commands below):

trim primers and spacers
qiime cutadapt trim-paired
--i-demultiplexed-sequences ./raw_artifacts/demux.qza
--p-cores 20
--p-front-f GAGTGCCAGCMGCCGCGGTAA
--p-front-r ACGGACTACHVGGGTWTCTAAT
--p-error-rate 0.1
--p-discard-untrimmed False
--o-trimmed-sequences ./primer_trimmed_seqs/trimmed-seqs.qza
--verbose

truncate and denoise with DADA2
qiime dada2 denoise-paired
--i-demultiplexed-seqs ./primer_trimmed_seqs/trimmed-seqs.qza
--p-trim-left-f 0
--p-trunc-len-f 245
--p-trim-left-r 0
--p-trunc-len-r 220
--p-n-threads 0
--o-representative-sequences rep-seqs.qza
--o-table table.qza
--o-denoising-stats ./dada2-denoise-dir/stats-dada2.qza
--verbose

And here is the head of the output I get from the DADA2 denoising:

And here is the quality visualization:

Now, here is the issue: I have heard conflicting advice from people at my University and here on the forum - is it best to trim primers off prior to DADA2 denoising, or should I leave the primers on? It seems the consensus on the forum is to remove primers. The consensus at my University is to leave the primers on to help align the sequences, since the primers are the conserved region, and thus using the conserved portion to align highlights the differences in the variable regions.

Should I leave the primers, and just trim the spacers, or do you think something else is going on here? Returning effectively zero sequences after denoising and chimera removal tells me something isn't right here.

Removing just the spacers is tricky, though, because the sequencing core employs 4 different spacers, thus I would have to run cutadapt four separate times on the four sets of reads with different spacers, and thus run DADA2 four different times. I would think this would mess with modeling used within DADA2, and thus the error rate and filtering, chimera removal, etc. would be modeled on a subset of the data, and not the entire dataset.

Any help is greatly appreciated!

2 Likes

Maybe this can help? it looks like you have sequences on input, but the quality plots look suspect. Is there proper overlap between the forward and reverse reads?

Hi @ben, Thank you for bringing this post up! I had seen this in the past. I have analyzed this data in many different ways (i.e. with primers, without primers, leaving most of the bases and truncating to the minimal allowed overlap, etc.). In my most recent run, that I showed above, I truncated to F 245 and R 220. The V4-V5 amplicon with the 515F/806R primer set is ~300-350 bp, so that would be ~115 bp overlap.

I am currently running it again and trimming more aggressively to F 200 and R 180 which will give me ~30 bp overlap. Will report back when it is done.

In regards to my other question, could you provide some guidance: what are your thoughts on trimming primers versus removing primers prior to merging?

Thanks!

I think you need to remove them before the DADA2 step to be honest. You saw my last post where I wantonly went looking for primers/adapters that weren't there. I am not sure about the V4V5 region as I have been trying to help folks with the V4 and V3V4 region, but I would just make sure that the 515F/806R is correct for the V4-V5 region, I think the 515F to 806R is specific for the V4 region. If that's the case, then the V4 region is approximately 250bp and your F and R amplicons should be around 150 bases.

I think that the 515f/926r primers are correct for the V4V5 region as this publications has used:

https://msystems.asm.org/content/msys/1/1/e00009-15.full.pdf

qiime dada2 denoise-paired
–i-demultiplexed-seqs ./primer_trimmed_seqs/trimmed-seqs.qza
–p-trim-left-f 0
–p-trunc-len-f 150
–p-trim-left-r 0
–p-trunc-len-r 150
–p-n-threads 0
–o-representative-sequences rep-seqs.qza
–o-table table.qza
–o-denoising-stats ./dada2-denoise-dir/stats-dada2.qza
–verbose

I would try it with these changes. Ben

qiime cutadapt trim-paired
–i-demultiplexed-sequences ./raw_artifacts/demux.qza
–p-cores 20
–p-front-f GAGTGCCAGCMGCCGCGGTAA
–p-front-r ACGGACTACHVGGGTWTCTAAT
–p-error-rate 0.1
–p-discard-untrimmed False
–o-trimmed-sequences ./primer_trimmed_seqs/trimmed-seqs.qza
–verbose

What is your output with this? If we're running Illumina, you may not find any primers/adapters.

Here is more detail about the primer set you've been running:

Ben

Thank you for pointing out my mistake, you are correct the 515F/806R is specific to the V4 region only. From what I have read, and including the EMP protocol, it looks like the expected amplicon size is ~300-350 bp from the 515F/806R primer pair, and like you said the V4 region is 250 bp.

The reads we did were indeed PE 250 bp reads, meaning we should have LOTS of overlap between the two sets. Which brings up another question - do you make your calculation for overlap based on the length of the V4 region (~250 bp), or based on the expected amplicon size(~350 bp)?

Connor

I do it to the V4 region not the expected amplicon size because the longer the reads are the more likely there is poor quality. If you are merging, what happens is that I think DADA2 is expecting only an overlap of 20-50bp/nt thus you have a forward amplicon of 150 and reverse amplicon of 150 with good quality you get F forward 150 nt and R reverse 150 nt and you have an overlap of 50nt.

However, if you’re using DADA and you have F forward 250nt and R reverse 250nt thus may be getting filtered out, but you have to troubleshoot - I have no idea what is going on under the hood.

edit: I think, this is just a guess, that all of your amplicons are getting filtered out because they are poor quality around the areas you pick to trunc

Ben

1 Like

You are absolutely right on the filtering aspect. I wasn’t taking into account that 1) the F & R reads were very low quality greater than 200 nt in length and 2) since I used cutadapt to remove primers and spacers prior to DADA2, my F reads were now only ~220 nt and my R reads ~210 nt, so with my truncate length in the original script of F @ 245 and R @ 220, I was losing almost all reads at that point anyways since the reads after cutadapt were already shorter than my truncate length.

I re-ran everything with the DADA2 forward and reverse truncate length at 200 and 180 nt respectively, and got what seem like more reasonable results. After filtering, denoising, and chimera removal, I am left with ~40-60% of sequences for most samples. I would hope for more retention, but I may have to play with the DADA2 parameters a little more.

I am still a little concerned about orienting the overlap based on V4 size (250 nt) versus amplicon size (350 nt). My thought is, and this is just a guess, we should take into account the full amplicon size, because we are sequencing the entire amplicon, not just the V4 region, correct? so if we truncate both reads too short, then there won’t be any overlap. It looks like @Mehrbod_Estaki summed it up well in this post where he said to use the amplicon size: Low frequency counts after DADA2 (7%).

2 Likes

Hey Connon,

From what I have read, and including the EMP protocol, it looks like the expected amplicon size is ~300-350 bp from the 515F/806R primer pair, and like you said the V4 region is 250 bp.

If you are using the EMP primers and sequencing protocol, the amplicons will be 251 bp long. If you sequence with 150 paired end chemistry, they will overlap by ~50 bp (150x2 - 250 == 50), and if you sequence with 250 paired end chemistry, they will fully overlap.

Of course, this depends on your primers and sequencing method! You may have sequenced something larger.

Colin

2 Likes

Thanks for the update, to be honest, I don’t think you would get any more reads at this point because it looks like the quality of your forward and reverse reads don’t vary greatly across the bases. So, to humor me, you could run DADA2 with 150nt forward and reverse, but I don’t think you would get better recovery than you are now. Ben

2 Likes

Sorry about the delay response in this. I would agree with @ben on this that your quality scores are unfortunately tanking too early for your current truncating parameters. As the stats results show you are losing most of your reads at the filtering step which has nothing to do with overlap, that’s a later process. Personally I wouldn’t really bother with trying to merge them, you will have much better retention if you used only one set of reads, in your case the reverse actually looks better than forward. You can probably set your truncating at 150 safely on the Reverse, and maybe gradually move it up 10-15 at a time if you really care about the length. From personal experience you are going to see very little benefit from say 150 bp read vs 170 or 190, especially in the V4 region.

As for removing your primers or keeping them in, I’m not sure if this has been analytically shown to be beneficial in short reads, in contrast though removing primers does seem to improve taxonomic classification later, albeit very little. Feel free to experiment with both and share your results! Would be curious to know.

2 Likes

@ben truncating to 150 nt for both the F and R reads I was able to gain roughly 10-15% more recovery (!!) in all samples when compared with truncating at F 200 nt and R 180 nt. I have both stats outputs below:

F 200 nt & R 180 nt:

F 150 nt & R 150 nt:

Thanks for that tip.

@Mehrbod_Estaki, with these stats, do you think I should still throw away my F reads and only use my reverse? Or do you think I should still try to merge?

Thanks!

2 Likes

:partying_face::partying_face::partying_face::partying_face::partying_face::partying_face:

edit: My suspicion is that while the quality does matter, the quality helps with the merging. But you have to truncate the sequences while preserving quality. Ben

1 Like

Thanks for all your help, much appreciated, Ben!!

1 Like

You are losing a good number of reads at merging for a select number of samples. This is worse than losing reads at filtering (since reads lost at filtering are randomly distributed, but reads lost at merging will belong specifically to any species with longer V4 domains so you are skewing the community in those samples.

So I would recommend bumping up the truncation lengths just a little bit... maybe 160 + 150 and see if that is enough.

4 Likes

I agree with @Nicholas_Bokulich’s points. The potential bias that can be introduced because of the merging issue should be considered. Using reverse only reads will be safer if that is a concern, and I’m certain you will retain even more reads that way, but you do have good numbers currently so that shouldn’t be your only motivation there. It’s always fun playing with those parameters! A lot to be learnt from just fine-tuning them

3 Likes

Thanks @Mehrbod_Estaki & @Nicholas_Bokulich for the insights! I will play around with the parameters a bit and see how it changes the data. Really appreciate all your help!

2 Likes

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.