I’ve got some questions about primers and about calculating overlap for paired-end reads. I hope I’m posting in an appropriate place. Sorry, if not. This is my first post.
We’ve done some 16S rRNA amplicon sequencing with primers that Illumina says are used to sequence the V3 and V4 variable regions of the 16S rRNA gene. The Illumina guide can be found here.
I’ve noticed that the last 17 nt of the forward Illumina primer and the last 21 nt of the reverse Illumina primer can be found in the following three papers (and maybe others):
Herlemann, D. P., Labrenz, M., Jürgens, K., Bertilsson, S., Waniek, J. J., & Andersson, A. F. (2011). Transitions in bacterial communities along the 2000 km salinity gradient of the Baltic Sea. The ISME journal, 5(10), 1571-1579.
Klindworth, A., Pruesse, E., Schweer, T., Peplies, J., Quast, C., Horn, M., & Glöckner, F. O. (2013). Evaluation of general 16S ribosomal RNA gene PCR primers for classical and next-generation sequencing-based diversity studies. Nucleic acids research, 41(1), e1-e1.
Thijs, S., Op De Beeck, M., Beckers, B., Truyens, S., Stevens, V., Van Hamme, J. D., ... & Vangronsveld, J. (2017). Comparative evaluation of four bacteria-specific primer pairs for 16S rRNA gene surveys. Frontiers in microbiology, 8, 494.
As far as I can tell, the Herlemann et al. paper is the original source of these sequences. That paper reports primers called Bakt_341F (CCTACGGGNGGCWGCAG) and Bakt_805R (GACTACHVGGGTATCTAATCC). The amplicon size isn’t given, but the targeted variable regions of the 16S rRNA are given as V3 and V4.
The Klindworth et al. paper is cited by the Illumina guide referenced above. That paper reports primers called S-DBact-0341-b-S-17 (5’-CCTACGGGNGGCWGCAG-3’) and S-D-Bact-0785-a-A-21 (5’-GACTACHVGGGTATCTAATCC-3’), which are said to have an amplicon size of 464 bp. No details are given as to which variable regions of the 16S rRNA gene are targeted.
The Thijs et al. paper reports primers called 341f (CCTACGGGNGGCWGCAG) and 785r (GACTACHVGGGTATCTAATCC). 341f is said to target positions 341–357 in E. coli, and 785r targets positions 785–805. The amplicon size is given as 444 bp. The targeted variable regions of the 16S rRNA gene are given as V3–V4.
- My understanding is that the first 33 nt of the forward Illumina primer and the first 34 nt of the reverse Illumina primer each comprise a Nextera adapter, which overhangs (i.e. does not bind to the 16S rRNA gene). In that case, are the last 17 nt of the forward primer and the last 21 nt of the reverse primer the regions that bind to the 16S rRNA gene? Figure 1 in the Illumina guide seems to suggest that they are.
- In a forum post about these primers, @benjjneb recommended removing “the entire primers” in the DADA2 denoising step because “the primers aren’t sequences from the sample”. Does he mean that although the 17-nt and 21-nt regions are designed to bind to the 16S rRNA gene, they should be removed in the DADA2 denoising step because they may not always 100% exactly match each 16S rRNA gene sequence that they bind to? That would make sense to me.
- The primer pairs in the three papers I mentioned are all identical, yet they all have different names, which is confusing. What do the numbers 341, 785 and 805 mean? Do these refer to the positions where the primers bind to the 16S rRNA gene in the E. coli genome? I guess that would make sense, because presumably 785 and 805 would then refer to opposite ends of the reverse primer.
- The Klindworth et al. paper gives an amplicon size of 464 bp for these primers, whereas the Thijs et al. paper gives an amplicon size of 444 bp. If the primers, plus the sequence they amplify, span from positions 341 to 805, I guess the correct amplicon size should be 805 − 341 = 464 bp, right? Also, what does this number mean exactly? Is it the amplicon size expected if the primers are used to amplify from the E. coli 16S rRNA gene? And, if using these primers to sequence an environmental sample, would we then get a range of slightly different amplicon sizes (because the 16S rRNA gene varies slightly in different species)? In other words, is 464 bp kind of an approximate number?
- In this forum post, @colinbrislawn gave this kind of formula for calculating overlap: (length of forward read) + (length of reverse read) − (length of amplicon) = length of overlap. So, in the case of 2x300 bp paired-end Illumina MiSeq sequencing, we’d have 300 + 300 − 464 = 136 bp overlap, to start with? And then, am I right in thinking that the
qiime dada2 denoise-pairedbasically don’t impact the overlap (because they act at the 5’ end of the reads), but
--p-trunc-len-rdo (because they act at the 3’ end of the reads)? If so, then I guess the formula for the final overlap would be: (length of forward read) + (length of reverse read) − (length of amplicon) − (length of forward read −
--p-trunc-len-fvalue) − (length of reverse read −
--p-trunc-len-rvalue) = overlap. So, for example, if we picked
--p-trunc-len-f= 280 and
--p-trunc-len-r= 250, we’d have 300 + 300 − 464 − 20 − 50 = 66 bp overlap?
Apologies for the ungodly length of this post, and the potentially very noob questions. Perhaps this kind of 'thinking out loud' post will be somehow useful to other noobs in the future. Anyway, big thanks to the team for developing this amazing tool and for maintaining this very useful forum. You are appreciated, and we are grateful! Keep up the great work!