Hello there everyone,
I am a student and have a theory question that I am struggling to find an answer to.
I understand that the variable and conserved regions are found in all 16S rRNA gene sequences, and that this is why we can use the gene to differentiate between bacteria with this method.
However, after some research, it looks like there are no defined nucleotides that, for example, denote the V3-V4 region.
This paper states that the 16S rRNA gene primer sets are 341F/785R for V3-V4 and 515F-Y/926R for V4-V5.
Yet another paper, titled "Strategy for microbiome analysis using 16S rRNA gene sequence analysis on the Illumina sequencing platform" (L. Ram et al., 2011), says that V3 starts at 430C and V4 ends at 651G.
Would anyone please explain this to me? My thought is that since it is a biological system, it probably isn't exact and so the area for these "regions" will only be a general length/area and not specifically X nucleotide --> X nucleotide.
If this is the case, then is there a reliable source that I can use to define these regions so I can choose them when doing my analysis?
You're very right about it not being exactly defined - it is biology, after all!
The nucleotide numbering commonly used to name primers comes from Escherichia coli, so a name like 341F refers to a position in the E. coli 16S rRNA gene rather than a universal coordinate. Not only do the sequences of the hypervariable regions vary, but their lengths vary too, so the absolute positions of the hypervariable regions differ between taxa.
To answer your last question, I don't know of a source that already has that information for all the taxa you're interested in. Because of the length diversity, in my opinion your best bet for handling these sequences accurately is to use the relevant (degenerate) primer sequences together with RESCRIPt. You could also build a multiple sequence alignment and extract the relevant regions that way, but I haven't tried that.
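To make the primer-based idea concrete, here is a minimal Python sketch of pulling out the region between two degenerate primers. It is not how RESCRIPt or QIIME 2 do it internally: it only does exact IUPAC matching with no mismatch tolerance, and the function names are made up for illustration. The primer sequences are the commonly cited 341F/785R pair and are an assumption on my part, so check them against your own protocol.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also handles degenerate codes (e.g. R <-> Y)
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence, degenerate codes included."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_to_regex(primer):
    """Turn a degenerate primer into a regex pattern string."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_region(seq, fwd_primer, rev_primer):
    """Return the sequence between the two primers (primers excluded),
    or None if either primer has no exact IUPAC match."""
    pattern = re.compile(
        primer_to_regex(fwd_primer)
        + "(.*?)"
        + primer_to_regex(reverse_complement(rev_primer))
    )
    match = pattern.search(seq.upper())
    return match.group(1) if match else None


# Assumed primer sequences (not taken from this thread)
FWD_341F = "CCTACGGGNGGCWGCAG"
REV_785R = "GACTACHVGGGTATCTAATCC"

# Toy sequence: flank + forward site + fake "region" + reverse site + flank
toy = "AAAA" + "CCTACGGGAGGCAGCAG" + "TTTTGGGGCCCC" + "GGATTAGATACCCTGGTAGTC" + "AAAA"
region = extract_region(toy, FWD_341F, REV_785R)
print(region, len(region))
```

Run over a reference database, the lengths of the extracted regions give you an empirical definition of the region per taxon, instead of fixed start and end coordinates.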
As an illustration:
When I was considering truncation lengths to use for DADA2, I wanted to know how long the V3-V4 region is, so that I could ensure I had enough overlap between forward and reverse reads. Using the most recent SILVA database, RESCRIPt, and the primers for the V3-V4 region, I extracted the V3-V4 regions from the 16S sequences. From there I was able to estimate how the length varies between taxonomic groups. For example, the V3-V4 regions from the class Clostridia were around 404 nt long (mode of the length distribution), whereas for the class Bacilli the mode length was around 429 nt.
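The two calculations in that illustration can be sketched roughly as follows. This is only an outline under stated assumptions: the records are made-up toy data, the function names are hypothetical, and the truncation lengths (250 and 200) are example values rather than a recommendation. The overlap formula also assumes that the region length and the truncated reads are measured in the same frame (both with primers or both without).

```python
from collections import Counter, defaultdict


def mode_length_by_group(records):
    """records: iterable of (group_name, region_sequence) pairs.
    Returns the most frequent region length for each group."""
    counts = defaultdict(Counter)
    for group, region in records:
        counts[group][len(region)] += 1
    # most_common(1) returns [(length, count)] for the most frequent length
    return {group: c.most_common(1)[0][0] for group, c in counts.items()}


def expected_overlap(trunc_len_f, trunc_len_r, amplicon_len):
    """Bases shared by the forward and reverse reads after truncation.
    A negative value means the reads do not reach each other."""
    return trunc_len_f + trunc_len_r - amplicon_len


# Toy data only: short fake "regions" standing in for extracted V3-V4 sequences
toy_records = [
    ("Clostridia", "ACGTA"), ("Clostridia", "TTGCA"), ("Clostridia", "ACGT"),
    ("Bacilli", "ACGTACG"), ("Bacilli", "GGCATTA"), ("Bacilli", "ACGTACGT"),
]
print(mode_length_by_group(toy_records))

# Using the mode lengths quoted above with hypothetical truncation lengths
for name, length in [("Clostridia", 404), ("Bacilli", 429)]:
    print(name, expected_overlap(250, 200, length))
```

The point is to check the overlap against the longest region you expect to see, not the average one, and to compare it with the minimum overlap your DADA2 version requires for merging (I believe the default is 12 nt, but please verify that).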