We have a question regarding the expected read length after merging. Here are a few details about the sequencing data we have available:
16s rRNA V1 and V2 regions were sequenced
Primers 7F (AGAGTTTGATCCTGGCTCAG) and 338R (TGCTGCCTCCCGTAGGAGT) were used. The primers were already removed from the reads when we received the data.
Sequencing length for the forward and reverse reads (after primers and barcodes were removed) was equal to 301bp (see demux_stats.qvz attached)
After denoising the samples using DADA2 the sequencing length is shorter than what I would expect. I did a calculation for the expected length of the sequence after merging (also see visualization below).
The forward length would be from 7(F) + 301 = to 308
The reverse length would be from 338(R) - 301 = to 37.
The overlap between the forward and the reverse should be 271bp (from base 37 to base 308), with a unique sequence of 30bp on each end
Hi @emntsha,
I think this is about the variation between lengths in 16s rRNA variable region. Like your math pointed out 338 - 7 = 331, these numbers are based off locations in the e. coli (I think) genome so the math is a good rough estimate but there is gonna be some variation. It looks like the majority of your features have a shorter region than e. coli.
When you look at your data2-stats are most of your sequences making it through the filters? As long as that looks good, I don't think this is anything to worry about.
Yeah although the numbers look good in terms of most of my sequences passed through the filters, I am still curious of why my calculation is different from the output from QIIME2...
Hi @emntsha,
Your math seems right to me. I believe that your results are different than you expect because there is variation in the length of the variable region across bacteria species. So even though your math based on e. coli's genome says that the region should be 331, not every bacteria will have that length of variable region. It seems like on average the bacteria you are looking at have a shorter region than e. coli. This is resulting in your sequences being shorter than expected.
Just wanted to add to the excellent @cherman2 explanation.
I work with almost identical primers (small differences in F sequence) and have various datasets (microbiomes from humans, dogs, cows, lambs, and more). The average sequence length varies between datasets, from 312 to 320. So I guess the natural variation of V1-V2 region lengths is a reason why your calculations and actual data do not match with the latest depending on microbiome composition of samples.
Ah thanks so much to both @cherman2 and @timanix !
And then a follow up question would be then should we do filtering at all given that the V1-V2 region lengths naturally vary? If we still need filtering, what range would you suggest to be a good range for this region?
I think it is optional but I do not filter sequences based on the length - after dada2 I filter out rare sequences (with counts less than 10 or 50) and sequences that are found only in less than 5-2 samples, then perform taxonomy annotation and filter again to get rid of unassigned sequences and sequences from organelles.
Yes, from my experience merged amplicons from 16S data are always variable in length. Maybe I am wrong, but I expect that all sequences that should not be in my dataset would not pass Dada2 filters, would not merge, would be discarded at the cutadapt step (I am discarding sequences with no primers), would not be abundant, would not be presented in many samples or would not be properly annotated. But It is my opinion.