Benchmarking alternative methods of read-joining in QIIME 2

Has any benchmarking been done for the tutorial on Alternative methods of read-joining in QIIME 2? I would like to know whether joining paired-end reads that have a lot of overlap (e.g. forward and reverse 150 bp reads sequenced from the V4 region of the 16S rRNA gene) is "better" than running Deblur on the forward reads only. By better, I mean not only a lower error rate but, ideally, also more accurate estimates of taxonomic abundance and diversity in the microbiome.


According to the original DADA2 paper, merging reads before denoising disrupts the DADA2 algorithm. However, the Deblur paper has no similar discussion.

Hi @amirza,

Can you clarify what you mean by

Deblur operates on single-end reads using a pre-trained error model, so if you merge your reads before providing them as input, Deblur will simply treat each merged read as a regular, longer sequence and denoise accordingly. In theory, merging can improve quality scores in the overlap region, and some tools do recalculate the quality scores when reads are merged, but this doesn't matter for Deblur because quality scores are not used in the denoising step. The other consideration is that Deblur tends to become more conservative as read length increases (see example calculation here), so you will actually end up with fewer reads than if you used the forward reads alone (assuming they are somewhat shorter than the merged reads). Intuition says you may gain a slight improvement in taxonomic resolution as read length increases, but to be honest I'm not sure I've seen this benchmarked anywhere (aside from this old paper from 2007, Fig 1), and for short reads I really don't think the difference between 150 and 180 nt will have any noticeable effect. At that point I think I'd prefer having more reads over longer reads, but that is also very data-dependent.
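To give some intuition for why longer reads can leave you with fewer sequences after denoising, here is a minimal sketch. It is not Deblur's error model and not the calculation linked above; it only assumes a uniform, independent per-base error rate (the value 0.005 is a made-up illustration) and computes the fraction of reads expected to contain no errors at a given length. The function name `error_free_fraction` is hypothetical.

```python
def error_free_fraction(read_length: int, per_base_error: float) -> float:
    """Fraction of reads expected to be error-free, assuming each base
    is wrong independently with the same probability (a simplification,
    not Deblur's actual model)."""
    return (1.0 - per_base_error) ** read_length


if __name__ == "__main__":
    # Hypothetical per-base error rate, chosen only for illustration.
    assumed_error = 0.005
    for length in (150, 180, 250):
        frac = error_free_fraction(length, assumed_error)
        print(f"{length} nt: {frac:.3f} of reads expected to be error-free")
```

Under this simplified assumption the error-free fraction shrinks as the read gets longer, which matches the general direction of the trade-off described above, though the actual numbers from Deblur will differ.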


Dear @amirza,

I'm unaware of a benchmark that has definitively shown that read stitching is always better than just using forward reads. It depends on what you want to do, and it may have no impact on the biological conclusions you derive from alpha and beta diversity. The easiest approach is to run the analysis both with and without merging and see whether the conclusions differ (and if so, whether the difference is meaningful).
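One rough way to do that comparison, sketched in plain Python: compute Shannon diversity per sample from each feature table and check whether the samples rank the same way. The tables below are made-up toy counts, and the helper names (`shannon`, `alpha_by_sample`, `rank_order`) are hypothetical; in practice you would export the two tables from QIIME 2 and load them instead.

```python
import math


def shannon(counts):
    """Shannon diversity (natural log) for one sample's feature counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def alpha_by_sample(table):
    """Map each sample ID to its Shannon diversity."""
    return {sample: shannon(list(features.values()))
            for sample, features in table.items()}


def rank_order(alpha):
    """Sample IDs sorted from lowest to highest diversity."""
    return sorted(alpha, key=alpha.get)


# Toy data only: counts per feature for forward-only vs merged reads.
forward_only = {
    "sampleA": {"asv1": 50, "asv2": 50},
    "sampleB": {"asv1": 90, "asv2": 10},
    "sampleC": {"asv1": 40, "asv2": 30, "asv3": 30},
}
merged = {
    "sampleA": {"asv1": 40, "asv2": 38},
    "sampleB": {"asv1": 70, "asv2": 9},
    "sampleC": {"asv1": 30, "asv2": 22, "asv3": 25},
}

if __name__ == "__main__":
    fwd_alpha = alpha_by_sample(forward_only)
    mrg_alpha = alpha_by_sample(merged)
    for sample in forward_only:
        print(f"{sample}: forward={fwd_alpha[sample]:.3f} "
              f"merged={mrg_alpha[sample]:.3f}")
    print("Same ranking:", rank_order(fwd_alpha) == rank_order(mrg_alpha))
```

If the ranking and the downstream group comparisons agree between the two runs, the choice of forward-only versus merged reads probably does not change your conclusions for that dataset.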