Hello dear QIIME 2 community!
I am analyzing 16S V3-V4 amplicon data sequenced on an Illumina NovaSeq 6000 with 250 bp paired-end reads (I am aware that sequencing this region with relatively short reads creates some challenges, but that's what I have).
The sequencing was performed at Novogene and they provide their standard bioinformatics pipeline included in the sequencing package. However, since I need to combine data from several sequencing runs and further process the data, I need to do some steps of the analysis myself.
Here are the processing steps of the data I received: barcodes and primers were removed with cutadapt, reads were merged using FLASH, low-quality reads were filtered using fastp, chimeras were removed using vsearch. After all of this, the data was imported into QIIME 2 and denoised using the dada2 plugin to obtain ASVs.
While reading about the steps listed above to decide at which point I should feed the data into my own analysis, I learned that it is generally advised to run dada2 on unprocessed, unmerged reads. See, for example, the following forum posts:
Can DADA2 be carried on pre-joined reads - User Support - QIIME 2 Forum
Is my Dada2 output normal? - User Support - QIIME 2 Forum
Determining correct pre-processing pipeline for - User Support - QIIME 2 Forum
Additionally, I was confused about why chimera removal is done twice, once by vsearch and then again by dada2.
I asked the bioinformatics team at Novogene about the reasoning behind this. Here are their responses:
- They first use FLASH to merge paired-end reads to generate longer consensus sequences and then run dada2 denoise-single because it increases the overall merging rate. This prevents substantial read loss, which commonly occurs with DADA2’s built-in mergePairs function due to poor sequence quality or variable amplicon lengths.
- Chimera removal is performed using vsearch by comparing against the Silva database. Dada2 then performs reference-free chimera removal. This approach minimizes false-positive ASVs as much as possible.
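To put the merging concern from the first response into numbers, here is a minimal sketch of the overlap arithmetic as I understand it. The amplicon lengths and truncation lengths below are purely illustrative assumptions on my part, not values from my data or from Novogene; the only fixed value is the 12 nt minimum overlap, which is the default of DADA2's mergePairs.

```python
# Rough overlap arithmetic for merging paired-end reads.
# All amplicon and truncation lengths are hypothetical, for illustration only.

DADA2_MIN_OVERLAP = 12  # default minOverlap of DADA2's mergePairs


def overlap_after_truncation(amplicon_len, trunc_len_f, trunc_len_r):
    """Nominal overlap (nt) between forward and reverse reads.

    A negative value means the two reads do not reach each other.
    """
    return trunc_len_f + trunc_len_r - amplicon_len


def can_merge(amplicon_len, trunc_len_f, trunc_len_r,
              min_overlap=DADA2_MIN_OVERLAP):
    """True if the nominal overlap meets the minimum required overlap."""
    overlap = overlap_after_truncation(amplicon_len, trunc_len_f, trunc_len_r)
    return overlap >= min_overlap


if __name__ == "__main__":
    # Assumed primer-trimmed amplicon lengths and truncation settings.
    for amplicon_len in (400, 420, 440):
        for trunc_f, trunc_r in ((230, 230), (230, 210), (220, 200)):
            overlap = overlap_after_truncation(amplicon_len, trunc_f, trunc_r)
            status = "ok" if can_merge(amplicon_len, trunc_f, trunc_r) else "too short"
            print(f"amplicon={amplicon_len} f={trunc_f} r={trunc_r} "
                  f"overlap={overlap} ({status})")
```

If this picture is right, longer amplicon variants combined with aggressive quality truncation would be the cases where reads fail to merge, which seems to be the read loss Novogene is referring to.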
So now my question is: what do the more experienced microbiome researchers in the community think about this? Should I trust the pre-processing steps and simply start with the dada2 output artifacts I received and merge them according to the tutorial here? Or is it better to go back to the raw data with just adapters and primers removed, import it into QIIME 2, and re-run dada2 myself?
Thanks a lot in advance for any insights on this topic!
