Hello everyone!
I am a beginner in QIIME 2 and am currently doing some exploratory work with it on my soil 16S dataset. I had a hard time deciding whether to trim off the primers with Cutadapt before DADA2 or to use the trim function in DADA2 directly. I am aware that this issue has been brought up several times before, and the general consensus seems to favor the latter option of using only DADA2. However, my supervisor prefers trimming the primers with Cutadapt before proceeding to denoising. Therefore, I ran a few trials to test whether the results differ much between approaches, and it seems they vary quite a lot. I'd like to ask the experts whether the results make sense and which option I should stick with.
Some background:
- Sample type: soil
- Data type: Illumina MiSeq, paired-end 2x250 bp
- Sample size: 147 (including 2 kit controls)
- Primer set: 515F-806R
  - Forward: GTGCCAGCMGCCGCGGTAA
  - Reverse: GGACTACHVGGGTWTCTAAT
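For the DADA2-only route, the `--p-trim-left-f` and `--p-trim-left-r` values would normally be set to the primer lengths. The snippet below is a minimal sketch of that arithmetic; it assumes both primers sit at the very start of every read (no heterogeneity spacers or leftover barcodes), which should be checked against the raw reads before relying on it.

```shell
# Primer sequences copied from the list above.
fwd="GTGCCAGCMGCCGCGGTAA"
rev="GGACTACHVGGGTWTCTAAT"

# ${#var} expands to the string length, i.e. the number of bases to trim.
# Assumption: the primer starts at position 1 of each read.
echo "candidate --p-trim-left-f: ${#fwd}"
echo "candidate --p-trim-left-r: ${#rev}"
```

With these primers the two lengths differ by one base, so the forward and reverse trim values should not be assumed equal.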
I first examined the interactive plot of the demultiplexed data.
Figure 1. Interactive plot (paired-end-demux.qzv)
Then I ran the following command to trim off the primers with Cutadapt.
qiime cutadapt trim-paired \
  --i-demultiplexed-sequences paired-end-demux.qza \
  --p-cores 4 \
  --p-front-f GTGCCAGCMGCCGCGGTAA \
  --p-front-r GGACTACHVGGGTWTCTAAT \
  --p-discard-untrimmed \
  --p-no-indels \
  --o-trimmed-sequences reads_trimmed.qza
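Because `--p-discard-untrimmed` drops every read pair in which a primer is not found, it can help to estimate beforehand how many raw reads actually begin with the primer. The sketch below expands the IUPAC degenerate codes into a regular expression and anchors it at the start of the read. The helper name `iupac_to_regex` and the file name in the usage note are hypothetical, and the check is exact-match only, so it will undercount relative to Cutadapt, which tolerates some mismatches.

```shell
# Convert a primer with IUPAC degenerate bases into an extended regex.
# The replacements only ever insert A, C, G, T and brackets, so the order
# of the substitutions does not matter.
iupac_to_regex() {
  printf '%s' "$1" | sed \
    -e 's/R/[AG]/g'  -e 's/Y/[CT]/g'  -e 's/S/[CG]/g' \
    -e 's/W/[AT]/g'  -e 's/K/[GT]/g'  -e 's/M/[AC]/g' \
    -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' -e 's/H/[ACT]/g' \
    -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

fwd_regex="^$(iupac_to_regex GTGCCAGCMGCCGCGGTAA)"
echo "$fwd_regex"
```

A possible use, assuming a gzipped forward-read file called `sample_R1.fastq.gz`: `zcat sample_R1.fastq.gz | awk 'NR % 4 == 2' | grep -Ec "$fwd_regex"` counts the sequence lines that start with an exact forward-primer match.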
Here are the plots obtained after trimming.
Figure 2. Interactive plot (reads_trimmed.qzv)
Based on the plots above, I ran four trials using the parameters listed below.
Table 1. Comparison of the trim and truncation parameters used in the four approaches
The following are the results of the “table summary” section from table.qzv files produced by each approach.
Table 2. Comparison of the “table summary” section.
Next, I examined the denoising-stats.qzv files and calculated the mean and standard deviation of each output parameter (n=147).
Table 3. Summary of the denoising-stats.qzv files
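The per-column mean and SD could be computed along the following lines. This is only a sketch, not necessarily how the numbers above were produced: it assumes the denoising stats have been exported to a tab-separated file with one header line and optional `#`-prefixed comment lines, it uses the sample SD (n-1 denominator), and the function name `col_mean_sd` is made up for illustration.

```shell
# Print "mean sd" for one numeric column of a tab-separated file.
# $1 = path to the TSV, $2 = 1-based column index.
col_mean_sd() {
  awk -F'\t' -v c="$2" '
    NR == 1 || /^#/ { next }            # skip the header and comment lines
    { n++; s += $c; ss += $c * $c }     # running sum and sum of squares
    END {
      if (n < 2) { print "need at least 2 rows"; exit 1 }
      mean = s / n
      sd = sqrt((ss - n * mean * mean) / (n - 1))   # sample SD
      printf "%.2f %.2f\n", mean, sd
    }' "$1"
}

# Example call (path and column index are assumptions; check the header first):
# col_mean_sd exported-stats/stats.tsv 2
```

Running it once per column for each of the four approaches gives numbers that can be compared side by side.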
My questions are: why would these methods yield such hugely different results, and which parameters should I value most when determining the optimal method for my dataset? Sorry for the lengthy question...
Thank you!