I am going through the usual QIIME2 pipeline and I noticed that DADA2 denoise is removing almost 50% of the data (as seen in the denoising_stats.qzv file).
I ran the same data through a different tool, and it does not remove nearly as much data as DADA2 does. I used cutadapt to remove primers before running DADA2, with this command:-
qiime cutadapt trim-single --i-demultiplexed-sequences demux_seqs_original.qza --p-cores 8 --p-minimum-length 350 --o-trimmed-sequences demux_seqs.qza --p-adapter CCTACGGGAGGCAGCAG...ATTAGAWACCCBDGTAGTCC --p-discard-untrimmed
The DADA2 command I use is:-
qiime dada2 denoise-single \
  --p-n-threads 200 \
  --i-demultiplexed-seqs ./demux_seqs.qza \
  --p-trunc-len 0 \
  --output-dir DADA2_denoising_output \
  --verbose \
  &> DADA2_denoising.log
I have a few questions:-
Is it possible to find out why, specifically, denoising is removing so many sequence reads from the samples? I don't see quality as a big issue, as I am already removing low-quality data and primer sequences from the dataset.
Is it possible that all of the removed data is basically chimeras and they are just reported like this? I say this because within the denoising report, I don't see many chimeras being detected within the samples.
Yes! In newer versions of QIIME 2 and the q2-dada2 plugin, more detailed statistics are reported about this process. If you upgrade from 2021.11.0 to version 2022.11, for example, the DADA2 output will include the percentage of reads removed due to joining and chimera filtering as separate columns, which should answer your question.
That is possible!
I'm guessing something is wrong with read joining, but you will have to update and rerun DADA2 to find out!
Thank you for responding. I took your advice and updated QIIME 2 to version 2022.11. It ran successfully and generated a log file, but I am not seeing any of the extra information you mentioned. It shows the same columns as the previous version. Is there a different place I should look for the detailed statistics? Thank you for your help. Attachments: denoising_stats_v2.qzv (1.2 MB), DADA2_denoising_v2.txt (1.7 KB)
denoise-single will not have columns for joining, as joining/merging only applies to paired-end reads.
This means that the majority of your read losses are in the main dada() error correction / removal step (code here), just like you found when running the older version of DADA2.
I'm not sure what's causing this, or why some samples are more affected than others.
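If it helps to see where each sample loses its reads, here is a rough Python sketch that splits the loss into the filtering, denoising, and chimera-removal steps. It assumes the stats have been exported to a TSV (for example with qiime tools export) and that the columns are named as in the denoise-single stats table: input, filtered, denoised, non-chimeric. The numbers in EXAMPLE_TSV are made up for illustration and are not from the attached files, and loss_by_step is just a hypothetical helper name.

```python
import csv
import io

# Hypothetical numbers for illustration only; NOT taken from the poster's data.
# The layout mimics an exported denoise-single stats table (an assumption).
EXAMPLE_TSV = (
    "sample-id\tinput\tfiltered\tpercentage of input passed filter\t"
    "denoised\tnon-chimeric\tpercentage of input non-chimeric\n"
    "#q2:types\tnumeric\tnumeric\tnumeric\tnumeric\tnumeric\tnumeric\n"
    "sampleA\t10000\t9000\t90.0\t5500\t5200\t52.0\n"
    "sampleB\t8000\t7600\t95.0\t7000\t6800\t85.0\n"
)


def loss_by_step(handle):
    """Return {sample: {step: percent of input reads lost at that step}}."""
    reader = csv.DictReader(handle, delimiter="\t")
    results = {}
    for row in reader:
        sample = row["sample-id"]
        # Skip the metadata directive line (e.g. "#q2:types").
        if sample.startswith("#q2:"):
            continue
        n_input = int(float(row["input"]))
        n_filtered = int(float(row["filtered"]))
        n_denoised = int(float(row["denoised"]))
        n_nonchim = int(float(row["non-chimeric"]))
        if n_input == 0:
            # Avoid dividing by zero for empty samples.
            results[sample] = {"filter": 0.0, "denoise": 0.0, "chimera": 0.0}
            continue
        results[sample] = {
            "filter": 100.0 * (n_input - n_filtered) / n_input,
            "denoise": 100.0 * (n_filtered - n_denoised) / n_input,
            "chimera": 100.0 * (n_denoised - n_nonchim) / n_input,
        }
    return results


if __name__ == "__main__":
    for sample, losses in loss_by_step(io.StringIO(EXAMPLE_TSV)).items():
        print(sample, losses)
```

To run this on real data, replace io.StringIO(EXAMPLE_TSV) with an open file handle for the exported TSV. A large "denoise" percentage next to small "filter" and "chimera" percentages would point at the dada() step rather than at quality filtering or chimeras.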