DADA2 losing a large number of reads at the denoise step

wjschuiten · November 6, 2019, 10:33am

Hey everyone,

I've been struggling with DADA2 for the past few weeks, and while I think I have a basic understanding of the program, I can't stop it from filtering out >50% of my reads.

I'm running paired-end 18S Illumina reads; the adapters and primers have been trimmed off, leaving 300 bp paired-end reads. The quality plots look as follows:

[Album] ASV quality plots

I'm using a subset of two samples out of nine to configure dada2 parameters.

I entered the raw reads into DADA2 without any pre-filtering and ran the following command:
qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --p-trunc-len-f 220 --p-trunc-len-r 160 --p-n-threads 4 --o-denoising-stats stats.qza --o-table dada-table.qza --o-representative-sequences rep-seqs.qza --verbose

The first sample had 60k reads and the second 70k reads. After filtering, ~45k were left for both; after denoising, ~27k; and after merging, only 9k. So I decided to try the forward reads only.

I ran the following command for denoise-single:
qiime dada2 denoise-single --i-demultiplexed-seqs paired-end-demux.qza --p-trunc-len 220 --p-n-threads 4 --o-denoising-stats stats.qza --o-table dada-table.qza --o-representative-sequences rep-seqs.qza --verbose

This led to the following stats:
|sample-id|input|filtered|denoised|non-chimeric|
|---|---|---|---|---|
|fun1|261792|222325|154986|123869|
|fun2|282574|235014|172750|144535|
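
To see where the reads are lost, the per-step retention can be worked out from the table above. This is a minimal Python sketch: the counts are copied from the table, while the names `stats`, `STEPS`, `step_retention` and `overall_retention` are illustrative and not part of QIIME 2 or DADA2.

```python
# Read counts copied from the denoise-single stats table above.
stats = {
    "fun1": {"input": 261792, "filtered": 222325, "denoised": 154986, "non-chimeric": 123869},
    "fun2": {"input": 282574, "filtered": 235014, "denoised": 172750, "non-chimeric": 144535},
}

# Pipeline steps in the order they appear in the table.
STEPS = ["input", "filtered", "denoised", "non-chimeric"]


def step_retention(counts):
    """Fraction of reads kept at each step, relative to the previous step."""
    return {
        step: counts[step] / counts[prev]
        for prev, step in zip(STEPS, STEPS[1:])
    }


def overall_retention(counts):
    """Fraction of input reads that survive to the final step."""
    return counts[STEPS[-1]] / counts[STEPS[0]]


for sample, counts in stats.items():
    per_step = ", ".join(f"{s}: {r:.1%}" for s, r in step_retention(counts).items())
    print(f"{sample}: {per_step}; overall: {overall_retention(counts):.1%}")
```

Relative to the previous step, denoising keeps roughly 70% (fun1) and 74% (fun2) of the reads, which is the largest single drop in both samples; overall retention comes out at roughly 47% and 51%.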

I'm still losing over 50% of my data during the DADA2 step. Denoising seems to be the biggest factor in the forward-read run, but I am unable to improve the numbers by tweaking the filter and truncation parameters. I'm starting to feel the problem may be with the data itself, but I have no clue where the problem may be. Any help on the issue is much appreciated. I have uploaded the files to my Dropbox; they can be downloaded with the following link:

https://www.dropbox.com/s/tjf1uhfkx4jl7ja/FunSampleSubset.rar?dl=0

Mehrbod_Estaki · November 6, 2019, 10:09pm

Hi @wjschuiten,

I think your hunch is correct that the problem lies with the data itself. I draw this conclusion because, as you mentioned, even with single-end reads you are losing about half of your reads. You may be able to increase retention a little by truncating the forward reads a bit more, say down to 180-220, but I imagine the gain would be minimal. Other options are to relax the maxee parameter (change it to, say, 4) and to increase the number of training reads. Be cautious with relaxing maxee, though: discarded sequences are usually discarded because they are low in quality. Nevertheless, I would say that will be your best bet for increasing retained reads in a significant manner. Hope that helps, keep us posted!
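
For context on why both truncation and maxee affect the filter step: the expected-error filter sums the per-base error probabilities implied by the Phred scores, EE = sum(10^(-Q/10)), and drops reads whose EE exceeds the threshold. The sketch below is a plain-Python illustration of that idea, not DADA2's actual code; the function names, the example read, and the default threshold of 2.0 are assumptions made for the example.

```python
def expected_errors(phred_scores):
    """Sum of per-base error probabilities: EE = sum(10 ** (-Q / 10))."""
    return sum(10 ** (-q / 10) for q in phred_scores)


def passes_max_ee(phred_scores, max_ee=2.0, trunc_len=None):
    """Truncate first (if requested), then compare expected errors to max_ee.

    The 2.0 default is an assumption for this sketch.
    """
    if trunc_len is not None:
        if len(phred_scores) < trunc_len:
            return False  # reads shorter than the truncation length are dropped
        phred_scores = phred_scores[:trunc_len]
    return expected_errors(phred_scores) <= max_ee


# Made-up read: 180 good bases (Q35) followed by a 40-base low-quality tail (Q12).
read = [35] * 180 + [12] * 40

print(round(expected_errors(read), 2))                  # about 2.58
print(passes_max_ee(read, max_ee=2.0))                  # False: the tail pushes EE over 2
print(passes_max_ee(read, max_ee=2.0, trunc_len=180))   # True: the tail is cut off
print(passes_max_ee(read, max_ee=4.0))                  # True: the threshold is relaxed
```

In this toy case either change rescues the read, but only truncation removes the low-quality bases; relaxing the threshold passes them on to the denoising step, which is the reason for the caution above.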

wjschuiten · November 8, 2019, 8:03am

Hey @Mehrbod_Estaki

Thanks for your reply. I've done several more runs, and it seems that trimming the first 30-40 bp off allows me to retain ~65% of my data at truncation 180-200, which is within acceptable values. Giving DADA2 very generous quality arguments (trunc-q 0 and max-ee 20) barely affects the number of filtered reads. Perhaps there are still some primers attached. I'll proceed with these results; thanks for your thoughts.
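
One way to test the leftover-primer idea is to count how many reads start with the primer sequence, allowing for IUPAC degenerate bases. The sketch below is only an illustration: the primer and reads are made-up placeholders (not the primers used in this run), and the function names and the mismatch tolerance are assumptions.

```python
# IUPAC nucleotide codes mapped to the bases they can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def starts_with_primer(read, primer, max_mismatches=1):
    """True if the start of the read matches the (possibly degenerate) primer."""
    if len(read) < len(primer):
        return False
    mismatches = sum(
        base not in IUPAC[code]
        for base, code in zip(read.upper(), primer.upper())
    )
    return mismatches <= max_mismatches


def fraction_with_primer(reads, primer, max_mismatches=1):
    """Fraction of reads whose 5' end matches the primer."""
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(starts_with_primer(r, primer, max_mismatches) for r in reads)
    return hits / len(reads)


# Placeholder primer and reads, invented for this example only.
primer = "GGWACYTG"
reads = [
    "GGAACCTGTTTACG",  # exact match (W=A, Y=C)
    "GGTACTTGACCA",    # exact match (W=T, Y=T)
    "TTTTCCCCGGGGAA",  # no primer
    "GGAACCTAGGCT",    # one mismatch in the last primer position
]

print(fraction_with_primer(reads, primer, max_mismatches=1))
print(fraction_with_primer(reads, primer, max_mismatches=0))
```

If a large fraction of the real forward reads matched the real primer, that would support trimming the start of the reads, as was done here by trimming the first 30-40 bp.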

Wouter