Denoising issues in Dada2

kkrigul · September 3, 2020, 10:03am

I am having difficulty denoising my paired-end data because base quality also drops in the middle of the reads. I am seeing quite a drop in read counts after the filtering and merging steps. Is this OK, or should I use only the forward reads? If I use only forward reads, will I be missing data later in the taxonomic assignment step? Alternatively, if I leave the reverse reads in the analysis, will I see falsely identified taxa? What are your recommendations based on my results?

I am using QIIME2 version 2019.7 installed on my workplace server.
Each read is 251 bp long. Our paired-end reads cover the V3-V4 region, amplified with primers 337F (16S_F CCTACGGGNGGCWGCAG) and 805R (16S_R GACTACHVGGGTATCTAATCC), so the amplicon should be 805 - 337 = 468 bp long. The overlap should therefore be 502 (combined read length) - 468 (amplicon length) = 34 bp.
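As a rough sketch of the arithmetic above, the overlap can be computed with a small shell function. The function name `expected_overlap` is made up for illustration, and the calculation assumes that both reads start at the outer ends of the amplicon. As far as I understand, the `--p-trim-left-*` values do not enter this calculation because they cut the 5' ends of the reads, not the overlapping 3' ends.

```shell
# Expected overlap (in nt) between the forward and reverse read.
# $1 = forward read length, $2 = reverse read length, $3 = amplicon length
# Assumption: both reads start at the outer ends of the amplicon.
expected_overlap() {
  echo $(( $1 + $2 - $3 ))
}

amplicon_len=$(( 805 - 337 ))   # 468, from the primer positions above

# Untruncated 251 bp reads, as in the post
echo "overlap: $(expected_overlap 251 251 "$amplicon_len") nt"
```

Passing truncated read lengths instead of 251 shows how much overlap is left after truncation.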

The command I was using is:
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trim-left-f 15 \
  --p-trim-left-r 12 \
  --p-trunc-len-f 250 \
  --p-trunc-len-r 247 \
  --p-n-threads 20 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

Attachments: denoising-stats.qzv (1.2 MB), visualisatsioon.qzv (295.2 KB)

I am adding visualization of the quality plot of forward reads and denoising stats. Thanks for your help!

kkrigul · September 7, 2020, 9:10am

I am also adding the results I got when I used only the forward reads:
denoising-stats-forward.qzv (1.2 MB) rep-seqs-forward.qzv (2.0 MB)

Command that I used was this:
qiime dada2 denoise-single \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trim-left 15 \
  --p-trunc-len 247 \
  --p-n-threads 20 \
  --o-table table-dada2-forward.qza \
  --o-representative-sequences rep-seqs-dada2-forward.qza \
  --o-denoising-stats stats-dada2-forward.qza

ChrisKeefe · September 9, 2020, 9:05pm

Welcome to the forum, @kkrigul!
I'm not a bioinformatician, but your numbers look reasonable and your approach seems well thought out. It looks like you're losing more sequences to quality filtering than to merging, which seems like a good thing here; you're allowing your reads enough room to merge.

q2-dada2 requires 12 nt of overlap between the forward and reverse reads. You could try making your trim/trunc parameters slightly more aggressive to squeeze a few more sequences through the quality filter. Be careful to leave a few nt of "padding" in addition to the 12 nt of overlap needed to merge, though. Natural variation in region length happens, and you don't want to introduce bias by inadvertently dropping any sequence that's a couple of nt longer than average.
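A minimal sketch of that check, using the truncation values and the 468 nt amplicon estimate from the original post. The helper name `overlap_margin` is hypothetical, and the calculation ignores quality filtering and mismatches in the overlap, so treat the result as a rough guide only.

```shell
# Minimum overlap needed to merge, as stated above
MIN_OVERLAP=12

# Spare overlap ("padding") left after truncation, in nt.
# $1 = forward truncation length, $2 = reverse truncation length,
# $3 = amplicon length
overlap_margin() {
  echo $(( $1 + $2 - $3 - MIN_OVERLAP ))
}

# Values from the original post: truncation at 250/247, amplicon about 468 nt
margin=$(overlap_margin 250 247 468)
if [ "$margin" -lt 0 ]; then
  echo "reads are $(( 0 - margin )) nt too short to merge"
else
  echo "spare overlap (padding): $margin nt"
fi
```

Trying shorter truncation lengths here before rerunning the denoising step gives a quick idea of how much padding would be left.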

Feel free to respond here if you have any specific questions or concerns!
Chris

kkrigul · September 11, 2020, 10:45am

Thank you for your kind response! I want to clear up some details just to make sure I understand everything correctly. Do I understand correctly that I can also just continue with the parameters I chose? What will I lose, for example, if I use only the forward reads?

I'd like a little more information on how I would do that. Would this be a problem with the parameters I used previously?

Thanks!
I am still learning and want to ask the pros as much as possible.

ChrisKeefe · September 11, 2020, 7:25pm

You can probably continue with the parameters you've chosen, but I'd like to reiterate that I am not a bioinformatician. Some attrition is to be expected when denoising data. Every analysis presents decision points where there is no universally "right" answer, and you must make the best compromise you can for your individual study. It's worth taking the time to figure out exactly what you're doing when you choose these parameters, so you can make good choices for your study independently. You can always play around with the parameters you've chosen to try to optimize the number of retained sequences.

Consider spending some time with the literature, and maybe reading forum posts from other users, so you have the context to decide whether your results are reasonable. (This forum's search feature is awesome!) Are you losing more data during denoising than other researchers working with similar communities?
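One way to put a number on attrition when comparing runs or studies is the percentage of input reads that survive to the end of denoising. The sketch below assumes you have read the "input" and "non-chimeric" counts for a sample from the denoising stats; the function name `pct_retained` and the counts in the example are hypothetical, not taken from the attached files.

```shell
# Percentage of input reads retained after the final denoising step.
# $1 = input read count, $2 = final (e.g. non-chimeric) read count
pct_retained() {
  awk -v input="$1" -v final="$2" 'BEGIN { printf "%.1f\n", 100 * final / input }'
}

# Hypothetical counts for one sample
pct_retained 10000 6500
```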

If you use only the forward reads, you'll lose half of the data you paid for. Shorter reads could mean less accurate feature classification, loss of variation, or even loss of entire regions of interest to your study if you drop too much data. I suspect you chose paired-end sequencing for a reason. This doesn't necessarily mean that it's a bad idea to use forward reads only; sometimes the data quality is so bad that we have to make concessions. If you can preserve more data, though, you probably should.

To see why padding matters, imagine you are targeting a region that is 100 nt long on average. Insertions and deletions happen, so some taxa will actually have 98 nt sequences, while others might have 103 nt. If you trim/truncate your sequences to exactly 100 nt, you will have systematically removed all taxa with longer-than-average sequences, which could bias your results.
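The same reasoning can be sketched for paired-end merging, using the truncation values from the original post (250 and 247) and the 12 nt minimum overlap. The helper names and the list of amplicon lengths are hypothetical, and the check ignores quality filtering and mismatches in the overlap.

```shell
# Minimum overlap needed to merge
MIN_OVERLAP=12

# Longest amplicon whose reads still overlap by MIN_OVERLAP nt.
# $1 = forward truncation length, $2 = reverse truncation length
max_mergeable_length() {
  echo $(( $1 + $2 - MIN_OVERLAP ))
}

# Prints "yes" if an amplicon of length $3 can still merge, else "no".
can_merge() {
  if [ "$3" -le "$(max_mergeable_length "$1" "$2")" ]; then
    echo yes
  else
    echo no
  fi
}

echo "longest mergeable amplicon: $(max_mergeable_length 250 247) nt"
for len in 460 468 480 490; do   # hypothetical amplicon lengths
  echo "$len nt: $(can_merge 250 247 "$len")"
done
```

Any taxon whose amplicon is longer than the reported maximum would fail to merge and drop out of the results, which is the kind of systematic loss described above.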

I don't know whether your parameters would cause this problem. Different taxa have different levels of sequence length variability. (IIRC, fungal sequences are highly variable in length, while most bacterial 16S regions are reasonably consistent.) It's your responsibility as a researcher to get to know the critters you're working with a bit, so that you don't systematically remove certain taxa during analysis.
