Please help dada2 trun parameters PE loss of up to 70% seq. after merging??

I need quick support if I am using the right truncate parameters here to apply to my paired end demultiplexed data using dada2, I am not sure if I have enough overlap as my sequence length statistics varies as follows

Here is my demux.qzv and command, also screenshot

crassa-demux.qzv (308.3 KB)

qiime dada2 denoise-paired
--i-demultiplexed-seqs /mnt/d/16S_WinterData_Files/A_crassa/A_crassa-demux.qza
--o-table /mnt/d/16S_WinterData_Files/A_crassa/dada2/crassa-table.qza
--o-representative-sequences /mnt/d/16S_WinterData_Files/A_crassa/dada2/crassa-rep-seqs.qza
--p-trim-left-f 0
--p-trim-left-r 0 
--p-trunc-len-f 293
--p-trunc-len-r 227
--o-denoising-stats /mnt/d/16S_WinterData_Files/A_crassa/dada2/crassa-stats-dada2.qza

Here is my stats after denoising
crassa-stats-dada2.qzv (1.2 MB)

Looking forward to get any help to move forward!!

Hello @Sabrin,

What region did you sequence, how long is that region, and how long do you expected you overlap to be when trimming at 293 and 227?

I ask because if you have enough overlap, you could trim off more of the R2 reverse read (maybe 220 or 210 instead of 227), which should allow more of your reads to join.

Thanks for your support. V3/V4 region length 300 bps.

I calculated ovelap like that,
300+300- mean sequence length- trunc bp F- trunc bp R
300+300- 458 - 7 - 73 equals 62

Should I calculate based on the mean sequence length or the maximum sequence length as my sequences legth varies as follows

If I calculate based on Max length, then i will have only 12 bp overlap as my sequences have a range of 103, If this case applies then i actually have less overlap as required and this could be my problem , right??

Why i have 103 range in the legth of my sequences based on the sequence length statistics? Is this normal?

For my reverse reads, can i keep reads even the quality 25th percentile is lower than 30?

I tried to trunc less from the Forward reads as follows

--p-trim-left-f 0
--p-trim-left-r 0
--p-trunc-len-f 300
--p-trunc-len-r 228

and I got almost identical merging results, it doesnot improve anything..

Finally, when i use forward read only i retain many of my sequences as non-chimeric

which makes me as, Is in this case using single end is better? Or using paired end has advantages in the downstream analysis?

Good morning!

Let's work backwards:

Looks like using single-end is working great right now. However, paired ends will give us longer reads and more taxonomic resolution, so that's best, IF we can get them to merge.

Try trimming shorter, as this will lead to less mismatches in the area of overlap and should get more reads to join. Perhaps --p-trunc-len-f 260 --p-trunc-len-r 200

This is normal if the reads were pre processed or pre-trimmed before importing into Qiime2. It's not normal coming right off the Illumina machine. Do you know how your reads were reprocessed to get these variable lengths? We recommend doing all you processing within Qiime2 so there's a record of what was done to the data. This also lets us avoid issues with variable length reads. If you can get the raw data, that would work best!

Makes sense to me! My formula is a little different but they are mathematically equivalent.

trunc-len-r + trunc-len-l - length-of-amplicon = overlap
293 + 227 - 300 = overlap
220 = overlap

And finally,

That seems pretty short to me for the full V3-V4 region... :thinking:
(See the regions listed in Johnson 2019, Figure 1 and Sup Table 1 (PDF))
Let's double check that number.

With such parameters, **--p-trunc-len-f 260 --p-trunc-len-r 200** i got even more worth results as following with less frequency of features too. why??

crassa-rep-seqs.qzv (611.6 KB)
crassa-stats-dada2.qzv (1.2 MB)
crassa-table.qzv (502.6 KB)

Good morning Abdelghany,

With too much truncation, the reads can't join because they don't overlap. With too little truncation, the reads can't join because there are too many mismatches. The goal is to find truncation settings that give us the maximum joining possible. If going shorter makes it worse, try going longer! (Or try something in the middle.)

Have you confirmed the length of region sequenced? Can you get the raw data from somewhere?

Thank a lot for the quick response.

Length of region sequenced I calculated based on primers used
806-341 equals 465, Is this correct?

I have raw data demultiplexed, by genomic center, which makes me wonders Is in my raw seqs are still non biological seqs not removed? as i am trying with another part of the data that belong to different experiment and i got good merge there but much loss as chimeric!! If you could advice me how to check that using qiime would be great to move forward there too.

So if 465 plus minimum 12 bps overlap, then 485 at least, right? but I have reads shorter as my min reads length is as follows

so it should not be less that 508 bps or 508+12 overlap?

Yeah, that sounds good! It's a lot closer than 300 bp!

With that new expected amplicon length, here's how I would do the math.

trunc-len-r + trunc-len-l - length-of-amplicon = overlap
260 + 200 - 465 = overlap
-5 = overlap
5 bp gap!! :-(

This would explain why many reads failed to join!

The fact that your input reads have variable length tells me that they must have been processed after coming off the machine, as Illumina reads are always a consistent length. I think getting a copy of your raw data directly after demultiplexing and before any quality filtering would be helpful.

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.