Hi,
I need quick support if I am using the right truncate parameters here to apply to my paired end demultiplexed data using dada2, I am not sure if I have enough overlap as my sequence length statistics varies as follows
What region did you sequence, how long is that region, and how long do you expected you overlap to be when trimming at 293 and 227?
I ask because if you have enough overlap, you could trim off more of the R2 reverse read (maybe 220 or 210 instead of 227), which should allow more of your reads to join.
If I calculate based on Max length, then i will have only 12 bp overlap as my sequences have a range of 103, If this case applies then i actually have less overlap as required and this could be my problem , right??
Why i have 103 range in the legth of my sequences based on the sequence length statistics? Is this normal?
For my reverse reads, can i keep reads even the quality 25th percentile is lower than 30?
I tried to trunc less from the Forward reads as follows
Looks like using single-end is working great right now. However, paired ends will give us longer reads and more taxonomic resolution, so that's best, IF we can get them to merge.
Try trimming shorter, as this will lead to less mismatches in the area of overlap and should get more reads to join. Perhaps --p-trunc-len-f 260 --p-trunc-len-r 200
This is normal if the reads were pre processed or pre-trimmed before importing into Qiime2. It's not normal coming right off the Illumina machine. Do you know how your reads were reprocessed to get these variable lengths? We recommend doing all you processing within Qiime2 so there's a record of what was done to the data. This also lets us avoid issues with variable length reads. If you can get the raw data, that would work best!
Makes sense to me! My formula is a little different but they are mathematically equivalent.
That seems pretty short to me for the full V3-V4 region...
(See the regions listed in Johnson 2019, Figure 1 and Sup Table 1 (PDF))
Let's double check that number.
With such parameters, **--p-trunc-len-f 260 --p-trunc-len-r 200** i got even more worth results as following with less frequency of features too. why??
With too much truncation, the reads can't join because they don't overlap. With too little truncation, the reads can't join because there are too many mismatches. The goal is to find truncation settings that give us the maximum joining possible. If going shorter makes it worse, try going longer! (Or try something in the middle.)
Have you confirmed the length of region sequenced? Can you get the raw data from somewhere?
Length of region sequenced I calculated based on primers used
806-341 equals 465, Is this correct?
I have raw data demultiplexed, by genomic center, which makes me wonders Is in my raw seqs are still non biological seqs not removed? as i am trying with another part of the data that belong to different experiment and i got good merge there but much loss as chimeric!! If you could advice me how to check that using qiime would be great to move forward there too.
So if 465 plus minimum 12 bps overlap, then 485 at least, right? but I have reads shorter as my min reads length is as follows
Yeah, that sounds good! It's a lot closer than 300 bp!
With that new expected amplicon length, here's how I would do the math.
trunc-len-r + trunc-len-l - length-of-amplicon = overlap
260 + 200 - 465 = overlap
-5 = overlap
Negative overlap?? That's not overlap at all, that is a
5 bp gap!! :-(
This would explain why many reads failed to join!
The fact that your input reads have variable length tells me that they must have been processed after coming off the machine, as Illumina reads are always a consistent length. I think getting a copy of your raw data directly after demultiplexing and before any quality filtering would be helpful.