I am trying to use dada2 on my 16S paired-end sequences, from a 2 x 250 bp illumina run. But I am losing ~50% of my reads at the merging step. This is despite the fact that there should be enough overlap.
My reads contain primers. I have tried both: removing the primers with cutadapt prior to running dada2; and I have tried trimming primers with the p-trim-left parameter in dada2. But still see the same problem of loss of reads during merging.
I have also tried to use the deblur method, where I first remove primers with cutadapt then join the reads using vsearch. During this step there are no problems merging my reads. However, I then lose a lot of reads during deblur denoise. Is it normal to lose the majority of your reads in deblur?
Given that they merge with vsearch, I am not sure why they don't with dada2.
My primers are 515F - 926R. When I merge with vsearch, most of the merged reads have between 370-375 bp.
Below is my original demux file. Demux after joining. Dada2 stats when primers were removed with trimming. Dada2 stats after primer removal with cutadapt. Deblur stats.
I am a bit confused what you mean here. Dada2 trims/truncates reads with the parameters that I specified - and there should still be plenty of overlap so this shouldn’t cause problems merging. It also filters and denoises, but the loss of reads during this steps is recorded in the dns, and while there is some loss of reads (as expected), there is not a huge loss of reads. The main loss comes at the merging step. Are you saying that it is possible that the denoising is then leading to issues merging?? I am very new to this so still learning.
Yes, that’s what I’m saying. (It may be the quality filtering step, rather than the denoising). But, essentially in one of these steps, the algorithm may sometimes trim lower quality sequences before your cut off.
I’m not sure about this. I know that the --p-trunc-q parameter (with a default of 2) truncates reads with low quality sequences. But it says in the documentation that “If the resulting read is then shorter than ‘trunc-len-f’ or ‘trunc-len-r’ (depending on the direction of the read) it is discarded.” So the read would not end up too short, it would just be discarded, in which case I would see the loss of reads in the filtered column of the dns.
With your primer set we would expect an amplicon size of 926-515= ~414 bps. Given your 2x250 run we would expect about 500-414= ~ 96 bp overlap region. Given your truncating parameters of 225 and 200, you are cutting ~75 bps from the overlap region. This leaves us with 96-75=21 bp overlap region left. q2-dada2 also requires a minimum of 12 nt overlap to merge reads, otherwise it will discard them. So that leaves us with 9 bp overlap to spare.
Now consider that your region is naturally (slightly) variable and so any reads that are 10 bp longer than our crude average calculation will not merge properly and get discarded. I suspect this is why you are losing so many reads at this step. This is my guess anyways But I think its important enough to redo this step. Mainly because if my hunch is right, you may have already introduced bias by removing specific taxa that are naturally a bit longer from your dataset. This may or may not be significant.
a) Update to qiime 2019.7, then
b) Relax your truncating parameters a bit. You have great quality reads on your forwards, so maybe truncate less, say set it to 243, and your reverse say at 210.
In theory this should improve your merging step and include those other taxa. (look for higher number of unique features in your summary).
I appreciate your response but I am unsure if it is correct. You said:
With your primer set we would expect an amplicon size of 926-515= ~414 bps.
I thought that this was the amplicon size before removing primers. However, in my case I already removed the primers with cutadapt. Moreover, when I remove primers with cutadapt and then merge reads with vsearch, most of the merged reads have between 370-375 bp. This is consistent with my understanding that the amplicon length is without primers, because ~414bps - 39bps=~375 (my primers are 19 and 20 bps).
You are right, I didn’t realize the run being referred to was the one you had removed the primers using cutadapt first. I just looked at the quality plots which had shown 2x250 run (usually means primers haven’t been removed). So you are correct that your merging should have been suffice with 41 and 50 overlaps between your 2 DADA2 runs. The difference then probably coming between Vsearch and DADA2 merging, and it would be somewhat hard directly comparing these two. For example DADA2 doesn’t allow any N nucleotides which removes all reads with an N, whereas Vsearch by default doesn’t have this. Vsearch also does have some odd default parameters which I haven’t quite figured out and it does raise some concerns for me, for example: its minoverlap is 10 but so is its allowable mismatches (default=10) in that region. The maxEE is 1 in vsearch and 2 with DADA2. Perhaps @colinbrislawn can throw in his two cents in there regarding these default parameters.
Anyways, I think your best run is probably the dns-trim2 and you would only really have about 3 samples with low total counts.
One side suggestion also, from your vsearch joined quality plots it looks like your reverse reads contained a bunch of reads without the primers <- this explains the sudden dip in quality at the 374 pt. When I need to use vsearch for this purpose I typically turn on the discard-untrimmed parameter to get rid of those reads, in my mind if they don’t have the primers, they are most likely noise and artifacts which I don’t want to keep anyways.
ps I Just noticed that I also did simple subtractions wrong in my original post 500-414 = 86…not 96
Hmm, those Vsearch defaults do seem a little odd. And yes, the best run is the dns-trim2 (with only 3 samples with low counts); however, I am still a little concerned given that I don’t know why I am losing so many reads in the merging step - is this an issue? is it going to lead to bias?
I personally don’t think you should be worried about any bias here since you have a decent overlap region and 16S amplicons tend not to have large length ranges. If you had a shorter overlap region say 10-20 bp then I might get worried about bias because it is possible that you would exclude taxa that were naturally a bit larger than others, but in your case I think ~50bp is on the safe side of things.
Why the merging is failing is not clear to me either, it could be because of mismatches? The default allowable mismatches in overlap region for DADA2 is 0 which is much more strict than whatever Vsearch has. What I am not sure about however is whether that is treated the same if the overlap region is 20 vs 50, you could imagine that as the overlap region increases the likelihood of a mistmach also increases. As to why those mismatches exist in the first place, that may be dependent on how accurate the error model was that inferred the sequences in the first place. As you can see there are alot of moving parts, its hard to isolate out the exact cause of this. You may have better luck doing this in the native DADA2 in R as it allows you to retain those discarded reads and so you could do some exploration there.
One thing that MAY help, if the problem is due to subpar error model, is increase the training reads # by increasing --p-n-reads-learn. This is just a guess though, not sure how much of a difference it would actually make. Sorry couldn’t be more helpful!
Hi @Mehrbod_Estaki, you have given me a lot to think about! I will try and truncate further, thus reducing overlap region and potential number of mismatches. If that doesn’t help, I will play around with the training reads #. I will let you know how it goes.
Looking forward to hearing of your results. I do think I’m missing something important here with regards to the “increased mismatch with increased overlap” theory, I doubt that would be the case, seems a bit careless? but who knows…
You can always try DADA2 in R and setting mismatch to 1 or 2 to see if that improves things.
I tried truncating further as well as increasing training reads #. Neither made much difference to number of reads merging.
I don’t yet know how to use DADA2 in R, but this is something I might look into in the future.
Thanks for the help.
I suggest one method. Do not truncate your reads as an examination. Then compare the two results (old and new ones). You will catch an idea!
If you did not find out something then share your results alongside more info, such as primer sets, Qiime2 version, commands, and table screenshot.