DADA2 cannot merge a large number of reads

Hi developers,
Thanks so much for such great software.
I ran into a problem during the DADA2 merging step. My primers are 515F-806R, and the raw reads are 151 bp.
I do not know why all of my reads are lost at the merging step, which is not normal. How can I change the parameters to fix this? I have looked through posts on the QIIME 2 forum but still have not found the answer.

Appreciate it!

demux.qzv (295.4 KB)

Hi @Brandon, thanks for the table, as well as the demux summarize viz, those are both good things for us to see. Unfortunately though, we appear to be missing the command you ran, which has important information in it, like your trim/trunc parameters. Can you please provide that here? Thanks!

Hi, @thermokarst,
Thank you for the quick reply. :smiley:
I used this command:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trunc-len-f 151 \
  --p-trunc-len-r 151 \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza \
  --p-n-threads 0

Hi, @thermokarst,

I think I found the reason after going back to the DADA2 source page.
I tested the R1 reads alone, and the results show that 100% of the R1 reads can be taxonomically classified. So I think the problem with my data could be that the paired-end reads do not have enough overlap.
Is there a way to merge reads that do not overlap, e.g., with an insert?
Tons of thanks!

That is correct. You need 20 nt of overlap for dada2 to successfully join your reads. The V4 region is around 290 nt long, so 150 + 150 is not quite long enough for many to join.
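As a rough check on that arithmetic, here is a minimal Python sketch. It assumes both reads are untrimmed and together span the amplicon end to end, uses the 20 nt minimum quoted above, and compares the roughly 290 nt figure with a 253 nt amplicon (the merged length reported later in this thread). The function name is made up for illustration and is not part of DADA2 or QIIME 2.

```python
def expected_overlap(len_f: int, len_r: int, amplicon_len: int) -> int:
    """Bases covered by both reads; a negative value means a gap between them."""
    return len_f + len_r - amplicon_len


MIN_OVERLAP = 20  # figure quoted in the reply above; treated here as an assumption

for amplicon_len in (253, 290):
    overlap = expected_overlap(151, 151, amplicon_len)
    print(amplicon_len, overlap, overlap >= MIN_OVERLAP)
```

Any truncation applied with --p-trunc-len-f or --p-trunc-len-r shortens the reads and reduces the overlap further.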

One alternative: follow this tutorial to join with vsearch, then denoise with deblur.

The other alternative: use only forward reads.

Merging reads that do not overlap is not possible in QIIME 2. But if you figure out a way to do this on the raw fastq reads, you can then import those into QIIME 2 and use deblur or OTU clustering to process those data.

Hi, @Nicholas_Bokulich,
Thanks so much for the response.
I have tried deblur with that method, but it removed more than 50% of my reads and reported them under fraction-artifact-with-minsize. I am not very confident in this result. Why are all these reads treated as artifacts? The total number of reads shrank a lot, even though I understand that 16S analysis is based on composition rather than on read counts.
Again, I really appreciate your help. Happy QIIMING :rofl:

I’m not an expert on deblur, but my understanding is that what you are seeing is normal.
It subtracts the expected error* from the read counts which has the side effect of removing spurious OTUs.

*I have no idea how it determines this however.


Hi, @ebolyen,

Appreciate your help. :slight_smile:

But when I compare my results with the OTU clustering results (VSEARCH with dereplicate-sequences), the two differ a lot. I am not sure which one is better, since DEBLUR discarded so many reads. Even though the two methods have not been benchmarked against each other, as far as I know, I think the results should not differ too much. :thinking:
Or can the DADA2 source code (in R) be changed to add an insert so that short reads can be merged?
I am so confused now. I love QIIME 2 and have been using it for more than a year. I do not want to switch software. :mask:
Can any expert help me? Tons of thanks.

Hi @Brandon,

Which threshold did you apply in the OTU clustering step?
The denoising methods (either dada2 or deblur) are, at least the way I see them, a kind of clustering at 100% similarity after error-correcting the reads (though they apply two different methods to identify and correct sequence errors).
So it is kind of hard to compare these methods on your dataset, unless you have a positive control in the dataset to use as a reference.
I’m not very used to this read length, but when using VSEARCH to cluster R1 into OTUs, I would be worried that it would over-cluster a bit given the relatively short length of your reads (but as I said, I think this is difficult to test without a known community in the pool). I don’t really see the point in using a very strict clustering threshold without any error correction step on the sequences (but others may disagree on that …)
In your place, I would process R1 with either dada2 single or deblur.
Best wishes,
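
To put numbers on the thresholds discussed in the reply above, the sketch below counts how many mismatches a given percent identity tolerates at a given read length. It is a simplification: it ignores indels and does not reproduce how VSEARCH actually computes identity, and the function name is hypothetical.

```python
def max_mismatches(length: int, percent_identity: int) -> int:
    """Largest number of mismatches that still meets the identity threshold.

    Uses integer arithmetic to avoid floating-point rounding surprises.
    """
    return length * (100 - percent_identity) // 100


# (read length in bp, identity threshold in percent)
for length, pct in [(151, 99), (253, 99), (151, 100)]:
    print(length, pct, max_mismatches(length, pct))
```

At 99%, a 151 bp read may differ from its cluster centroid by only a single base, while a 100% threshold tolerates none, which is why error correction before such strict grouping matters.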

deblur and OTU clustering have many large differences; the most important difference here is that deblur uses a set of reference sequences to perform a rough “positive filter”, i.e., discard anything that does not at least remotely resemble those reference sequences. In your case, deblur is removing around half of your reads because they appear to be “artifact” in the biological sense, i.e., they do not resemble the reference sequences. So the fact that sequences are being removed is a good thing; you are removing garbage!
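
The principle of a positive filter can be illustrated with a toy example. This is not how deblur implements it; the k-mer comparison, the threshold, the sequences, and all names below are assumptions chosen only to show the idea of keeping reads that resemble a reference set.

```python
def kmers(seq: str, k: int) -> set:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def resembles_reference(read: str, references: list, k: int = 4,
                        min_shared: float = 0.5) -> bool:
    """True if the read shares at least min_shared of its k-mers with any reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return False  # read shorter than k: nothing to compare
    for ref in references:
        shared = len(read_kmers & kmers(ref, k)) / len(read_kmers)
        if shared >= min_shared:
            return True
    return False


# Made-up sequences: one reference, one near-match, one unrelated read.
references = ["ACGTACGTTGCAAGCTTAGC"]
reads = ["ACGTACGTTGCAAGCTAAGC", "TTTTTTTTTTTTTTTTTTTT"]

kept = [r for r in reads if resembles_reference(r, references)]
print(kept)
```

The near-match passes and the unrelated read is dropped, which mirrors the "discard anything that does not at least remotely resemble the references" behaviour described above.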

Both have been benchmarked quite a bit; see the literature, especially the original deblur publication and a head-to-head comparison of these methods.

Hi, @llenzi,

I used 0.99 for the clustering. Originally I wanted to use DADA2, but DADA2 cannot merge most of the sequences because the overlaps are too short (<20 bp).
I am not clustering R1 alone into OTUs; I cluster the merged and cleaned R1-R2 reads.
I am also concerned about non-biological sequences. But I checked R1 in DADA2, and the DADA2 results tell me that most of the reads are usable biological sequences. DADA2 only removes a few sequences.
So I am very confused. :pensive:



Hi, @Nicholas_Bokulich,
Thanks for the suggestions and the pointers to the papers. Now I understand DEBLUR much better. I believe denoising methods give me more biologically meaningful sequences. I have been using DADA2 for a while, and it is really a great method.
I just picked some of my samples and analyzed them with three methods: DADA2 with R1, VSEARCH 99% OTUs with R1+R2, and DEBLUR with R1+R2.
Judging by the microbial composition, the results are confusing, and DEBLUR is very different from the other two.

Hi @Brandon,
So you are merging R1 and R2 first, then applying VSEARCH 99% OTU clustering or DEBLUR?
In this case I think the merging step is the limiting factor: it selects only the amplicons derived from short 16S, and therefore limits the number of final sequences actually clustered or corrected. It seems from your results that the poor merging performance is affecting deblur more than the classical clustering.
Have you tried DEBLUR with R1 only?
As a last option, you may try reference-based clustering, using R1 only, with VSEARCH and your reference.
In any case I would avoid any attempt at merging R1 and R2. Also, DADA2 + R1 looks quite clean to me; why would you not use it?

Best wishes,

Hi, @llenzi,

Thanks for your patience in helping me.
(1) To be honest, I do not quite understand why I should use only R1 for the analysis, since my Illumina library was sequenced in paired-end mode. Using only R1 in a publication would seem strange.
(2) Why does merging make a difference? I have tried merging with FLASH, q2-vsearch join-pairs, and MOTHUR. They all give a length distribution with most reads at 253 bp, followed by 252 and 254 bp.

I have attached the R1 deblur result, which is even more confusing. 100% of the reads are 151 bp, so I set the length to 151.

I am looking forward to hearing more suggestions.

Hi @Brandon,
I’ll give my point of view, but please remember that others may disagree.

I’ll start with your point (2). The 16S primers you are using produce an amplicon of roughly 300 bp. Of course this length varies depending on the species in the samples; it may be shorter but may also be a bit longer (where the limiting factor is the performance of the polymerase). Considering for the moment a length of 300 bp, as you rightly say, with 2x150 bp sequencing none of the tools you mention would be able to join R1 and R2 because there is no overlap between the two. I am not aware of any sensible method to join R1 and R2 in this case, and I would be suspicious if I found one: it would probably imply predicting a joining sequence, and I am not sure from where or of which length. Also, I would not know how to treat possible chimeric sequences, in case you have, within a pair, R1 from one 16S and R2 from a different one (not an impossible outcome of the PCR step). Not to mention that I try to remove any artificial sequence from the beginning of the analysis, and any such joining method would go in the opposite direction for me.

Considering shorter amplicons, you may get joined sequences (with the length distribution you are seeing), but where are they coming from? They may derive from real short 16S but also from PCR artefacts. You will end up with a pool of both and, if we trust the deblur results, in a ratio of about half and half.

In conclusion, my point is that by joining the reads you take the risk of getting a dataset enriched for short 16S sequences and for PCR artefacts, while losing longer 16S. Now, if you have any information suggesting that you expect species with short 16S in your samples, you may not find this a problem.
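
The selection effect described above can be sketched in a few lines of Python. The amplicon lengths are invented for illustration, and the 20 nt minimum overlap is the figure quoted earlier in the thread, not a value taken from any tool's documentation; the function name is hypothetical.

```python
def max_mergeable_length(read_len: int, min_overlap: int) -> int:
    """Longest amplicon whose two reads still overlap by min_overlap bases."""
    return 2 * read_len - min_overlap


# Hypothetical mix of short and long amplicons in one sample.
amplicon_lengths = [252, 253, 253, 254, 290, 295, 300]

limit = max_mergeable_length(read_len=151, min_overlap=20)
merged = [n for n in amplicon_lengths if n <= limit]
lost = [n for n in amplicon_lengths if n > limit]

print("limit:", limit)
print("merged:", merged)
print("lost:", lost)
```

Under these assumptions every amplicon longer than 282 bp drops out at the merging step, so whatever is clustered or denoised afterwards comes only from the shorter amplicons.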

On your point (1), I am not used to 2x150 bp sequences; however, I know there are other users who fairly commonly work with this read length (at least some using R1 only for the analysis). I understand it may be difficult to discuss this in a possible paper, but you would equally have to discuss the possible bias due to the non-overlapping reads. I suppose controls would be very useful in both cases.

On the difference between dada2 and deblur, given that these are different methods, I have no problem with the fact that they give slightly different results. For me, just pick one; you have plenty of references for both. Again, only a known community would help in the choice.

I hope I did not create more confusion.

Best wishes,


Hi, @llenzi
Thanks so much for your patient explanations along the way. You did not create any confusion for me; it helped me understand more.
Appreciate it!
Hope you have a great week!


