DADA2 Merge problems

Hi
I have a problem with DADA2 (version 2019.4.0) in QIIME2 (qiime2-v2019.04) in conda: the merge results are poor. The region is V3+V4, and the primers are 341F-806R. The primers are still attached to the sequences. I took 3 samples for a test. The original demux.qzv file is attached.
I used the following script in QIIME2:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trim-left-f 20 \
  --p-trim-left-r 20 \
  --p-trunc-len-f 286 \
  --p-trunc-len-r 223 \
  --o-table table.qza \
  --o-representative-sequences rep-seqs.qza \
  --o-denoising-stats denoising-stats.qza

The denoising-stats results:
F+R                      input    filtered  denoised  merged   non-chimeric  recovery (%)
#q2:types                numeric  numeric   numeric   numeric  numeric
Shlomit001-Sh-Cn-0T-S1   40,340   33,272    26,315    14,905   13,547        34
Shlomit002-Sh-Cn-0T-S2   51,994   42,342    34,953    21,235   19,402        37
Shlomit003-Sh-Cn-0T-S3   45,586   37,210    31,013    18,938   17,263        38

Sequence length 400-425 bp.

Working with longer sequences (--p-trunc-len-f 300 --p-trunc-len-r 289) gave worse results.
I tried to use DADA2 in R for further possibilities, but it did not help.
I also ran DADA2 on the forward files only (--p-trunc-len 286). It doubled the recovery, but the sequences were shorter: 266 bp.

Are these results acceptable?
Is the sequencing OK?
What can I do to improve the results?

demux.qzv (293.2 KB)

Hi Shlomit,

Welcome to the QIIME2 community.

The denoising stats look fine except for the chimera removal step, which resulted in the loss of most reads. What are the lengths of your forward and reverse primers? Are they both 20 bases? The commonly used V3-V4 forward and reverse primers are 17 and 21 bases long, respectively.

Cheers, Yanxian

Hi
In the stats table, the numbers for the last sample are: input 45,586, filtered 37,210, denoised 31,013, merged 18,938, non-chimeric 17,263. The recovery percentage, as I calculated it, is 38%. So it looks like the problem is the “merge” step and not “chimeras”. The same pattern appears in the other 2 samples.
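
To make the per-step losses explicit, here is a minimal Python sketch that recomputes the recovery from the counts in the table above. The function names (`step_retention`, `overall_recovery`, `worst_step`) are made up for illustration and are not part of QIIME2 or DADA2.

```python
# Read counts copied from the denoising-stats table in this thread.
STEPS = ["input", "filtered", "denoised", "merged", "non-chimeric"]
STATS = {
    "Shlomit001-Sh-Cn-0T-S1": [40340, 33272, 26315, 14905, 13547],
    "Shlomit002-Sh-Cn-0T-S2": [51994, 42342, 34953, 21235, 19402],
    "Shlomit003-Sh-Cn-0T-S3": [45586, 37210, 31013, 18938, 17263],
}


def step_retention(counts):
    """Fraction of reads kept at each step, relative to the previous step."""
    return {STEPS[i]: counts[i] / counts[i - 1] for i in range(1, len(counts))}


def overall_recovery(counts):
    """Fraction of input reads that survive to the non-chimeric step."""
    return counts[-1] / counts[0]


def worst_step(counts):
    """Name of the step that keeps the smallest fraction of its incoming reads."""
    retention = step_retention(counts)
    return min(retention, key=retention.get)


if __name__ == "__main__":
    for sample, counts in STATS.items():
        kept = ", ".join(
            f"{step}: {frac:.0%}" for step, frac in step_retention(counts).items()
        )
        print(
            f"{sample}: overall {overall_recovery(counts):.0%} ({kept}); "
            f"largest loss at '{worst_step(counts)}'"
        )
```

For all three samples the step that keeps the smallest share of its incoming reads is the merge (roughly 57-61% kept), while chimera removal keeps about 91%.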

The primer lengths are probably as you said, but I was told that is not very important. Is it?

Hi Shlomit,

Sorry about the confusion. I thought the number in bold was the sequence count after chimera removal. You’re right that you lost quite a few reads during merging. As mismatches between the forward and reverse reads are not allowed after denoising, using longer forward/reverse truncation lengths may give worse results. Your reads are of high quality. Based on the quality plot, maybe you can try --p-trunc-len-f 290 with --p-trunc-len-r 261 or 287 (median quality score 30 or 20, respectively)? After trimming, the forward and reverse reads will still have a combined length of ~510/~536 bp, which should be enough to span your 400-425 bp amplicon.
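
As a rough check on how much overlap a given pair of truncation lengths leaves, here is a small Python sketch of the arithmetic. It assumes that the reported 400-425 bp is the merged length after the 20-base trims at each 5' end, and that the minimum overlap is 12 bp (the default `minOverlap` of DADA2's `mergePairs`; whether this exact value applies to this QIIME2 run is an assumption). The helper names are hypothetical.

```python
MIN_OVERLAP = 12  # DADA2 mergePairs default; assumed to apply to this run


def retained_length(trunc_len, trim_left):
    """Bases left in a read after truncating at trunc_len and trimming trim_left."""
    return trunc_len - trim_left


def overlap(trunc_f, trunc_r, trim_f, trim_r, merged_len):
    """Expected overlap (bp) for a merged sequence of merged_len bp.

    merged_len is taken as the amplicon length after the 5' trims, so the two
    retained reads together must cover it; whatever is left over is overlap.
    """
    return (
        retained_length(trunc_f, trim_f)
        + retained_length(trunc_r, trim_r)
        - merged_len
    )


if __name__ == "__main__":
    # Truncation settings mentioned in this thread, all with 20-base 5' trims.
    settings = [(286, 223), (290, 261), (290, 287), (300, 289)]
    for trunc_f, trunc_r in settings:
        combined = retained_length(trunc_f, 20) + retained_length(trunc_r, 20)
        shortest = overlap(trunc_f, trunc_r, 20, 20, merged_len=425)
        longest = overlap(trunc_f, trunc_r, 20, 20, merged_len=400)
        verdict = "OK" if shortest >= MIN_OVERLAP else "too short"
        print(
            f"trunc-len {trunc_f}/{trunc_r}: combined {combined} bp, "
            f"overlap {shortest}-{longest} bp ({verdict})"
        )
```

Under these assumptions the original 286/223 setting already leaves 44-69 bp of overlap, so overlap length alone is unlikely to explain the loss; mismatches within the overlap are the more plausible cause.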

In my experience, most samples retain more than 50% of reads after denoising with the DADA2 pipeline. But if samples have really bad read quality towards the 3’ end, you may see a huge loss of reads during merging because of mismatches or insufficient overlap. Below are DADA2 stats from one of our studies for your reference, but they may differ greatly between studies.

For the primer length, I assume it’s best to trim off the exact lengths, as the presence of primer sequence may affect chimera detection. The new QIIME2 taxonomic assignment method uses machine learning and trains the model on amplicons without primer sequences, so I guess it’s best to trim off exactly the primer sequence. But I may be wrong; the QIIME2 developers may have better answers.

P.S.
I’m not sure if 3 samples are enough for DADA2 to learn the error rates. Maybe also try a larger number of test samples, say 6?

Cheers, Yanxian

Hi Shlomit, I was just troubleshooting samples for another group, and this was roughly the recovery I got after DADA2. I think it’s acceptable, but here’s what I would do:

  1. Check the quality of the forward and reverse reads
  2. Are these degenerate primers?

I found that the quality of the overlap between the forward and reverse reads is likely the reason for many of these issues.

Please see this link, where I was troubleshooting DADA2 on the V3V4 region:

After reviewing, I think I had significant loss as well. Ben
