Merging outside QIIME2: results for DADA2 and deblur

llenzi · March 16, 2018, 4:19pm

Hi all,
first of all, thanks for QIIME2 is a great tool!

I would like to ask your opinion if it is possible, on my experience I'm going to describe on the light of the following topics:

I'm in the process to test different protocols to process my sequences, in particular I'm using 2x250bp V4 region, on MiSeq (Zymo moch community) and HiSeq 2x300bp V3-V4 region.

My experience with the 2x300bp merged with DADA2, is pretty similar at the one described in one of the post I linked, with a final set of about 500 sequences per sample (merging percentage close to 0!). However if I merge the same set with PEAR I got a merge percentage around 97% (rising to almost 99% if not excluding sequences containing Ns).

My question to you is, what is your experience on merging the 2x300bp within DADA2? I'm trying to understand if is me, still not quite right on how to set the trimming lengths on DADA2.

I tested these two pipelines with the 2x250bp mock community, and seems pairing outside DADA2 gives in my hand really good results (either using DADA2 or deblur to denoise). All the expected species are there (including listeria detected as species level if I use rdp instead sk-learn, usually detected only at order level).
Pairing within DADA2, gives quite skewed results compared with the expected.
To show the difference, is there a way to pass you the picture in private, so you can see what I mean, please?

In your latest tutorial on how import pre merged reads, you suggest NOT using DADA2. Still there is a single sequence mode. Can you explain further why DADA2 is not suited for that, please?

Hope it makes sense, let me know if you need more details.

Thanks for your help

Nicholas_Bokulich · March 16, 2018, 4:34pm

This is indeed an issue with how you are performing your trimming. The low merge yield is a strong indicator.

PEAR is performing the merge prior to quality trimming, and then may be assigning an arbitrary Q value to the overlapping bases (I am not familiar with PEAR but other read joiners do this).

You have two routes:

if you want to use dada2, you will need to adjust your trimming parameters. See this post, which is reporting a very similar problem. If that does not help you solve the problem on your own, please report back with the same data (quality score plots, exact commands used) and answers to questions in that thread, and we can help solve this.
If you want to use deblur, you can join reads prior to denoising (either using PEAR and then importing, or using q2-vsearch to perform similar read joining from within QIIME2). This may be the easier approach, since you already have joined reads and would not need to fuss over trimming parameters.

This is probably an issue with your trimming parameters, as V3-V4 should be joinable with dada2, provided read quality is good enough to support overlap. We get lots of questions about trimming lengths on this forum, and it all depends on the primers, expected amplicon length, and read quality. See above for how to diagnose this or get support. You certainly should not be losing close to 100% of your reads.

That is expected, given that you are losing close to 100% of reads and so this is skewing the read profile.

Yes, you can send a direct message to forum moderators (e.g., me). But I understand what you mean — and linked to a very similar post above — so you do not need to send. I am in full agreement: the dada2 results need to be re-done because your first analysis had inadequate trimming parameters.

Joining reads disrupts the Q score profile, which dada2 uses to predict read errors. Many read joiners assign arbitrary Q scores to overlapping bases, which confuses dada2. So you need to use dada2 on the unjoined paired reads and dada2 will join these reads after denoising forward/reverse reads separately. You must adjust trimming parameters to make sure there is at least 20 bp of overlap between forward and reverse reads.

I hope that helps!

llenzi · March 16, 2018, 5:23pm

Hi Nicholas,
many thanks for your help.
I'm now focusing only on the 2x250bp reads for the mock community, I have the Zymo sequenced in different PCR conditions. So I expect some differences.
The quality of the sequences is the following:

I'm using the following settings for DADA2:
qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --p-trunc-len-f 220 --p-trunc-len-r 220 --p-no-hashed-feature-ids --p-trunc-q 25 --p-n-threads 0 --o-representative-sequences rep-seqs-dada2.qza --o-table table-dada2.qza

Should I go more strict and trimming at 160bp for R1 and 200bp for R2? What you think?

(was actually QIIME2 released on November last year to be complete)
The taxonomy obtained with gg and sk-learn:

If I merge with PEAR and denoise with deblur (as would be in your tutorial), I obtain the following taxonomy:

If I use DADA2 instead of deblur I got (although I did the assignment with rdp):

So, the point is clear: the merging step affect a lot the result.

My final goal is to script all the procedure, so merging outside DADA2 is quite handy. But first of all Id like to understand the differences and the correct way to do the analysis. That because the merging with PEAR seems giving really good results (at least in the context of the Zymo community but this is another story ... )

Going further to your clarification on dada2 (much clearer now thanks!), why deblur should not be affected by this altered Q profile?

Many thanks
Luca

Nicholas_Bokulich · March 16, 2018, 5:40pm

Thanks for providing additional data and plots, @llenzi,

I would remove the --p-trunc-q parameter and see if that impacts results.

Also run dada2 with the --verbose flag, so that the denoising summary is printed out (maybe you are already doing this but it's not in that command — this would be useful information for us to diagnose). Could you share those logs?

Your read qualities look pretty good, and those trim lengths seem reasonable, so it is a bit mysterious that dada2 is not working. For V4 (515f/806r primers, at least), 220 + 220 nt should be adequate.

V4 is short enough that just using the 250 bp forward reads should nearly cover it. That is another option to compare — just use the forward reads and discard the reverse.

It sounds like your reads are being dropped at the merge step (we would know for sure if you show us the log printed by dada2 with the --verbose flag on), not at the denoising step. If reads were being dropped at denoising, it would help to trim further in case there are low-quality segments near the 3' end. If these are being dropped at the merge stage (the log would tell us for sure), then that is suggesting that the reads are not long enough! But that does not make sense for V4 with these trimming parameters.

One more idea: are primer sequences trimmed from these reads? This should always be done for dada2/deblur and in dada2 can be done with the --p-trim-left parameters. For deblur, you can use q2-cutadapt to trim primers.

I agree, it sounds like joining reads and then using deblur is better for you. Note that you can join reads with qiime vsearch join-reads, if you wanted to keep it all in QIIME2 (thus preserving that information in provenance). I do not know the technical differences between PEAR and vsearch for read joining, though.

deblur does not use Q scores. It uses a static error model to predict read errors. You will need to see the original publication for more details.

So it looks like RDP and QIIME2's sklearn classifier give similar results for all taxa except for Listeria. It is not entirely clear, because you show genus-level results for the first two plots, then species-level with RDP. But Listeria definitely gets better classification results with RDP, which is interesting because these are two very similar methods (both naive Bayes classifiers). You may want to adjust the confidence parameter to see if this improves your results, since you have mock communities to validate your results. The default is a good average across many different sequencing lengths/conditions but since you have a mock community you can re-tune this for your sequencing runs. Notably, QIIME2's default confidence == 0.7, whereas RDP's recommendation for short reads is 0.5, 0.8 for longer reads.

See this thread for a little more discussion — I think they are using the same mock community that you are.

Good luck!

system · April 16, 2018, 11:40pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.