Different taxa result from DADA2 and Deblur

Setiawan · March 11, 2018, 3:27pm

Hello, I have conduct analysis using DADA2 from demultiplexed 2x 300 bp paired end sequence. My demultiplexed sequence are like this:

the command I use for DADA2 are same with this tutorial with parameters:
--p-trim-left-f 17 --p-trim-right-f 21 --p-trunc-len-f 250 --p-trunc-len-r 205 --p-n-threads 0 --verbose
the results that I get from that analysis are:

>                             input filtered denoised merged non-chimeric
> KT1_0_L001_R1_001.fastq.gz 195005   131324   131324    517          516
> KT2_2_L001_R1_001.fastq.gz 212028   142539   142539    371          371
> KT3_4_L001_R1_001.fastq.gz 232133   158605   158605    551          551

Then I followed up to taxonomic analysis using pre-trained classifier (greengenes 97%) and the result gives an archaea as dominant taxa:

Then I compare my analysis using Deblur following this tutorial with parameter attached
join-deblur.txt (900 Bytes)
and the results are (I think the number of sequences survived from both analysis are quite similar):

Then I followed up to taxonomic analysis and give very different result from DADA2:

The different results are confuse me up. What was happened to my results? Why it gave different major taxa between deblur and dada2?

Thank you

Nicholas_Bokulich · March 12, 2018, 2:18pm

Hi @Setiawan,
What primers/marker gene are you using? What is your expected amplicon length?

It looks like you are losing a massive amount of sequences, and the yields are (I would say) unacceptably low. From the dada2 log, it looks like the merging step is where this is failing. In deblur, it looks like maybe the trim length is the problem; you use 400, in which case any reads < 400 nt will be dropped. Is that intended, based on your amplicon length?

I would recommend adjusting your parameters to ensure that more reads are merging and passing the length filter. What happens if you process just the forward reads as single-end reads? (that could help triangulate this issue and tell us whether there is something wrong with these reads, e.g., lots of artifact/non-target DNA, or if this really is a merging/length issue)

I disagree, it looks like deblur recovers approximately 3X as many sequences as dada2, accounting for the Archaea that are observed with deblur but not dada2. With such low sequence counts, though, I am worried about the results from both.

I hope that helps!

Setiawan · March 13, 2018, 1:30am

Hi @Nicholas_Bokulich

I use V3-V4 primer, with expected amplicon length about 420 nt.

Yes, because the V3-V4 region is about 420 nt, so I trim my sequence at 400 nt. Following that, is it acceptable if I lower my trimming length in deblur for that region?

In deblur, I don't trim the primer from the sequence. How to solve this primer trim issue in deblur?

I will try this.

Archaea are observed in dada2 are very high (taxa-bar-plot) but not in deblur (taxa-bar-plot)

Nicholas_Bokulich · March 13, 2018, 1:47pm

That makes sense. Just making sure! Still, would be good to know how this performs when running the single-end data. For some reason merging reads appears to be failing in dada2. What primers are you using?

It looks like your parameter settings should leave enough overlap to permit merging. The sequences are being dropped at the merge step, not denoising, so your truncation parameters do not appear to be the issue. @benjjneb do you have any thoughts on what's happening here?

Yes. As a matter of fact, this may be beneficial because the quality-filter step before deblur is going to trim sequences at different sites, wherever the quality drops below a certain threshold. So deblur's trimming step is going to do two things: drop any sequences shorter than that length (unless if you turn off trimming) and trim sequences longer than that length. Hence, depending on how much your sequences are being pre-trimmed in quality-filter you may want to set a lower deblur trimming length than what you are using in dada2.

You can use q2-cutadapt to trim out primers. See a tutorial here.

Oops, looks like I misread what you wrote above. Thanks for clarifying!

Still, the different taxonomic results here do not concern me — because it seems that these data are being over-filtered already, so the results from either analysis is not representative of the whole. If you get wildly different results after improving sequence yields, then I think we have something to worry about!

Let me know how it goes!

benjjneb · March 13, 2018, 2:05pm

The protocol includes the primer pair sequences for the V3 and V4 region that create a single amplicon of approximately ~460 bp.

From the Illumina 16S Sample Prep guide, assuming that is the protocol being used.

It's 460 rather than 420 because this library prep includes the primers on the reads. So, there is 420nts of biological sequence, but 460nts of sequenced amplicon. Right now, everything except a crazy short archaea 16S is being lost in merging due to lack of overlap.

You need to increase your trunc-len to keep at least 460+20+BIOLOGICAL_LENGTH_VARIATION total bases. 485 or up would be a good target.

Setiawan · March 18, 2018, 10:12am

Hi @Nicholas_Bokulich and @benjjneb,

I already re-analysis my data using your suggestion.

I followed your suggestion by increasing the trunc-len to 250:

qiime dada2 denoise-paired --i-demultiplexed-seqs paired-end-demux.qza --output-dir dada2-250 --p-trim-left-f 17 --p-trim-left-r 21 --p-trunc-len-f 250 --p-trunc-len-r 250 --p-n-threads 0 --verbose

and got better result than previous analysis using DADA2:

                        input filtered denoised merged non-chimeric
KT1_0_L001_R1_001.fastq.gz 195005 80638 80638 30795 21510
KT2_2_L001_R1_001.fastq.gz 212028 82058 82058 31708 22226
KT3_4_L001_R1_001.fastq.gz 232133 94157 94157 41039 29853

and the taxa bar plot also improving (dada2-taxa-bar-plot)

I have another question, why the sequence improving very much? Previously, I set my trunc-len based on quality of forward and reverse reads, so I set my --p-trunc-len-r 205 (demux.qzv) and got very low sequences.

I already re-analyze my sequence with deblur using --p-trim-length 380, but there is no improving in sequences recovered. I also used default parameter for quality-filter q-score-joined and the result is not improving too (deblur stats). I think my parameter with vsearch join-pairs --p-maxdiffs 2 --p-minovlen 15 --p-minmergelen 300 are not too strict because I still got >18000 raw reads before deblur analysis. Do you have any idea? or my trim-length still too high?

Thank you

benjjneb · March 19, 2018, 1:35pm

Because when you set --p-trunc-len-f 250 --p-trunc-len-r 205 your reads didn't overlap, hence essentially all wer were lost in merging, but but now they do. Schematic...

FFFFFFF(250)FFFFFFFF
                      RRRRR(205)RRRRR

FFFFFFF(250)FFFFFFFF
                 RRRRRR(250)RRRRRRRRR

Setiawan · March 19, 2018, 1:49pm

Thank you for the answer
So when using DADA2 we don't need to set the trunc-len based on demux.qzv but only by the target length.

benjjneb · March 19, 2018, 2:51pm

Not quite. You should trim off post-quality-crash read tails as indicated by demux.qzv, while respecting the constraint that the truncated reads must still sufficiently overlap to be merged.

system · April 19, 2018, 8:51pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.