Problem with Deblur and DADA2 for paired-end reads

Hi, Everyone,
Good day.
I have two questions related to "Deblur and DADA2":

  1. My DADA2 run has not finished successfully after almost 7 days. Could I ask what is wrong with my DADA2 script? And DADA2 includes a default paired-end joining step, right?
  2. For Deblur, I did not merge or join the paired-end sequences; it took about 30 hours, and the resulting table.qzv is listed below. Could you help me check whether this is the wrong way to denoise paired-end reads? Thank you.

#1. Import data
qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path pe-33-manifest \
  --output-path paired-end-demux.qza \
  --input-format PairedEndFastqManifestPhred33

The paired-end-demux.qzv file is shown below:


#2. Deblur & DADA2
########### denoise with Deblur #################
qiime quality-filter q-score \
  --i-demux paired-end-demux.qza \
  --o-filtered-sequences demux-filtered.qza \
  --o-filter-stats demux-filter-stats.qza

qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-filtered.qza \
  --p-trim-length 270 \
  --o-representative-sequences rep-seqs-deblur.qza \
  --o-table table-deblur.qza \
  --p-sample-stats \
  --o-stats deblur-stats.qza

qiime metadata tabulate \
  --m-input-file demux-filter-stats.qza \
  --o-visualization demux-filter-stats.qzv

qiime deblur visualize-stats \
  --i-deblur-stats deblur-stats.qza \
  --o-visualization deblur-stats.qzv
mv rep-seqs-deblur.qza rep-seqs.qza
mv table-deblur.qza table.qza
########### denoise with DADA2 #################
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trim-left-f 150 \
  --p-trim-left-r 55 \
  --p-trunc-len-r 200 \
  --p-trunc-len-f 270 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

# tabulate

qiime metadata tabulate \
  --m-input-file stats-dada2.qza \
  --o-visualization stats-dada2.qzv

# rename files

mv rep-seqs-dada2.qza rep-seqs.qza
mv table-dada2.qza table.qza

############ Table.qzv from Deblur method ###########

Hi @doudou2047,

The fun thing about the microbiome denoising space is that there are few truly “wrong” answers*.

DADA2 joins the reads for you. Since it's doing read joining, quality filtering, and chimera removal all in one step, it's not surprising to me that it takes a long time to run, especially when dealing with samples that have between 1M and 8M reads.

Deblur does not merge your reads, but there’s a nice tutorial about using paired end data with Deblur that I’ve used as a model for some of my recent projects.

I’m not sure how to interpret the comparisons between your sequences and denoised data. Is the concern sample loss, sequence loss, something else?

Best,
Justine

*With the perpetual caveat that you don't try to merge two tables denoised from sequences of different lengths or from different regions.


Hi, Justine,
Thank you so much for your information.
Since I did not join the paired-end reads before Deblur, it is hard to say whether the results in table.qzv correctly reflect the features.
It seems I need to repeat Deblur or keep the DADA2 run going.

I would like to double-check what you meant by "wrong" in your caveat (*With the perpetual caveat that you don't try to merge two tables denoised from sequences of different lengths or different regions). Does it mean that once the sequences have been denoised we should not merge the tables, or that we should join the paired ends first (is joining the same as merging?) and then denoise with Deblur or DADA2?

For now, if I choose Deblur, I need to follow the steps in "Alternative methods of read-joining in QIIME 2". My question is: the trim length here applies to the already-joined reads, right? What does --p-trim-length 250 mean?

qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-joined-filtered.qza \
  --p-trim-length 250 \
  --p-sample-stats \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-stats deblur-stats.qza

Thank you so much for your help and your patience.
Qiong

Hi Qiong,

First, you can 100% use just the forward reads. It's a less common approach, but there are some large groups who do this. So there's nothing wrong, per se, with the unpaired table. (Again, within a single study, very few things are objectively wrong.)

If you're denoising by DADA2, you do not join your ends.

If you are denoising with Deblur, you need to join them and quality filter before you run Deblur.

Yes, in this example, the merged sequences are being trimmed to 250 bp.
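
For reference, here is a minimal sketch of that order of operations, modeled on the read-joining tutorial; the input file names and the 250 bp trim length are placeholders from earlier in this thread, so pick a trim length from your own joined-read quality plot:

# Sketch: join, quality-filter, then denoise with Deblur (placeholder names).
qiime vsearch join-pairs \
  --i-demultiplexed-seqs demux.qza \
  --o-joined-sequences demux-joined.qza

qiime quality-filter q-score-joined \
  --i-demux demux-joined.qza \
  --o-filtered-sequences demux-joined-filtered.qza \
  --o-filter-stats demux-joined-filter-stats.qza

qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-joined-filtered.qza \
  --p-trim-length 250 \
  --p-sample-stats \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-stats deblur-stats.qza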

Best,
Justine

Hi @doudou2047,
Just to add on to @jwdebelius' excellent advice, I noticed that your DADA2 parameters are a bit extreme.

In this scenario you are trimming 150 bp from the 5' end of your forward reads, which actually look great to me. I would not trim anything from there at all, unless you are intentionally focusing on a very specific region of these reads? That seems odd and wasteful though. Your truncation parameters look OK, as long as you make sure you retain 20+ bp of overlap for merging in DADA2. See other posts on the forum on how this is calculated.
In both Deblur and DADA2, these runs are taking a very long time because you are not using their multi-threading options and are dealing with over 80 million reads.
You may want to look at the --p-n-threads INTEGER parameter in DADA2 and --p-jobs-to-start INTEGER in Deblur to run these in parallel (see the sketch below). It should significantly decrease run time so they don't take weeks!
And lastly, as @jwdebelius mentioned, for DADA2 you want to make sure this is not a combination of multiple runs, because you should run each run separately. Deblur doesn't have this requirement.
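
As a concrete sketch of where those parallel flags go (the value 36 and all other parameters here are placeholders borrowed from this thread, not recommendations for your data):

# DADA2 in parallel: add --p-n-threads to denoise-paired.
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trunc-len-f 270 \
  --p-trunc-len-r 200 \
  --p-n-threads 36 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

# Deblur in parallel: add --p-jobs-to-start to denoise-16S in the same way.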


Exactly! Thank you so much for your suggestions, @jwdebelius and @Mehrbod_Estaki, and thank you for your patience.
I understand the problem now. At first I had a misunderstanding about trim and trunc, so I think I set the trim sizes the wrong way. Based on the paired-end-demux.qzv file I attached before and your suggestions, I changed the parameters as follows.
Could I also ask two related questions?

  1. For the DADA2 method, should I keep the same truncation length in each direction (forward & reverse), e.g. setting both --p-trunc-len-f and --p-trunc-len-r to 290 bp? Or, as you mentioned, just make sure to retain 20+ bp of overlap?
#### denoise with DADA2 ####

time qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-r 290 \
  --p-trunc-len-f 290 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza

# tabulate

time qiime metadata tabulate \
  --m-input-file stats-dada2.qza \
  --o-visualization stats-dada2.qzv

# rename files

mv rep-seqs-dada2.qza rep-seqs.qza
mv table-dada2.qza table.qza

And yesterday I reran the workflow following "Alternative methods of read-joining in QIIME 2"; the interactive quality plot from demux-joined.qzv is shown below.


Could you help me check whether I should set --p-trim-length to 460 or 550?
My 16S amplicon is 464 bp (V3-V4), so I am confused by joined reads of 567 bp.
(P.S. I will use --p-jobs-to-start from now on. Thank you :)
qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-joined-filtered.qza \
  --p-trim-length 460 \
  --p-sample-stats \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-stats deblur-stats.qza \
  --p-jobs-to-start 36

Thank you so much for your help.:smiling_face_with_three_hearts:

Hi @doudou2047,
Glad things are coming along!

They certainly don't need to be the same length. The ideal situation is that you trim as much of the poor-quality tails as possible, from a combination of both reads, while still maintaining 20+ bp of overlap for proper merging.

Regarding your quality plot of the joined reads, I'm not really sure what is happening, to be honest. Where you have pointed the arrows looks very odd and unnatural to me :thinking:, let's see what others think is happening with that!
If your expected amplicon size is ~460 bp, I would certainly think that longer portion is not real biological reads, so in your Deblur settings I would certainly trim it off. In fact, when using joined paired-end reads with Deblur it's not a bad idea to trim to a bit less than 460 bp, in case you have some naturally occurring reads that are shorter. Maybe set the trim to 440 bp to be safe?
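
For example, here is a sketch of your call above with the more conservative 440 bp trim; everything else is unchanged, and the 36 jobs value is just the one you mentioned using:

# Same Deblur call as above, but trimming slightly below the expected amplicon length.
qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-joined-filtered.qza \
  --p-trim-length 440 \
  --p-sample-stats \
  --p-jobs-to-start 36 \
  --o-representative-sequences rep-seqs.qza \
  --o-table table.qza \
  --o-stats deblur-stats.qza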


Hi @Mehrbod_Estaki,
Thank you so much for your information.

May I summarize what I have so far? I think I am still a little confused, even though I checked another topic on "Deblur and DADA2".
I have the same trouble with setting the trim and truncation lengths for both Deblur and DADA2.
Using all the information from this topic, I tested both methods on my current project.
I finished the analysis pipeline following the Moving Pictures tutorial.

The feature tables show 711 vs. 80 features for DADA2 and Deblur, respectively, about a 9x difference.
The taxonomy results from Deblur and DADA2 (with the parameters I set) also differ at the class level: 10 vs. 15 classes.

My questions are:

  1. When viewing the data, we are told to look for the point in the forward and reverse reads where quality scores drop below 25-30 and to truncate the reads at that point to create high-quality sequence variants. So what should the trim parameter be, and what should the truncation length be? (Sorry, I think I need to ask from the very beginning to figure out why different length choices give such different results.)

  2. How do I reconcile the differences between the two methods? Do both need to be rerun? Which one should I use to represent the whole microbiome?

######For Deblur########

#!/bin/bash

time qiime vsearch join-pairs \
  --i-demultiplexed-seqs demux.qza \
  --o-joined-sequences demux-joined.qza

time qiime demux summarize \
  --i-data demux-joined.qza \
  --o-visualization demux-joined.qzv

time qiime quality-filter q-score-joined \
  --i-demux demux-joined.qza \
  --o-filtered-sequences demux-joined-filtered.qza \
  --o-filter-stats demux-joined-filter-stats.qza

## Based on the file demux-joined.qzv, I chose the parameter 427.

Screenshot of demux-joined.qzv below:

#!/bin/bash
time qiime deblur denoise-16S \
  --i-demultiplexed-seqs demux-joined-filtered.qza \
  --p-trim-length 427 \
  --o-representative-sequences rep-seqs-deblur.qza \
  --o-table table-deblur.qza \
  --p-sample-stats \
  --o-stats deblur-stats.qza \
  --p-jobs-to-start 36

Screenshot of table-deblur.qzv below:

Screenshot of taxa-bar-plots.qzv below (level 3): 10 classes

######For DADA2########
## import the data in the same way
time qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path pe-33-manifest \
  --output-path paired-end-demux.qza \
  --input-format PairedEndFastqManifestPhred33

time qiime demux summarize \
  --i-data paired-end-demux.qza \
  --o-visualization paired-end-demux.qzv

## Based on paired-end-demux.qzv I chose the trim and truncation lengths.

Screenshot of paired-end-demux.qzv below:

# denoise with DADA2
time qiime dada2 denoise-paired \
  --i-demultiplexed-seqs paired-end-demux.qza \
  --p-trim-left-f 0 \
  --p-trim-left-r 0 \
  --p-trunc-len-r 171 \
  --p-trunc-len-f 295 \
  --o-representative-sequences rep-seqs-dada2.qza \
  --o-table table-dada2.qza \
  --o-denoising-stats stats-dada2.qza \
  --p-n-threads 40

# feature tables
time qiime feature-table summarize \
  --i-table table.qza \
  --o-visualization table.qzv \
  --m-sample-metadata-file sample-metadata.tsv

time qiime feature-table tabulate-seqs \
  --i-data rep-seqs.qza \
  --o-visualization rep-seqs.qzv

Screenshot of table-dada2.qzv below: 711 features :scream:

Screenshot of taxa-bar-plots.qzv below (level 3): 15 classes?

Thank you so much for your patience and your help.:smiling_face_with_three_hearts:


Hi @doudou2047,
Some excellent questions and observations!

The 25-30 median q-score is a good and safe first guideline, but every dataset is going to be different, and we just have to play around to find what works best for our particular set. For example, if you have 150 bp of overlap in your reads and need a minimum 20 bp overlap for merging, you have a total of ~130 bp to truncate from the forward reads, the reverse reads, or a combination of both. In most datasets you would truncate from both, but probably more from the reverse reads since they tend to have lower quality in general. So in your case, even though the median scores look good based on the 25-30 guideline, I would still truncate more, up to 110-120 bp in total. What this should do, in theory, is allow more reads to survive the filtering step and ultimately include more reads for denoising. This is because longer reads carry more poor-quality bases on the 3' end, and DADA2's filtering will drop a read completely when it doesn't meet the set filtering parameters; if the read were a bit shorter and didn't have as many poor-quality bases, it would actually be retained. It sounds a bit counter-intuitive, but ultimately the more of the poor-quality tail you can truncate without risking a failed merge, the more reads you'll end up with.
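
To make that arithmetic concrete, here is a small back-of-the-envelope sketch; the 2x300 bp read length and the ~464 bp V3-V4 amplicon are assumptions taken from earlier in this thread, so plug in your own numbers:

# Truncation budget (assumed: 2x300 bp reads, ~464 bp amplicon, 20 bp min overlap).
read_len=300
amplicon=464
min_overlap=20

overlap=$((2 * read_len - amplicon))   # ~136 bp of overlap between the two reads
budget=$((overlap - min_overlap))      # ~116 bp you can truncate in total
echo "Combined 3' truncation budget: ${budget} bp"
# e.g. --p-trunc-len-f 270 and --p-trunc-len-r 230 removes (300-270)+(300-230)
# = 100 bp, which stays inside that budget.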

This is a tough one too. It's very difficult to make direct comparisons between these methods in this case, but there is a good benchmark paper that might help explain the differences. Some examples of differences: Deblur's static error model tends to get much more conservative as the read lengths get longer, while DADA2 creates an error model from your own reads, so it depends on the quality of your data. Usually with longer reads (i.e., with paired-end data) I find DADA2 retains more reads. Deblur uses a positive filter (Greengenes by default) to filter out reads, while DADA2 doesn't include this step. DADA2 has its own merging function, while you are using a third-party tool to merge prior to Deblur, so it's hard to compare those two merging steps as well. Different pre-filtering parameters can add more variability between the two methods. But none of those differences should, on their own, produce results as drastically different as what you have observed. The main reason we see such differences here is that Deblur retained a total of ~300,000 reads, while with DADA2 you've retained almost 10 million reads! So your bar plots are built from very different sampling pools. In other words, your DADA2 results are rooted in much more information, so at this point I would just move forward with that.
Hope that helps!


Thank you so much for the information, @Mehrbod_Estaki.
I learned a lot from your replies. Thank you.
However, I have completed several projects entirely with the Deblur pipeline, without comparing to DADA2. I am worried that the results I generated before hardly illustrate the whole scope of the microbiome. :tired_face:
I am afraid I need to rerun all of those projects based on your suggestions.
Thank you and Merry Christmas:christmas_tree:.


Hi @doudou2047,
I understand, it can be very frustrating, but I would say your worry is valid. Though it really depends on the nature of the data: if you were comparing outcomes with vastly different numbers of reads (such as your example above), I would say it is certainly worth re-analyzing those datasets, for sanity's sake. If they were more similar in depth, then I would expect the outcomes to be similar as well. Hopefully this doesn't set you back too much. Good luck!

