DADA2 and Deblur outputs are extremely different

Hello,

I am working with paired-end sequence data targeting the V4 region of the 16S rRNA gene. My feature tables are vastly different depending on whether I use Deblur or DADA2. The Deblur output looks more like what I expect, while the DADA2 output has a very important feature that I cannot see in Deblur...

Using DADA2 my feature table has one major feature containing ~800k reads. This feature represents mitochondrial DNA (a common contaminant clearly visible in my agarose gel). However, the other features are low in frequency and are not found across the majority of my samples. Even my bacterial standards are not showing up correctly.

Using Deblur my feature table has many features with good representation across my samples. My bacterial standards are there and I am able to classify bacteria from the silva database. However, my mitochondrial contamination, clearly visible in DADA2, is now absent. This doesn't make sense to me because I did not remove it from my reads.

Could someone help me understand what is going on?

Here are my steps for each method:
DADA2

  1. Create manifest file of forward and reverse reads
  2. Import reads into qiime 2 using the manifest file
  3. Use cutadapt to remove primers from forward and reverse reads
  4. Use DADA2 to create ASVs, trimming 0 bp from the start of the reads and truncating them at 130 bp

Deblur

  1. Use PEAR to join forward and reverse reads
  2. Use cutadapt to remove forward and reverse primers
  3. Create manifest file using the single end format
  4. Import reads into qiime2 using the manifest file
  5. Quality filter reads
  6. Create ASVs using Deblur, trimming all reads to 190 bp (I chose 190 bp to include the mitochondrial reads, whose lengths are ~204 bp)
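
To illustrate the reasoning behind the 190 bp choice, here is a minimal sketch of how a fixed trim length behaves, assuming the usual Deblur rule that reads shorter than the trim length are discarded and longer reads are cut down to it. The function name and the read lengths (other than the ~204 bp figure above) are made up for illustration.

```python
def deblur_style_trim(reads, trim_length):
    """Drop reads shorter than trim_length; truncate the rest to trim_length."""
    return [r[:trim_length] for r in reads if len(r) >= trim_length]

# Hypothetical joined reads: ~204 bp (mitochondrial), 253 bp and 150 bp.
reads = ["A" * 204, "C" * 253, "G" * 150]

kept_190 = deblur_style_trim(reads, 190)
kept_250 = deblur_style_trim(reads, 250)

print([len(r) for r in kept_190])  # [190, 190]
print([len(r) for r in kept_250])  # [250]
```

Under this rule, a longer trim length such as 250 would discard the ~204 bp reads entirely, so the trim length has to stay at or below the shortest amplicon you want to keep.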

Hello Stephan,

There are many differences between these denoising pipelines, so I'll highlight just one that could explain the difference in results.

Deblur is reference-based; DADA2 is de novo ('from nothing', i.e. no reference).

Deblur throws away that mitochondrial data because it's not in the reference (Greengenes 13_8 88% OTUs). You could argue this is a good thing.
DADA2 keeps everything, because while you consider the mitochondria to be 'contamination,' it really is in your samples. You could argue this is a good thing.

Joining and quality filtering are pretty different between these pipelines, but the use of a reference is a key difference.
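
As a rough illustration of that difference, here is a toy sketch of a reference-based "positive filter" next to a de novo approach that keeps everything. This is not Deblur's actual algorithm: the k-mer similarity measure, the 0.6 threshold, and all sequences are assumptions made up for the example.

```python
def kmers(seq, k=4):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmer_fraction(read, ref, k=4):
    """Fraction of the read's k-mers that also occur in the reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & kmers(ref, k)) / len(read_kmers)

def positive_filter(reads, references, min_fraction=0.6, k=4):
    """Keep only reads that resemble at least one reference sequence."""
    return [
        r for r in reads
        if any(shared_kmer_fraction(r, ref, k) >= min_fraction for ref in references)
    ]

references = ["ACGTACGTTAGCCGATAGCTAGGCT"]      # stand-in for a 16S reference set
bacterial_like = "ACGTACGTTAGCCGATAGCTAGGCA"    # one mismatch vs. the reference
host_like = "TTTTGGGGCCCCAAAATTTTGGGGC"         # unrelated, e.g. host mitochondria
reads = [bacterial_like, host_like]

reference_based = positive_filter(reads, references)  # drops the unrelated read
de_novo = list(reads)                                  # no reference: keeps both

print(len(reference_based), len(de_novo))  # 1 2
```

The point is only that a read absent from the reference can vanish from a reference-filtered table even though nobody removed it from the input.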

Thank you Colin!

That was exactly it. I ended up getting a good output by repeating the analysis in DADA2 and got it all to work this time. For whatever reason, joining reads was troublesome using DADA2 and qiime. By joining them first in PEAR, removing adapters using cutadapt, and then importing the joined reads into DADA2 I got it all to work.

Thanks again!
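
One possible reason joining was troublesome inside DADA2, sketched as simple arithmetic: if both reads are truncated at 130 bp, the overlap left for merging depends on the amplicon length. The 253 bp bacterial V4 length and the 12 bp minimum overlap are assumptions not stated in this thread, so treat this as a sanity check to run on your own numbers rather than a diagnosis.

```python
def expected_overlap(trunc_len_f, trunc_len_r, amplicon_len):
    """Overlap (bp) between truncated forward and reverse reads on one amplicon."""
    # A read cannot extend past the end of the amplicon itself.
    f = min(trunc_len_f, amplicon_len)
    r = min(trunc_len_r, amplicon_len)
    return f + r - amplicon_len

MIN_OVERLAP = 12  # assumption: minimum overlap needed for reads to merge

amplicons = [
    ("bacterial V4 (assumed ~253 bp)", 253),
    ("mitochondrial (~204 bp, from the thread)", 204),
]
for name, amplicon_len in amplicons:
    overlap = expected_overlap(130, 130, amplicon_len)
    status = "can merge" if overlap >= MIN_OVERLAP else "likely fails to merge"
    print(f"{name}: overlap {overlap} bp -> {status}")
```

With these assumed numbers the shorter mitochondrial amplicon overlaps comfortably while the longer bacterial amplicon does not, which would be consistent with one dominant mitochondrial feature and few bacterial ones.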

Hi @Stephan_Bitterwolf,

You could use the SILVA database as the reference, like so. It is broader and would not typically remove plastid or other sequences, in case you want to make more direct comparisons to DADA2. :notebook:

qiime deblur denoise-other \
  --i-demultiplexed-seqs seqs.qza \
  --i-reference-seqs silva_repseqs.qza \
  --p-jobs-to-start 8 \
  --p-trim-length 250 \
  --p-sample-stats \
  --o-representative-sequences deblur-repseqs.qza \
  --o-table deblur-table.qza \
  --o-stats deblur-stats.qza \
  --verbose

Here, silva_repseqs.qza can be the full-length sequence file available on the Data resources page.

This is how I use deblur. :slight_smile:

Hi @Stephan_Bitterwolf,

I just noticed your reply:

:warning: I'd avoid doing this (importing pre-joined reads into DADA2), as it violates the model assumptions of DADA2, particularly because the quality scores are altered in the region of overlap of the merged reads. I'd use Deblur with the SILVA database as the reference, as I noted in the preceding post. :warning:

Hi @SoilRotifer,

Thanks for your reply. I see the point about DADA2's assumptions and the modified quality scores. I will try the Deblur method you specified.

I searched for the mitochondrial sequence on SILVA and it mapped to Mitochondria with an identity score of 55. Do you know if this will be enough for Deblur to include it?

I am sure it will, as I always filter mitochondria and chloroplasts from my data, even when using deblur as described. :slight_smile:
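
For anyone wondering what that downstream filtering amounts to, here is a plain-Python sketch that drops features whose taxonomy label mentions mitochondria or chloroplast. In QIIME 2 this is normally done with a taxonomy-based table filter; the function name, feature IDs, counts, and labels below are all hypothetical.

```python
def drop_organelles(table, taxonomy, exclude=("mitochondria", "chloroplast")):
    """Return a copy of `table` without features whose taxonomy mentions an excluded term."""
    kept = {}
    for feature_id, counts in table.items():
        label = taxonomy.get(feature_id, "").lower()
        if not any(term in label for term in exclude):
            kept[feature_id] = counts
    return kept

# Hypothetical feature table (feature -> per-sample counts) and taxonomy labels.
table = {"asv1": [800000, 12], "asv2": [40, 55], "asv3": [7, 0]}
taxonomy = {
    "asv1": "d__Bacteria; p__Proteobacteria; o__Rickettsiales; f__Mitochondria",
    "asv2": "d__Bacteria; p__Firmicutes",
    "asv3": "d__Bacteria; p__Cyanobacteria; o__Chloroplast",
}

filtered = drop_organelles(table, taxonomy)
print(sorted(filtered))  # ['asv2']
```

Filtering after classification like this only works if the organelle reads survive denoising and receive a taxonomy label in the first place.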

Hi @SoilRotifer,

Unfortunately, this did not end up working for me. The full-length reference sequences from SILVA must not have contained the mitochondrial sequence that is found in my dataset: up to 98% of the reads in my data "missed the reference." I'm dealing with host mitochondria from Montipora capitata coral. Perhaps there is a way for me to add the missing sequence to SILVA? The sequence can be found on GenBank here.

Thanks for your guidance.

Yes there is. You should be able to follow one of these tutorials:

From here you should be able to run qiime feature-table merge-taxa ... and qiime feature-table merge-seqs ... to combine the GenBank and SILVA data.

Note that, prior to merging, you may have to run qiime rescript edit-taxonomy ... on the fetched NCBI taxonomy data to change the k__ prefixes to d__ so that they match SILVA. This assumes you are downloading the standard 7 ranks (kpcofgs); otherwise you'll have to do more editing.
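
To make the prefix edit and the merge concrete, here is a small plain-Python sketch of the idea. It does not call RESCRIPt or QIIME 2; the helper names, sequence IDs, and taxonomy strings are invented for illustration, and letting the first mapping win on duplicate IDs is an arbitrary choice.

```python
def to_silva_prefix(taxon):
    """Rewrite a leading kingdom prefix 'k__' as the SILVA-style 'd__'."""
    return "d__" + taxon[3:] if taxon.startswith("k__") else taxon

def merge_taxonomies(*taxonomies):
    """Merge ID -> taxon mappings; earlier mappings win on duplicate IDs."""
    merged = {}
    for tax in taxonomies:
        for seq_id, taxon in tax.items():
            merged.setdefault(seq_id, taxon)
    return merged

# Made-up IDs and labels standing in for the SILVA and NCBI/GenBank taxonomy.
silva = {"SILVA_0001": "d__Bacteria; p__Proteobacteria"}
ncbi = {"HOST_MITO_1": "k__Metazoa; p__Cnidaria; c__Anthozoa"}

ncbi_fixed = {seq_id: to_silva_prefix(t) for seq_id, t in ncbi.items()}
combined = merge_taxonomies(silva, ncbi_fixed)

for seq_id, taxon in combined.items():
    print(seq_id, taxon)
```

Only the leading rank prefix is rewritten here, so the lower ranks (p__, c__, and so on) are left untouched.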

I was able to add the sequence and it fixed my issue! Deblur no longer gets rid of 98% of the sequences. Thank you @SoilRotifer and @colinbrislawn.

Awesome @Stephan_Bitterwolf! :tada:
