I am working with paired sequence data targeting the V4 region of 16s. I am having an issue in that my feature tables are vastly different when using deblur or dada2 methods. The deblur output seems more like what I am expecting, while the dada2 has a very important feature that I cannot see in deblur...
Using DADA2 my feature table has one major feature containing ~800k reads. This feature represents mitochondrial DNA (a common contaminant clearly visible in my agarose gel). However, the other features are low in frequency and are not found across the majority of my samples. Even my bacterial standards are not showing up correctly.
Using Deblur my feature table has many features with good representation across my samples. My bacterial standards are there and I am able to classify bacteria from the silva database. However, my mitochondrial contamination, clearly visible in DADA2, is now absent. This doesn't make sense to me because I did not remove it from my reads.
There are many differences between these denoising pipelines, so I'll highlight just one that could explain the difference is results.
Deblur is reference based, dada2 is de novo ('from nothing' == no reference).
Deblur throws away that mitochondrial data because it's not in the reference (Greengenes 13_8 88% OTUs). You could argue this is a good thing.
DADA2 keeps everything, because while you consider the mitochondria to be 'contamination,' it really is in your samples. You could argue this is a good thing.
Joining and quality filtering is pretty different between these pipelines, but the use of a reference is a key difference.
That was exactly it. I ended up getting a good output by repeating the analysis in DADA2 and got it all to work this time. For whatever reason, joining reads was troublesome using DADA2 and qiime. By joining them first in PEAR, removing adapters using cutadapt, and then importing the joined reads into DADA2 I got it all to work.
You could use the SILVA database as a reference like so. It'll be more broad, and would not typically remove plastid or other sequences. Just in case you wanted to try making more direct comparisons to DADA2 etc....
I'd avoid doing this as you are violating the model assumptions of DADA2! Particularly with altered quality scores in the region of overlap of the merged reads. I'd use deblur with the SILVA db as the reference as I noted in the preceding post.
Thanks for your reply. I see about the assumption due to the quality scores being modified. I will try the deblur method you specified.
I searched for the mitochondrial sequence on SILVA and it mapped to Mitochondria with an identity score of 55. Do you know if this will be enough for Deblur to include it?
Unfortunately, this did not end up working for me. The full-length reference sequences from SILVA must not have contained the mitochondrial sequence that is found in my dataset: up to 98% of the reads in my data "missed the reference." I'm dealing with host mitochondria from Montipora capitata coral. Perhaps there is a way for me to add the missing sequence to SILVA? The sequence can be found on GenBank here.
Yes there is. You should be able to follow one of these tutorials:
From here you should be able to run qiime feature-table merge-taxa... and qiime feature-table merge-seqs ... to combine the GenBank and SILVA data.
Note that, prior to merging, there is a chance you'd have to run qiime rescript edit-taxonomy ... on the fetched NCBI taxonomy data to change the k__ prefixes to d__, so that it matches SILVA. Assuming you are downloading the standard 7 ranks (kpcofgs). Otherwise you'll have to do more editing.