Is the similarity in my data a DADA2 settings artifact, or contamination?

Dear all,

I ran an analysis pipeline following the "Moving Pictures" tutorial from start to end on gut samples of fish that went through a sterilisation process. The treatments were: sterilised / sterilised then inoculated with core microbiota / sterilised then reared in control water (unknown microbial composition). All were sampled at two separate time points.

I started the analysis with demultiplexed paired-end sequences, then ran them through the DADA2 pipeline, truncating at 220 for R and 170 for F (see the attached demux quality plot) to keep a minimum quality of 25.

Weirdly, almost all of my samples (except the control) are nearly identical (by the taxa bar plot and the diversity indices featured in the tutorial) in both alpha and beta diversity, even the sterile treatment(?!). I suspect contamination, but since the control at the first time point is significantly different, and I do see some slight differences between groups across the time points, I am not sure this is the case.

(See the similarity between samples; the columns with different patterns are the control.)

I have two immediate questions:

  1. The DADA2 stats summary indicates I am losing about 30-50% of my raw reads after DADA2 (input # minus non-chimeric #). Am I truncating too much and missing the required overlap? If so, what would be a better approach? Just using the F reads by themselves? Could this be the reason for these weird results?

  2. Is there a way of automatically identifying contamination in the data? (For instance, automatically sorting an OTU table to locate known groups associated with contamination.)
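On the second question: QIIME 2 itself doesn't flag contaminants automatically (the R package decontam, which uses negative controls or DNA concentrations, is a common tool for that), but you can at least exclude taxa you already suspect from a feature table with q2-taxa. A minimal sketch; the taxon names and file names below are illustrative placeholders, not a vetted contaminant list:

```shell
# Sketch: exclude suspected contaminant taxa from a feature table.
# ASSUMPTIONS: file names are placeholders; the genera listed are only
# examples of common reagent contaminants, not a definitive list.
qiime taxa filter-table \
  --i-table table-dada2.qza \
  --i-taxonomy taxonomy.qza \
  --p-exclude "g__Ralstonia,g__Burkholderia" \
  --o-filtered-table table-no-suspects.qza
```

You would still need to decide, from your negative controls and the literature, which taxa actually count as contaminants in your system.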

Any information would be helpful.

Hey @LH22,

Quick question on that: do you still have primers/non-biological sequences on the reads (they should be removed)? That’s a really common way to lose a lot of reads at the chimera-checking step.

I removed both the F and R primers using trim-left f/r (17 for both). The data I received was demultiplexed, so barcodes were already removed.

Also, a minor correction: truncation was 220 for f and 170 for r.

I suspect the primers are still in the reads; you can set trim-left to their length to get rid of them. Ideally you’ll see far fewer chimera removals afterwards.

I removed the primers on both; see the code below:

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs /Users/qiime2_pipline/demux-paired-end.qza \
  --p-trim-left-f 17 \
  --p-trunc-len-f 220 \
  --p-trim-left-r 17 \
  --p-trunc-len-r 170 \
  --p-n-threads 0 \
  --o-representative-sequences /Users/qiime2_pipline/2_DADA/rep-seqs-dada2.qza \
  --o-table /Users/qiime2_pipline/2_DADA/table-dada2.qza \
  --o-denoising-stats /Users/qiime2_pipline/2_DADA/stats-dada2.qza

Also, if you look at the DADA2 stats file, you can see that for many of the samples there is a significant drop in reads after the filtering and merging steps, and not so much after the chimera-removal step.

@LH22, comparing this with my own experience, the only area I would worry about is merging; I think the loss there is too high. The loss after filtering is still close to the ranges I have seen (I often see up to 9k reads filtered out), but the number of reads lost after merging in your snapshot is a bit scary to me (if your screenshot is representative of all samples). I often see only a couple of hundred reads lost at the merging step.

Yet your trim and trunc parameters look fine to me (they match what I would also tend to do) given your quality graph. In total, your forward reads contribute around 203 bases (220 − 17) across all samples.
Can you reduce to 150 bp in another pipeline run and see if there is a difference in your taxa plots? The original DADA2 authors, in several of their papers I have seen, tend to use just 150 bp even after a 2×250 bp sequencing run. Just a suggestion I would try.


Apologies @LH22, the 150 bp trimming suggestion was about Deblur denoising, not DADA2. Sorry.

I think the example screenshot looks pretty good across the board. If you wanted to try to get better merge results, you could add a dozen or so bases to the final trunc-len and see if that helps.
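To sanity-check the trunc-len advice, you can compute the expected overlap directly from the trim/trunc settings. A back-of-the-envelope sketch, assuming a ~253 bp insert (typical for 16S V4, which is confirmed further down the thread) and DADA2's default minimum overlap of 12 bp:

```shell
# Back-of-the-envelope merge-overlap check.
# ASSUMPTION: the amplicon insert between primers is ~253 bp (16S V4).
# DADA2 requires roughly 12 bp of overlap to merge a read pair.
TRIM_F=17; TRUNC_F=220   # forward: 17 bp primer trimmed, truncated at 220
TRIM_R=17; TRUNC_R=170   # reverse: 17 bp primer trimmed, truncated at 170
AMPLICON=253             # assumed insert length; substitute your region's

FWD_BASES=$((TRUNC_F - TRIM_F))   # usable forward bases
REV_BASES=$((TRUNC_R - TRIM_R))   # usable reverse bases
OVERLAP=$((FWD_BASES + REV_BASES - AMPLICON))
echo "forward=${FWD_BASES} reverse=${REV_BASES} expected_overlap=${OVERLAP}"
```

With these numbers the expected overlap comes out around 103 bp, comfortably above DADA2's ~12 bp minimum, which supports the view that the truncation settings themselves are not what is breaking the merge (low-quality read tails are a more likely culprit).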

Hi, is the problem solved? Which 16S region did you target? Is your control the sterile treatment?


I ran DADA2 again on only the F reads (single-end) and lost somewhat fewer reads, on average a 10-20% higher yield. Still losing quite a lot, though. The output statistics/taxonomy didn’t really change (as expected).
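For reference, a forward-only run like the one described can be sketched with denoise-single. The input artifact name here is a placeholder, and the trim/trunc values are carried over from the earlier paired command:

```shell
# Sketch: forward-reads-only denoising, which sidesteps merging entirely.
# ASSUMPTIONS: a single-end demux artifact exists; all paths are placeholders.
qiime dada2 denoise-single \
  --i-demultiplexed-seqs demux-single-end.qza \
  --p-trim-left 17 \
  --p-trunc-len 220 \
  --p-n-threads 0 \
  --o-representative-sequences rep-seqs-single.qza \
  --o-table table-single.qza \
  --o-denoising-stats stats-single.qza
```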

I used the V4 region of 16S. All of my treatments went through a sterilization protocol. For the control, regular rearing water was used for post-sterilization rearing. I do see higher variation in the control (as expected), but I am also seeing quite high diversity within, and unusual similarity between, the sterile groups, which is quite peculiar…

I asked because merging seemed to be where you think the problem is, but is it possible to look at how the taxonomy assignment performed? Since the worry is about why the taxonomies came out so similar.

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.