I ran an analysis pipeline following the "Moving Pictures" tutorial from start to end on gut samples of fish that went through a sterilisation process. The treatments are: sterilised / sterilised then inoculated with core microbiota / sterilised with control water (unknown microbial composition). All were sampled at two separate time points.
I started the analysis with multiplexed unpaired sequences, then ran them through the DADA2 pipeline, truncating at 220 for R and 170 for F (see demux quality plot attached) to reach a minimum quality of 25.
Weirdly, almost all of my samples (except the control) are nearly identical (by the taxa bar plot and the diversity indices featured in the tutorial) in both alpha and beta diversity, even the sterile treatment(?!). I suspect contamination, but since the control of the first time point is significantly different, and I do see some slight differences between groups across the time points, I am not sure this is the case.
(See similarity between samples; columns with different patterns are the control.)
I have two immediate questions:
The DADA2 stats summary indicates I am losing about 30-50% of my raw reads after DADA2 (input# - non chimeric#). Am I truncating too much and missing the required overlap? If so, what would be a better approach? Just using the F strand by itself? Could this be the reason for these weird results?
Is there a way of automatically identifying contamination in the data? (For instance, automatically sorting an OTU table to locate known groups associated with contamination.)
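On the second question, a very rough first pass is to flag features whose assigned genus appears on a list of suspects. The sketch below is only that: the genus list is a short, illustrative, non-exhaustive set of genera often reported as reagent contaminants, the `g__` prefix assumes a Greengenes-style taxonomy string, and the function names and example data are hypothetical. A name match does not prove contamination, and dedicated tools that use negative controls (for example the `decontam` R package) are generally a better basis for a decision.

```python
# Illustrative only: a short, non-exhaustive list of genera often reported as
# reagent contaminants. Replace it with a list suited to your lab and kits.
SUSPECT_GENERA = {"Ralstonia", "Bradyrhizobium", "Sphingomonas", "Methylobacterium"}


def genus_of(taxonomy):
    """Return the genus from a 'k__...; g__Name' style string, or None.

    Assumes Greengenes-style rank prefixes (an assumption about your classifier).
    """
    for rank in taxonomy.split(";"):
        rank = rank.strip()
        if rank.startswith("g__") and len(rank) > 3:
            return rank[3:]
    return None


def flag_suspects(table, taxonomy, suspects=SUSPECT_GENERA):
    """Flag features whose genus is in `suspects`.

    table:    {feature_id: {sample_id: count}}
    taxonomy: {feature_id: taxonomy string}
    Returns (feature_id, genus, total_count) tuples, largest total first.
    """
    flagged = []
    for feature_id, counts in table.items():
        genus = genus_of(taxonomy.get(feature_id, ""))
        if genus in suspects:
            flagged.append((feature_id, genus, sum(counts.values())))
    return sorted(flagged, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the thread.
    table = {"f1": {"s1": 5, "s2": 7}, "f2": {"s1": 100}}
    taxonomy = {"f1": "k__Bacteria; g__Ralstonia", "f2": "k__Bacteria; g__Cetobacterium"}
    for row in flag_suspects(table, taxonomy):
        print(row)
```

Flagged features that are abundant in the sterile groups but absent or rare in the control would be the ones to look at first.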
@LH22, comparing this with my own experience, the only area I would be worried about is merging; I think the loss there is too much. The loss after filtering is still close to ranges I have seen (I often see up to 9k reads filtered out), but the read counts after merging in your snapshot are a bit scary to me (if your shot is representative of all samples). I often see just a couple hundred reads after merging.
Yet your trim and trunc parameters look fine (they are what I would also tend to use), judging from your quality graph. In total you are comparing around 203 bases across all samples.
Can you reduce to 150 bp in another pipeline run and see if there is a difference in your taxa plots? The original authors of DADA2, in several of their papers I have seen, tend to use just 150 bp even after a 2x250 bp sequencing run. Just a suggestion I would try.
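Whether a given pair of truncation lengths still leaves enough overlap to merge is simple arithmetic: forward length plus reverse length minus amplicon length. The sketch below assumes, purely for illustration, an amplicon of about 253 bp (typical for the 515F/806R V4 primers; substitute your own value, measured the same way as your reads, i.e. including primers if they are still on the reads). The minimum of 12 is the default `minOverlap` of DADA2's `mergePairs`; the function name is hypothetical.

```python
MIN_OVERLAP = 12     # default minOverlap in DADA2's mergePairs
AMPLICON_LEN = 253   # assumption: typical V4 amplicon length, replace with yours


def merge_overlap(trunc_f, trunc_r, amplicon_len=AMPLICON_LEN):
    """Bases shared by the forward and reverse reads after truncation.

    A result below the required minimum (or negative) means the reads
    cannot be merged at these truncation lengths.
    """
    return trunc_f + trunc_r - amplicon_len


if __name__ == "__main__":
    for trunc_f, trunc_r in [(170, 220), (150, 150)]:
        overlap = merge_overlap(trunc_f, trunc_r)
        verdict = "ok" if overlap >= MIN_OVERLAP else "too short"
        print(f"F={trunc_f} R={trunc_r}: overlap {overlap} bp ({verdict})")
```

Under that assumed amplicon length, both the 170/220 truncation and a 150/150 truncation leave well over 12 bp of overlap, which would suggest that a large merging loss has some cause other than truncation alone.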
Ran DADA2 again on only the F strands (single end) and lost somewhat fewer reads: on average a 10%-20% higher yield. Still losing quite a lot though. Output statistics/taxonomy didn't really change (as expected).
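To see where the remaining loss happens, it can help to look at the fraction of reads kept at each step relative to the step before it, rather than only input versus non-chimeric. A minimal sketch, assuming the stats table uses the column names `input`, `filtered`, `denoised`, `merged` and `non-chimeric` (a single-end run has no `merged` column, which the function tolerates); the function name and the counts in the example are hypothetical.

```python
STEPS = ["input", "filtered", "denoised", "merged", "non-chimeric"]


def step_retention(row, steps=STEPS):
    """Fraction of reads kept at each step, relative to the previous step.

    row: {column name: read count} for one sample. Columns that are
    absent (e.g. 'merged' in a single-end run) are skipped.
    """
    present = [step for step in steps if step in row]
    kept = {}
    for previous, current in zip(present, present[1:]):
        kept[current] = row[current] / row[previous] if row[previous] else 0.0
    return kept


if __name__ == "__main__":
    # Hypothetical counts for one sample, not taken from the thread.
    sample = {"input": 10000, "filtered": 8000, "denoised": 7800,
              "merged": 5000, "non-chimeric": 4900}
    kept = step_retention(sample)
    for step, fraction in kept.items():
        print(f"{step}: {fraction:.1%} of previous step kept")
    print("largest loss at:", min(kept, key=kept.get))
```

If the filtering step dominates the loss in both the paired and single-end runs, the truncation and quality settings are the place to look; if the merging step dominates only in the paired run, the overlap or read quality near the read ends is the more likely culprit.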
I used the V4 region of 16S. All of my treatments went through a sterilization protocol. For the control, regular rearing water was used for post-sterilization rearing. I do see higher variation in the control (as expected), but I am also seeing quite a high diversity within, and unusual similarity between, the sterile groups, which is quite peculiar…