Here is a brief overview of the study:
• 16S rRNA gene, V4 region, primers 515F/806R
• 80 biological samples (lizard faeces, cloacal swabs) + 22 negative controls (6 blank swabs, 11 extraction blanks, 5 PCR blanks)
• Quantification was performed after DNA extraction, PCR, cleanup, and pooling.
• PCR product was pooled at equal nanomolar concentrations
• Sequencing: Illumina MiSeq 150 x 2
• Data generated by sequencing company with BCL2FASTQ2 conversion software.
• Data was received as demultiplexed fastq files in pairs (read1 and read2) with adapters already trimmed.
QIIME 2 steps:
• Imported the data as a QIIME 2 artifact
• Denoised, joined reads, and constructed the ASV table with DADA2
The problem:
• Various metrics showed not only no difference between treatments, but no difference between the negative controls and the biological samples.
• I filtered the features occurring in the controls out of the biological samples. However, this dropped feature frequency by 90%, and the remaining features occur in only a few samples each.
• I repeated this, filtering out only the PCR blank features, but the results were similar.
I can’t work out why, but it seems like my samples are predominantly comprised of contamination (despite quantification during the labwork indicating this shouldn’t be the case).
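For anyone following along, here is a minimal sketch of the control-based filtering described above, assuming a toy feature table held in plain dictionaries. The sample IDs, feature IDs, and counts are all invented, and this is not the QIIME 2 command that was actually run; it only shows how removing every feature seen in a blank can wipe out most of the reads.

```python
# Toy feature table: sample ID -> {feature ID: read count}.
# All IDs and counts are invented for illustration.
table = {
    "lizard_1": {"asv_a": 900, "asv_b": 80, "asv_c": 20},
    "lizard_2": {"asv_a": 850, "asv_d": 150},
    "pcr_blank_1": {"asv_a": 700, "asv_b": 10},
}
controls = ["pcr_blank_1"]
biological = ["lizard_1", "lizard_2"]


def features_seen(table, samples):
    """Return every feature with at least one read in the given samples."""
    return {f for s in samples for f, n in table[s].items() if n > 0}


def total_reads(table, samples):
    """Sum read counts over the given samples."""
    return sum(sum(table[s].values()) for s in samples)


def drop_features(table, samples, unwanted):
    """Copy the given samples, leaving out the unwanted features."""
    return {
        s: {f: n for f, n in table[s].items() if f not in unwanted}
        for s in samples
    }


control_features = features_seen(table, controls)
filtered = drop_features(table, biological, control_features)

before = total_reads(table, biological)
after = total_reads(filtered, biological)
fraction_lost = 1 - after / before
print(f"reads removed by control filtering: {fraction_lost:.1%}")
```

In this toy case a single abundant feature shared with the blank accounts for nearly all of the loss, so it can help to check which features drive the drop before deciding how to filter.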
Questions:
• Is there something obvious I'm missing or doing wrong during the bioinformatics that could cause this issue? Any ideas?
• Do the filtered feature tables have any useful data? Or are they just sequencing artifacts?
During your processing, did you by chance use robotic extraction or PCR? Do your blanks look more like the samples they're next to?
Well-to-well contamination is a known issue; you might want to look at this paper about it.
If you think it's a cross-contamination issue (sample splashing into control), I wouldn't filter based on your negative controls. But you may also want to check out some of the contaminant filtering threads (these were just the top hits on my search).
EDIT: One more question, what various metrics are showing no difference?
Looking at your sequencing design, are you sure you are using MiSeq 150x2 sequencing? If it is indeed this read length, I wouldn't expect many read pairs to join, and that may explain the result you are seeing. Do you have the denoising stats?
The way out of this would be performing the analysis using only R1 (or R2 as you prefer).
If the sequencing length is 2x250, it is probably good to have a look at the denoising stats anyway to see if it looks alright.
If you see that your samples retain most of their reads after denoising, we can look at changing your denoising settings.
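As a rough illustration of what to look for in the denoising stats, here is a small sketch that flags samples where few denoised reads survive merging. The stage names follow the columns I would expect in a DADA2 stats table (input, filtered, denoised, merged, non-chimeric), but the dictionary layout, the sample IDs, the counts, and the 50% threshold are all assumptions made up for this example.

```python
# Invented per-sample read counts at each DADA2 stage (not real data).
stats = {
    "lizard_1": {"input": 40000, "filtered": 36000, "denoised": 35000,
                 "merged": 700, "non-chimeric": 650},
    "lizard_2": {"input": 42000, "filtered": 38000, "denoised": 37000,
                 "merged": 33000, "non-chimeric": 31000},
}


def merge_rate(row):
    """Fraction of denoised reads that were successfully merged."""
    return row["merged"] / row["denoised"] if row["denoised"] else 0.0


def poorly_merged(stats, threshold=0.5):
    """Return sample IDs whose merge rate falls below the threshold."""
    return sorted(s for s, row in stats.items() if merge_rate(row) < threshold)


for sample, row in stats.items():
    print(f"{sample}: {merge_rate(row):.1%} of denoised reads merged")
print("check these samples:", poorly_merged(stats))
```

A sharp drop between the denoised and merged stages would point at the read overlap; a drop at the filtering stage would point at quality or truncation settings instead.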
No, the processing was done manually.
Looking at the taxa-bar-plots, sorting by extraction order & PCR order, I can't see any pattern that would imply well-to-well contamination.
One thing stands out to me when I look at the taxa-bar-plot: the composition of samples within treatment groups is inconsistent, but across groups seems to be very similar. And what is more perplexing is that this pattern is the same when comparing the treatment groups to the negative controls. taxa-bar-plots.qzv (2.6 MB) taxonomy.qzv (1.7 MB)
I agree with @llenzi on the sequence length stuff. What I'm seeing doesn't make any sense.
At 2x150 you have, at most, 300 bp. Back of the envelope, the amplicon is 806 - 515 = 291 bp, so the reads would overlap by only 300 - 291 = 9 bp, which is too short for DADA2 to merge the sequences (its default minimum overlap is 12 bp). So, yeah, that's particularly weird as I look at your data, and it makes me wonder what happened in the merge. Like, you should not have seen any sequences.
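The back-of-the-envelope overlap check can be written out as a tiny sketch. The 291 bp amplicon length is the estimate from this thread (the true figure depends on whether the primers are part of the reads), the 12 bp constant is the default `minOverlap` of DADA2's `mergePairs`, and the function names are made up for illustration.

```python
# Default minimum overlap required by DADA2's mergePairs (minOverlap = 12).
DADA2_DEFAULT_MIN_OVERLAP = 12


def expected_overlap(read_length, amplicon_length):
    """Bases shared by the forward and reverse read of one amplicon."""
    return 2 * read_length - amplicon_length


def can_merge(read_length, amplicon_length,
              min_overlap=DADA2_DEFAULT_MIN_OVERLAP):
    """True if the expected overlap meets the minimum needed for merging."""
    return expected_overlap(read_length, amplicon_length) >= min_overlap


AMPLICON_LENGTH = 291  # thread's estimate: 806 - 515

for read_length in (150, 250):
    overlap = expected_overlap(read_length, AMPLICON_LENGTH)
    verdict = "mergeable" if can_merge(read_length, AMPLICON_LENGTH) else "too short"
    print(f"2x{read_length}: {overlap} bp overlap ({verdict})")
```

This ignores any truncation applied before merging, which would shorten the overlap further.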
So, then, where did you get your feature table with lovely merged sequences? (It doesn't solve the fact that your demux summary still shows high sequence counts in the blanks which will be denoised in DADA2).
Good!
Index hopping is a possibility, but it seems unlikely to this degree.
Have you checked those taxa against common contaminants? My initial intuition says they don't look like what I might expect. (Although my gut is a bad check on negative controls, and you should look at Salter et al, 2014 and citing articles as a starting place on this, as well as some of the threads above.) How does your data compare to existing samples from a similar environment? Like, are there other studies of lizard cloacal or fecal samples you can compare to?
Although, I don't recommend relying on visual taxonomic comparisons for that: I'm a big proponent of comparisons in non-phylogenetic metrics to see if samples share a lot of features to estimate distance. But, based on your PCoAs, the controls sit right in there, so I'd guess they're very similar. I might still compare the distance between samples against your technical variation using a Mantel test. Just drop the controls that don't have an extraction position.
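To make the Mantel idea concrete, here is a minimal pure-Python sketch that correlates a Jaccard (shared-feature) distance matrix with a distance matrix built from extraction positions, using a simple permutation test. The feature sets and positions are invented, the helper names are hypothetical, and in practice you would more likely use an existing implementation than this hand-rolled one.

```python
import math
import random
from itertools import combinations


def jaccard_distance(a, b):
    """1 - shared/union for two sets of feature IDs."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def distance_matrix(values, dist):
    """Square matrix of pairwise distances between the given values."""
    return [[dist(x, y) for y in values] for x in values]


def upper_triangle(matrix, order):
    """Pairwise distances after relabelling rows and columns by `order`."""
    return [matrix[order[i]][order[j]]
            for i, j in combinations(range(len(order)), 2)]


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mantel(m1, m2, permutations=999, seed=0):
    """Return (r, one-sided p) for the correlation between two matrices."""
    rng = random.Random(seed)
    identity = list(range(len(m1)))
    x = upper_triangle(m1, identity)
    observed = pearson(x, upper_triangle(m2, identity))
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute sample labels of the second matrix
        if pearson(x, upper_triangle(m2, order)) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Invented data: features per sample and extraction order of each sample.
# Samples without an extraction position would be left out before this step.
features = [{"a", "b"}, {"a", "b", "c"}, {"c", "d"}, {"d", "e"}]
positions = [1, 2, 3, 4]

community = distance_matrix(features, jaccard_distance)
technical = distance_matrix(positions, lambda p, q: abs(p - q))
r, p = mantel(community, technical)
print(f"Mantel r = {r:.3f}, p = {p:.3f}")
```

A strong positive correlation would suggest that samples processed near each other share more features than expected, which is the signature of a technical effect rather than a biological one.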
I repeated the analysis on only the R1 reads, and the results seem very similar (at least to my eye).
The denoising stats seem consistent between paired & single. denoising-stats.qzv (1.2 MB) rep-seqs.qzv (641.9 KB) table.qzv (594.0 KB) demux-single-end.qzv (289.7 KB)
I need to look more into this, but I am finding common contaminants among the most abundant taxa.
Even if they are common contaminants, I don't understand how the negative controls are indistinct from the real samples (given the quantification during labwork). Or am I misinterpreting the metrics?
If they are common contaminants and you're working in a low biomass system, then your blanks may be indistinguishable from your samples. If you're working in human fecal samples (high biomass) and your negative controls look like your samples, then that's presumably splash over (a particular problem with robots). However, if you think you might be in a low biomass system (which is something you can learn from the literature around your particular host species and sample type), your negative controls contain a lot of contamination, and your samples look like your negative controls, then I would be concerned about that. The quality of the labwork doesn't change the fact that there is unavoidable reagent contamination in this field, and the extraction kit has a large effect on that (see the Salter paper I linked above and Sinha et al, 2017 for more discussion).
If it is a biomass issue, I'm not a great person for discussion because my primary systems are high biomass, but it may be worth looking into techniques used for things like the built environment or human skin.
It's lizard cloacal swabs & faeces. As far as I understand, they should be high biomass samples.
I expected to find some contamination in the negative controls, but I also expected that the real samples would contain more than just these contaminants.
I've managed to get the BCL files from my sequencing provider, so I'm going to run the conversion and demultiplexing myself to see if that was the source of the problem.