Contaminated samples or Bioinformatics problem? [16s v4 515F/806R]

Hi all

Here is a brief overview of the study:
• 16s, V4, Primers: 515F/806R
• 80 biological samples (lizard faeces, cloacal swabs) + 22 negative controls (6 blank swabs, 11 extraction blanks, 5 PCR blanks)
• Quantification was performed after DNA extraction, PCR, cleanup, and pooling.
• PCR product was pooled at equal nanomolar concentrations
• Sequencing: Illumina MiSeq 150 x 2
• Data generated by sequencing company with BCL2FASTQ2 conversion software.
• Data was received as demultiplexed fastq files in pairs (read1 and read2) with adapters already trimmed.

Qiime2 steps
• Imported data to qiime2 artifact
• Denoising, reads joined, & ASV table constructed with DADA2

The problem:
• Various metrics showed not only no difference between treatments, but no difference between the negative controls and the biological samples.
• I filtered the features occurring in the controls from the biological samples, this however resulted in feature frequency dropping by 90% and the remaining features occurring in only a few samples each.
• I repeated this and only filtering out the PCR Blank features, but the results were similar.

rep-seqs.qzv (750.9 KB) rep-seqs-features-allControlsFiltered.qzv (599.8 KB) rep-seqs-features-PCRBlanksFiltered.qzv (715.4 KB) table.qzv (591.1 KB) table-features-allControlsFiltered.qzv (571.6 KB) table-features-PCRBlanksFiltered.qzv (622.5 KB)

I can’t work out why, but it seems like my samples are predominantly comprised of contamination (despite quantification during the labwork indicating this shouldn’t be the case).

• Is there something obvious I’m missing or doing wrong during the bioinformatics that could cause this issue. Any ideas?
• Do the filtered feature tables have any useful data? Or are they just sequencing artifacts?


1 Like

Hi @Carl_Watson,

Welcome to the forum!

During your processing, did you by chance use robotic extraction or PCR? Do your blanks look more like the samples they’re next to?
Well-to-well contamination is a know issue, you might want to look at this paper about it.

If you think its a cross-contamination issue (sample splashing into control), I wouldn’t filter based on your negative controls. But, you may also want to check out some of the contaminant filtering threads (these were just the top hits on my search).

EDIT: One more question, what various metrics are showing no difference?


1 Like

Hi Carl,
Welcome in the forum and enjoy qiime!

Looking at your sequencing design, are you sure you using MiSeq 150x2 sequencing? If it is indeed this sequencing length, I don’t expect many sequence pairs would join and that may explains the result you are seeing. Do you have the denoising stats?
The way out of this would be performing the analysis using only R1 (or R2 as you prefer).
If the sequencing length is 2x250, it is probably good to have a look at the denoising stats anyway to see if it looks alright.
If you see that your samples include most of the reads after the denoising, we may look changing your denoising settings.



Thanks for your replies @jwdebelius & @llenzi :slight_smile:

No, the processing was done manually.
Looking at the taxa-bar-plots, sorting by extraction order & PCR order, I can't see any pattern that would imply well to well contamination.

Sample labels: [Extraction batch] [extraction number] [sample-type]

Sample labels: [PCR batch] [Extraction batch] [extraction number] [sample-type]

All of these :confused:
alpha-rarefaction-94154.qzv (684.4 KB) bray_curtis_emperor.qzv (812.2 KB) evenness-group-significance.qzv (376.4 KB) faith-pd-group-significance.qzv (377.5 KB) jaccard_emperor.qzv (812.1 KB) unweighted_unifrac_emperor.qzv (812.5 KB) unweighted-unifrac-mixed-original-significance.qzv (308.1 KB) unweighted-unifrac-treatment-group-significance.qzv (377.6 KB) unweighted-unifrac-treatment-x-event-significance.qzv (609.0 KB) weighted_unifrac_emperor.qzv (812.6 KB)
If you wanted to have a look, the relevant metadata columns to sort by are:

  • subject-type (negative control vs real sample)
  • sample-type (looking at the different types of negative controls and real samples)
  • treatment-group (comparing different treatments applied to the real samples)

@llenzi, I'm quite certain it was MiSeq 150x2 sequencing.
demux-paired-end.qzv (294.3 KB) table.qzv (591.1 KB) rep-seqs.qzv (750.9 KB) denoising-stats.qzv (1.2 MB)

One thing stands out to me when I look at the taxa-bar-plot: the composition of samples within treatment groups is inconsistent, but across groups seems to be very similar. And what is more perplexing is that this pattern is the same when comparing the treatment groups to the negative controls.... taxa-bar-plots.qzv (2.6 MB) taxonomy.qzv (1.7 MB)

Phylum level

Family level

Some comments on the labwork

  • Quantification after DNA extraction (with a Quantus Fluorometer) showed the extraction blanks as ~0 ng/ul ("Lower than blank")
  • Likewise, the gel images showed amplification of the expected fragment size for all the real samples, and only a faint band for the negative controls.
  • Quantification (with Qbit) after PCR supported the gel images.

Barcodes have been checked again and confirmed to correspond with the original source, the fastq files and the metadata.

Could this be a problem with the sequencing itself?


Hi @Carl_Watson,

I agree with @llenzi on the sequence length stuff. What I'm seeing doesn't make any sense.

At 2x150 you have, at most, 300 bp. Back of the envelope on 806-515 = 291. 300-291=9, which is too short for DADA2 to merge the sequences. So, yeah, that's particularly weird as I look at your data and makes me wonder what happened in the merge. Like, you should not have seen any sequences.

So, then, where did you get your feature table with lovely merged sequences? (It doesn't solve the fact that your demux summary still shows high sequence counts in the blanks which will be denoised in DADA2).


Index hopping is a possibility, but it seems unlikely to this degree.

Have you checked those taxa against common contaminants? My initial intution says they don't look like what I might expect. (Although my gut i sa bad check on negative controls and you should look at Salter et al, 2014 and citing articles as a starting place on this, as well as some fo the threads above). Howe does your data compare to existing samples from a similar environment? Like, are there other studies of lizard clocacal or fecal samples you can compare to?

Although, I don't recommend relying on visual taxonomic comparisons for that: Im a big proponent of comparisons in non-phylogenetic metrics to see if samples share a lot of features to estimate distance. But, based on your PCoAs, the controls sit right in there, so Id guess they're very similar. I might still calculate the distance between samples and your technical variation using a mantel test. Just drop the controls that don't have an extraction position.


1 Like

Hi again @jwdebelius & @llenzi.

The 150x2 design was based on advice I was given (explained by this Illumina document).

I repeated the analysis on only the R1's, and the results seem very similar (at least to my eye).
The denoising stats seem consistent between paired & single.
denoising-stats.qzv (1.2 MB) rep-seqs.qzv (641.9 KB) table.qzv (594.0 KB) demux-single-end.qzv (289.7 KB)

Likewise with the diversity metrics, there still doesn't seem to be a difference between the negative controls and the real samples. (except the blank-swabs, which all have low read counts)
evenness-group-significance.qzv (376.3 KB) faith-pd-group-significance.qzv (376.9 KB) bray_curtis_emperor.qzv (810.5 KB) jaccard_emperor.qzv (810.5 KB) unweighted_unifrac_emperor.qzv (810.8 KB) weighted_unifrac_emperor.qzv (810.9 KB)

The most abundant taxa (at least) are mostly the same too.
taxa-bar-plots.qzv (2.8 MB) taxonomy.qzv (1.7 MB)

I need to look more into this, but I am finding common contaminants among the most abundant taxa.

Even if they are common contaminants, I don't understand how the negative controls are indistinct from the real samples (given the quantification during labwork). Or am I misinterpreting the metrics?


So, I think this is your key piece:

If they are common contaminates and you’re working in a low biomass system, then your blanks may be indistinguishable from your samples. If you’re working in human fecal samples (high biomass) and your negative controls look like your samples, then that’s presumably splash over (a particular problem with robots). However, if you think you might be in a low biomass system (which is something you can learn from the literature around your particular host species and sample type) and your negative controls contain a lot of contamination and your samples look like your negative controls, then I would be concerned about that. The quality of the labwork doesn’t solve the fact that there is unavoidable reagent contamination in this field and extraction kit has a large effect on that (See the Salter paper I linked above and Sinha et al, 2017 for more discussion).

If it is a biomass issue, I’m not a great person for discussion because my primary systems are high biomass, but it may be worth looking into techniques for things like the build environment or human skin.



It’s lizard cloacal swabs & faeces. As far as I understand, they should be high biomass samples.

I expected to find some contamination in the negative controls, but I also expected that the real samples would contain more than just these contaminants.

I’ve manged to get the bcl files from my sequencing provider, so I’m going to run the conversion and demultiplexing myself to see if that was the source of the problem.