Contamination and High Biomass samples


It has been a while since I have posted! It's nice to engage with y'all again, and I hope your pandemic experience is going okay.

I have some macaque samples from 2018. I ran EMP primers for 16S on a MiSeq for 12 experimental animals and four control animals, with weekly samples for each animal over three months. My samples include stool and vaginal mucosa. The stool DNA isolation was run manually, and the negative control had undetectable DNA by NanoDrop. The other mucosal samples' DNA isolations were run robotically. The DNAs are organized in plates by date (so all animals on that date are together) with at least two negative controls: for each date I processed, I took a PBS aliquot and ran it through everything, and I also had a water sample for each DNA extraction. I ran the PCRs manually in triplicate, but in a plate with a multichannel pipette. The samples are spread across three MiSeq runs and grouped by location: one run has the vaginal samples, one has the stool samples, and a third is half stool, half vaginal.

Currently, I am looking at stool from one time point to figure out how I am going to process these samples. The stool looks similar in composition to other published macaque stool microbiomes. However, I did notice that the PBS (negative control) contains the same genera found in the stool samples, at lower abundances (for instance, I found a genus at 30% in stool but 3% in water). I am assuming these are contaminants carried from stool (high biomass) into water (low biomass), as these taxa have been found in previous stool samples and the higher-abundance taxa show up in water in lower numbers. This has also been reported in a couple of other QIIME posts (How to deal with contaminations that might be partly real, Contaminated samples or Bioinformatics problem? [16s v4 515F/806R] - #2 by jwdebelius, and some others).

I am still hoping to get useful information out of this dataset, so I have been thinking about how to address this cross-contamination issue. Here is the general plan of what I am going to do; please let me know if I am missing anything or if you have extra suggestions.

  1. Merge reads and trim the primers
  2. Quality control, but ensure they are all the same length
  3. Denoise with dada2 (run each MiSeq run separately - or should I run them by sample type (stool or vaginal)?)
  4. Filter ASVs based on relative abundance (i.e., if there is 30% of a genus in a stool sample and 3% in the associated negative control for that date, I would subtract 30% - 3% = 27%). I am also considering running the source identifiers in QIIME, as suggested in other posts, and decontam. Should I rarefy/limit samples to the same number of reads?
  5. Run taxonomy and all diversity analyses with all the samples across multiple MiSeq runs together, using feature tables corrected against the negative controls.
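
To make the arithmetic in step 4 concrete, here is a minimal sketch of the proposed subtraction. The taxon names, the numbers, and both helper functions are hypothetical, and this is only an illustration of the idea, not an established method: relative abundances in a low-biomass control and a high-biomass sample are not on the same scale, so the result should be treated as a rough heuristic.

```python
# Sketch of step 4: subtract the negative control's relative abundance
# from the matching sample, taxon by taxon. All names and values below
# are made up for illustration.

def subtract_control(sample, control):
    """Subtract control relative abundances from a sample, flooring at zero."""
    corrected = {}
    for taxon, abundance in sample.items():
        corrected[taxon] = max(abundance - control.get(taxon, 0.0), 0.0)
    return corrected


def renormalize(profile):
    """Rescale a profile so its relative abundances sum to 1 again."""
    total = sum(profile.values())
    if total == 0:
        return dict(profile)
    return {taxon: value / total for taxon, value in profile.items()}


# Hypothetical stool sample and the PBS blank processed on the same date.
stool = {"GenusA": 0.30, "GenusB": 0.50, "GenusC": 0.20}
pbs_blank = {"GenusA": 0.03, "GenusD": 0.97}

corrected = subtract_control(stool, pbs_blank)
rescaled = renormalize(corrected)

for taxon in sorted(corrected):
    print(f"{taxon}: corrected={corrected[taxon]:.3f} rescaled={rescaled[taxon]:.3f}")
```

Taxa that appear only in the blank (GenusD here) are ignored, and anything that would go negative is floored at zero.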

I know there are still conversations happening around this and that, even now, the published literature isn't very clear on how to correct for contaminants in stool samples. Is this approach appropriate? What am I not thinking of? I want to be transparent about my data and make it reproducible.


Hi @hbussan,

I’ve found several strategies to be quite useful when it comes to identifying reagent contaminants in marker-gene surveys. Here’s my two cents:

  1. Follow the general principles outlined in the decontam paper: (1) sequences from contaminating taxa are likely to have frequencies that inversely correlate with sample DNA concentration and (2) sequences from contaminating taxa are likely to have higher prevalence in control samples than in true samples. Quantifying the bacterial DNA in your samples, say by qPCR, is really helpful when screening contaminants based on principle 1 (sample bacterial DNA concentration negatively correlates with the relative abundance of contaminants). Combining the two methods gives you good confidence when classifying contaminating features.

  2. Consult the literature. In my experience, there’s always a certain degree of cross-contamination between samples and negative controls. Checking which taxa are commonly reported in your type of biological sample and in negative controls really helps when you can’t make a decision based on the aforementioned principles. Common reagent contaminants have been reported in many independent studies; see, for example, the studies by Salter et al., 2014 and Weyrich et al., 2018, and the review by Eisenhofer et al., 2019.

  3. Make use of correlations between reagent contaminants. As reagent contaminants are introduced into biological samples in a fixed ratio, these taxa usually show strong correlations with one another (de Goffau et al., 2018). After identifying some reagent contaminants with absolute confidence, you can run a correlation analysis to find taxa that correlate with them, which gives you a list of potential contaminants to work on.
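
The two principles in point 1 can be screened with a few lines of code. The sketch below is a minimal illustration, assuming hypothetical qPCR concentrations and relative abundances. It is not the statistical model that decontam actually uses; it just computes a Spearman rank correlation (principle 1) and prevalence in controls versus samples (principle 2) so that candidate features can be inspected by hand. The same `spearman` helper could be reused for the taxon-to-taxon correlations in point 3.

```python
# Sketch: flag candidate contaminants for manual inspection.
# All data below are hypothetical.
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman correlation, computed as Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))


def prevalence(abundances):
    """Fraction of samples in which the taxon is present at all."""
    return sum(1 for a in abundances if a > 0) / len(abundances)


# Hypothetical bacterial DNA concentration per sample (e.g. from 16S qPCR).
dna_conc = [50.0, 20.0, 5.0, 1.0, 0.2]
# Relative abundance of two taxa in the same five samples.
taxon_x = [0.001, 0.004, 0.02, 0.10, 0.40]   # rises as DNA drops
taxon_y = [0.30, 0.28, 0.33, 0.29, 0.31]     # roughly flat

rho_x = spearman(dna_conc, taxon_x)
rho_y = spearman(dna_conc, taxon_y)

# Principle 2: compare prevalence in negative controls and true samples.
taxon_x_in_controls = [0.50, 0.62, 0.41]
prev_controls = prevalence(taxon_x_in_controls)
prev_samples = prevalence(taxon_x)

print(f"taxon_x rho={rho_x:.2f}, taxon_y rho={rho_y:.2f}")
print(f"taxon_x prevalence: controls={prev_controls:.2f}, samples={prev_samples:.2f}")
```

A strongly negative correlation, as for `taxon_x` here, only marks a feature as worth a closer look; the decision to remove it should still come from checking it manually.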

A word of caution: removing taxa from your analysis may critically influence your results. Always check the “contaminant” features before you remove them. Personally, I prefer to do it manually by inspecting the distribution of taxa found in the negative controls and their correlations with sample bacterial DNA concentration. Here’s the link to the GitHub repository showing what I did in one of my studies.



Just a few comments (note, I am the author of the well-to-well (w2w) contamination paper and the low-biomass KatharoSeq paper). My expertise is mostly in molecular biology.

Attempting to measure w2w contamination is really only possible after sequencing. NanoDrop has a really narrow range (it can’t detect < 1-5 ng/ul), which is quite a high biomass of microbial cells. Even if you used a Qubit, you wouldn’t be able to detect microbial contamination unless it’s really bad. You could try a 16S qPCR if you really wanted to spot check, but in general it’s always best to sequence and then assess. Regarding w2w and low biomass, in the papers I recommend always including a range of positive controls, namely a serial dilution from ~1e8 cells down to 1 cell. You can choose any isolate, no problem; just do 10-fold serial dilutions to extinction and then process these samples through DNA extraction, PCR, and sequencing. For sample processing, carrying all samples through in equal volumes, for both DNA and PCR libraries, will also help you establish criteria for discerning contaminants (read counts do correlate with original biomass, particularly at the low end of the range, but you need equal volumes all around to make this comparison).
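
As a small illustration of the positive-control series described above, this sketch lists the expected cell input at each step of a 10-fold dilution from 1e8 cells down to 1 cell. The function name and its defaults are hypothetical, chosen only to mirror the numbers in the text.

```python
# Sketch: expected cell input per tube for a 10-fold dilution series
# running from 1e8 cells down to a single cell.

def dilution_series(start_cells=10**8, fold=10, end_cells=1):
    """Return the cell input at each dilution step, highest first."""
    series = []
    cells = start_cells
    while cells >= end_cells:
        series.append(cells)
        cells //= fold
    return series


series = dilution_series()
for step, cells in enumerate(series):
    print(f"dilution {step}: ~{cells:.0e} cells")
```

That gives nine positive-control tubes per series, each of which goes through DNA extraction, PCR, and sequencing alongside the real samples.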

Removing ASVs is a very delicate process because often (as you noted) the ‘contaminants’ do in fact come from your actual samples. I would be very cautious about tossing out ASVs.

Do you know how the samples were processed? If equal volumes were used, you may be able to follow the general KatharoSeq guidelines and use a read-count threshold to exclude certain samples.
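
The read-count idea can be sketched roughly as follows. This is a simplified stand-in rather than the published KatharoSeq procedure: as I understand it, the paper fits a curve to the positive-control dilution series, whereas this sketch just takes the lowest total read count at which the controls are still dominated by the expected isolate. The 50% cutoff, the control values, and the sample names are all assumptions made for illustration.

```python
# Sketch: derive a minimum read count from positive controls, then use
# it to exclude samples. All numbers are hypothetical.

def read_count_threshold(controls, min_target_fraction=0.5):
    """Find the lowest total read count that still looks trustworthy.

    controls: list of (total_reads, reads_assigned_to_expected_isolate).
    Walks from the deepest control downwards and stops at the first one
    where the expected isolate falls below min_target_fraction.
    """
    threshold = None
    for total, target in sorted(controls, reverse=True):
        if total == 0 or target / total < min_target_fraction:
            break
        threshold = total
    return threshold


def keep_samples(read_counts, threshold):
    """Keep only samples with at least `threshold` reads."""
    if threshold is None:
        return {}
    return {name: n for name, n in read_counts.items() if n >= threshold}


# Hypothetical dilution-series controls: (total reads, on-target reads).
positive_controls = [
    (40000, 39600),
    (35000, 34000),
    (12000, 10000),
    (3000, 1800),
    (900, 300),
    (400, 40),
]
threshold = read_count_threshold(positive_controls)

sample_reads = {"stool_A": 25000, "vaginal_B": 4100, "pbs_blank": 650}
kept = keep_samples(sample_reads, threshold)
print("threshold:", threshold)
print("kept:", sorted(kept))
```

This only makes sense if equal volumes were used throughout, since that is what lets read counts stand in for starting biomass.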



Hi @jjminich! Thanks for your response and for your work with well to well contamination!

I have sequenced the samples already. Unfortunately, we can’t redo them because all the sequencing was done in 2018. Based on what I am seeing, the stool samples seem to have contaminated the DNA-free water negative controls.

As far as sample processing goes, I added an equal volume of sample (300 ul) to each DNA isolation and then an equal volume of sample to each PCR reaction. I added 10 ul of each barcoded sample to the sequencing library (except for two low-yield samples within a run, for which I added 20 ul) to get between 200 and 500 ng of each sample in the MiSeq run.

Would KatharoSeq be appropriate given the wet lab preparation?


I am currently conducting a DNA analysis on lemurs in Madagascar, and I would like to learn about the steps you took and how your analysis of the macaque stool and mucosa samples has progressed.

Which instruments did you use to measure the DNA concentration?