Filter Controls or Not to Filter Controls

Hello Everyone,

I have been trying to determine the best way to filter my stream biofilm eDNA samples. I included negative extraction controls and PCR controls in my library prep. There seems to be a lot of debate over the best way to remove features found in control samples. I agree that removing any feature found in a control could introduce bias, but I also don’t feel comfortable completely ignoring features that appear in my controls. Should we be filtering control features that represent a certain % of total reads? I was hoping we could use the bright minds on this forum to discuss the best way to filter features found in control samples. I know we might not all share the same opinion, but I think this is an important topic to standardize across projects.



I love this question, and I think the new qiime forums are a great place to have this discussion.

First, let’s clarify why removing all features that appear in negative controls is a bad idea. The Illumina platform has known cross-contamination between samples on the machine itself, and more abundant amplicons are more likely to be the reads that cross into other samples. This means that a ‘perfect’ negative control would still get some reads assigned to it by the Illumina sequencer, essentially making it a tiny subsample of all the real amplicons. Removing the most abundant, real amplicons from your entire study is not a good solution.

Subtracting a constant from your entire feature table might work better. For example, if your negative control has a real amplicon that appears 200 times in that sample, you could subtract 200 from every value in your entire feature table. This flat reduction allows your most common features to remain while removing less common features (after all, they might just be cross-contamination, like what was observed in that negative control).
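As a toy sketch of that flat subtraction (the table, sample names, and the 200-read threshold below are all invented for illustration):

```python
import pandas as pd

# Hypothetical feature table: rows = features (ASVs), columns = samples.
table = pd.DataFrame(
    {"sample_A": [500, 190, 30], "sample_B": [800, 150, 0]},
    index=["asv_1", "asv_2", "asv_3"],
)

# Highest count of a real amplicon observed in the negative control.
negative_control_max = 200

# Subtract that constant everywhere, clipping at zero so counts stay valid.
filtered = (table - negative_control_max).clip(lower=0)
```

Abundant features survive (asv_1 keeps 300 and 600 reads here), while features below the threshold drop to zero.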

Removing sources of biological contamination is harder. For example, if you found human-associated microbes in a lake sediment sample, are they coming from the researcher during the extraction process, or from humans swimming in the lake? How do you differentiate environmental contamination (real) from technical contamination (fake)?

This is a great question. I would love to hear what other people have tried.


PS Did you do any positive controls? Community or isolates?


@colinbrislawn unfortunately we did not include any positive controls in our study. We did not account for cross contamination by the sequencer and figured any contamination would be obvious in our negatives.

I like the idea of subtracting a constant, but how would you calculate the correct minimum read count to use for an entire feature table?


Thanks for posting this question, @Stream_biofilm. This question comes up fairly often (in variants) on the forum, and I wish there were an easy solution, but there is not (that I am aware of). Currently, there are no methods for contaminant detection that rely entirely on negative controls.

That sums up what is, unfortunately, the issue at the heart of this discussion: any solution posted here is just an opinion unless it is validated by the extensive testing that would be required to create an "easy" method for contaminant detection based on negative controls.

@colinbrislawn does an exquisite job of outlining the core issue: cross-indexing (technical error) is known to occur on Illumina and other sequencing platforms and is impossible to eliminate. Cross-contamination is also prevalent (though easier to control, as it is largely human error). So using negative controls alone, it is impossible to absolutely distinguish real contamination from fake (i.e., cross-indexing or other technical error, as @colinbrislawn points out).

@Alexandra_Bastkowska shared some of her ideas in this thread, and I think her approach is about the best you can do right now with only negative controls. In a nutshell: carefully examine your data to determine which features are obvious contaminants (e.g., known reagent contaminants, or taxa not typical of your sample type; this may be possible for some sample types but not others).

This is not a good approach, as @mortonjt and @ebolyen have discussed in this thread. Reads and sampling depths are uneven across samples, so the absolute abundance of a feature in your blank does not correspond to its abundance (even for a real contaminant) in other samples. It also does not address the issue that many features observed in a negative control may be cross-contaminants (or cross-indexed reads) from another sample, so removing a set quantity from all samples will bias the proportions of true positives in those samples.


I really like the term 'cross-indexing' to differentiate issues on the sequencers from environmental 'cross-contamination.'

Given that it's hard to differentiate cross-contamination and true biological similarity, my lab has mostly focused on cross-indexing, as it's a tractable problem.

You can measure it using the monoculture positive control that you included in the run.

For example, if you have a batch of human samples, you may include a saltwater microbe as a positive control. After the run has finished, you can easily evaluate cross-indexing in two different ways.

  1. Did my positive control end up in my human samples?
  2. Did my most abundant human reads end up in my positive control?
  3. (Bonus!) Is my ASV denoising algorithm doing a good job resolving my single, known microbe in my positive control?
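The first two checks are easy to script against a feature table. A minimal sketch, assuming a pandas DataFrame with ASVs as rows and samples as columns (all names and counts here are made up):

```python
import pandas as pd

# Hypothetical feature table; "control_asv" is the single known sequence
# from the monoculture positive control.
table = pd.DataFrame(
    {
        "human_1": [9500, 12, 0],
        "human_2": [8700, 0, 5],
        "positive_ctl": [40, 9800, 0],
    },
    index=["human_asv", "control_asv", "rare_asv"],
)

# 1. Did the positive-control ASV end up in the human samples?
leaked_into_samples = table.loc["control_asv", ["human_1", "human_2"]]

# 2. Did abundant human reads end up in the positive control?
leaked_into_control = table.loc["human_asv", "positive_ctl"]

# Baseline cross-indexing rate: fraction of positive-control reads
# observed outside the positive-control sample.
rate = leaked_into_samples.sum() / table.loc["control_asv"].sum()
```

Samples whose leakage is far above that baseline rate are the ones to worry about.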

Based on the observed frequency of cross-indexing errors, you can identify any samples that are massively contaminated, and also measure the baseline cross-indexing error rate.

But how do you remove this cross-indexing error from all other samples? The solution that I think would work best is outlined by @ebolyen here:

Given that filtering is inelegant, maybe it's better to leave in contamination and control for it statistically, as @mortonjt suggested:


I totally agree! I'm just trying to move the discussion beyond 'Filter Controls or Not to Filter Controls'.


Thank you, everyone, for your thoughts. With qiime 1, I read that removing OTUs with abundances (reads) less than 10 across all samples was standard. Does this apply to ASVs from dada2?

That approach was fairly specific to the quality filtering / OTU picking strategies available in qiime1 circa 2012. A lot has changed since then, most notably denoising methods like dada2 for Illumina data, which use more sophisticated approaches for eliminating spurious reads. This thread has more discussion of the abundance filtering methods in qiime1 (and the original paper benchmarking that approach), and whether or not they are appropriate with dada2 (probably not, but it's untested).

So no, I would not recommend performing an abundance filter after denoising with dada2, unless such an approach is validated by benchmarking on useful datasets.
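For concreteness, the qiime1-era filter being discussed (drop any feature with fewer than 10 reads summed across all samples) amounts to something like the sketch below; again, this is not a recommendation, and the table is invented:

```python
import pandas as pd

# Toy feature table: rows = features, columns = samples.
table = pd.DataFrame(
    {"s1": [100, 4, 7], "s2": [50, 3, 1]},
    index=["asv_1", "asv_2", "asv_3"],
)

# qiime1-style total-abundance filter: keep features with >= 10 reads
# summed across all samples.
min_total = 10
kept = table[table.sum(axis=1) >= min_total]
```

Here asv_2 (7 total reads) and asv_3 (8 total reads) would be discarded, which is exactly the behavior that is untested for dada2 ASVs.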


One more option to consider: We recently released a method - decontam - that identifies contaminants based on two statistical signatures: higher frequency in low-concentration samples, and higher prevalence in negative controls. You can read more details in the preprint: Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data.

Right now decontam is available as an R package. In the future we intend to make decontam available through the q2-quality-control plugin, but I can’t give you a timeline on that yet.
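To illustrate just the prevalence signature (this is not decontam's actual implementation, which is more sophisticated; it's a toy Python example using Fisher's exact test on invented presence/absence counts):

```python
from scipy.stats import fisher_exact

# Toy counts for one feature: in how many negative controls vs. real
# samples is it detected at all?
present_in_negatives, absent_in_negatives = 4, 1   # 4 of 5 negatives
present_in_samples, absent_in_samples = 3, 27      # 3 of 30 samples

# A feature far more prevalent in negatives than in real samples looks
# like a contaminant under the prevalence signature.
_, p_value = fisher_exact(
    [[present_in_negatives, absent_in_negatives],
     [present_in_samples, absent_in_samples]],
    alternative="greater",
)
is_contaminant = p_value < 0.05
```

This feature appears in 80% of negatives but only 10% of samples, so the one-sided test flags it.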


Now that there are new functions in q2-quality-control to assess the accuracy of mock community estimation, I’m really looking forward to the decontam plugin in qiime 2 :slight_smile:


Thanks for mentioning this, @yanxianl! For those who have not seen, these new actions are evaluate-composition and evaluate-seqs. These are designed with mock communities in mind, but could also be useful for testing simulated communities or other sample types with an “expected” composition/sequences.

We hope to have decontam as a new action in the q2-quality-control plugin by early 2018. Stay tuned!


This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.