Filter Controls or Not to Filter Controls

Nicholas_Bokulich · November 14, 2017, 4:23pm

Thanks for posting this question @Stream_biofilm this is a question that comes up a bit (in variants) on the forum and I wish there were an easy solution but there is not (that I am aware of). Currently, there are not any methods for contaminant detection that rely entirely on negative controls.

That sums up what is, unfortunately, the issue at the heart of this discussion — any solutions posted here are just an opinion unless if they are validated by extensive testing that would be required to create an "easy" method for contaminant detection based on negative controls.

@colinbrislawn does an exquisite job of outlining the core issue: that cross-indexing (technical error) is known to occur on Illumina and other sequencing platforms and is impossible to eliminate. Cross-contamination is also prevalent (though easier to control as it is largely human error). So using negative controls alone it is impossible to absolutely distinguish real from fake contamination (i.e, cross-indexing or other technical error as @colinbrislawn points out).

@Alexandra_Bastkowska shared some of her ideas in this thread and I think her approach is about the best you can do right now with only negative controls. In a nutshell: you need to carefully examine your data to determine what features are obvious contaminants (e.g., known reagent contaminants, not typical in your sample types, which may be possible for some sample types but not others).

This is not a good approach, as @mortonjt and @ebolyen have discussed in this thread. Reads and sampling depths are uneven across samples so the absolute abundance of a feature in your blank does not correspond at all to the abundance (even of a real contaminant) in other samples. This also does not really address the issue that many of the features observed in a negative control may be cross-contaminants (or cross-indexed) from another sample and hence removing a set quantity from all samples will bias the proportions of true positives in those samples.