I have been trying to determine the best way to filter my stream biofilm eDNA samples. I included negative extraction controls and PCR controls in my library prep. There seems to be a lot of debate over the best way to remove features found in control samples. I agree that removing any feature found in a control could be biased, but I also don’t feel comfortable completely ignoring features that appear in my controls. Should we be filtering control features that represent a certain % of total reads? I was hoping we could use the bright minds on this forum to discuss the best way to filter features found in control samples. I know we might not all share the same opinion, but I think this is an important topic to standardize across projects.
I love this question, and I think the new qiime forums are a great place to have this discussion.
First, let’s clarify why removing all features that appear in negative controls is a bad idea. The Illumina platform has known cross-contamination between samples on the machine itself, and more abundant amplicons are more likely to be the reads that end up crossing into other samples. This means that even a ‘perfect’ negative control would still get some reads assigned to it by the Illumina sequencer, and would essentially be a tiny subsample of all the real amplicons. Removing the most abundant, real amplicons from your entire study is not a good solution.
Subtracting a constant from your entire feature table might work better. For example, if your negative control contains a real amplicon that appears 200 times in that sample, you could subtract 200 from every value in your feature table. This flat reduction still allows your most common features to remain, while removing less common features (after all, they might just be cross-contamination, as was observed in that negative control).
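To make the idea concrete, here is a minimal sketch of that flat subtraction. The table, feature names, and the constant of 200 are all invented for illustration; this is just one way someone might implement the approach described above, not a vetted method.

```python
import numpy as np

# Hypothetical feature table: rows = features, columns = samples.
# All values here are made up for illustration.
table = np.array([
    [5000, 4200, 180],  # abundant, presumably real feature
    [300,   250,  40],  # moderately common feature
    [150,     0,   0],  # rare feature, possibly cross-contamination
])

# The count of the real amplicon observed in the negative control
# (200 in the example above) becomes the subtraction constant.
constant = 200

# Subtract everywhere and clip at zero so counts stay non-negative.
filtered = np.clip(table - constant, 0, None)

print(filtered)
```

Note how the abundant feature survives with reduced counts, while the rare feature is zeroed out entirely.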
Removing sources of biological contamination is harder. For example, if you found human-associated microbes in a lake sediment sample, are these coming from the researcher during the extraction process, or from humans swimming in the lake? How do you differentiate environmental contamination (real) from technical contamination (fake)?
This is a great question. I would love to hear what other people have tried.
PS Did you do any positive controls? Community or isolates?
@colinbrislawn unfortunately we did not include any positive controls in our study. We did not account for cross contamination by the sequencer and figured any contamination would be obvious in our negatives.
I like the idea of subtracting a constant, but how would you calculate the correct minimum read count to use for an entire feature table?
Thanks for posting this question @Stream_biofilm. This question comes up fairly often (in variants) on the forum, and I wish there were an easy solution, but there is not (that I am aware of). Currently, there are no methods for contaminant detection that rely entirely on negative controls.
That sums up what is, unfortunately, the issue at the heart of this discussion: any solutions posted here are just opinions unless they are validated by the extensive testing that would be required to create an “easy” method for contaminant detection based on negative controls.
@colinbrislawn does an exquisite job of outlining the core issue: cross-indexing (technical error) is known to occur on Illumina and other sequencing platforms and is impossible to eliminate. Cross-contamination is also prevalent (though easier to control, as it is largely human error). So using negative controls alone, it is impossible to definitively distinguish real from fake contamination (i.e., cross-indexing or other technical error, as @colinbrislawn points out).
@Alexandra_Bastkowska shared some of her ideas in this thread, and I think her approach is about the best you can do right now with only negative controls. In a nutshell: you need to carefully examine your data to determine which features are obvious contaminants (e.g., known reagent contaminants, or taxa not typical of your sample type; this may be feasible for some sample types but not others).
This is not a good approach, as @mortonjt and @ebolyen have discussed in this thread. Reads and sampling depths are uneven across samples so the absolute abundance of a feature in your blank does not correspond at all to the abundance (even of a real contaminant) in other samples. This also does not really address the issue that many of the features observed in a negative control may be cross-contaminants (or cross-indexed) from another sample and hence removing a set quantity from all samples will bias the proportions of true positives in those samples.
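A quick back-of-the-envelope calculation shows why a flat subtraction interacts badly with uneven sequencing depth. The depths and counts below are invented for illustration:

```python
# The same absolute count represents very different proportions at
# different sequencing depths (all numbers here are hypothetical).
blank_count, blank_depth = 200, 10_000      # blank sequenced shallowly
sample_count, sample_depth = 200, 100_000   # real sample sequenced deeply

blank_fraction = blank_count / blank_depth      # 2% of the blank's reads
sample_fraction = sample_count / sample_depth   # 0.2% of the sample's reads

# Subtracting a flat 200 removes a tenfold larger fraction of the blank
# than of the deeper sample, so the "correction" is uneven across samples.
print(blank_fraction, sample_fraction)
```

So a count of 200 in a shallow blank and a count of 200 in a deep sample are not comparable quantities, which is exactly the objection raised above.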
That approach was fairly specific to the quality-filtering / OTU-picking strategies available in qiime1 circa 2012. A lot has changed since then, most notably denoising methods like dada2 for Illumina data, which use more sophisticated models to eliminate spurious reads. This thread has some more discussion of the abundance-filtering methods in qiime1 (and the original paper benchmarking that approach) and whether or not they are appropriate with dada2 (probably not, but it’s untested).
So no, I would not recommend performing an abundance filter after denoising with dada2, unless such an approach is validated by benchmarking on useful datasets.
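For readers unfamiliar with what is being discussed, a qiime1-style abundance filter simply drops features whose total frequency falls below a threshold. The table and the threshold of 5 reads below are invented; this is shown only to illustrate the mechanism, not as a recommendation (per the discussion above, it is probably not appropriate after dada2).

```python
import numpy as np

# Toy feature table (features x samples); all values are invented.
table = np.array([
    [900, 850, 700],
    [  3,   1,   0],  # very low-abundance feature
    [  0,   2,   1],  # very low-abundance feature
])

# Drop features whose total frequency across all samples is below a
# threshold (5 reads here is purely illustrative; qiime1-era benchmarks
# expressed the cutoff as a fraction of total reads instead).
min_total = 5
keep = table.sum(axis=1) >= min_total
filtered = table[keep]

print(filtered.shape)
```

Only the abundant feature survives; the two low-frequency features are discarded wholesale, which is exactly the behavior that denoisers like dada2 aim to handle in a more principled way.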
Thanks for mentioning it, @yanxianl! For those who have not seen them, these new actions are evaluate-composition and evaluate-seqs. They are designed with mock communities in mind, but could also be useful for testing simulated communities or other sample types with an “expected” composition/sequences.
We hope to have decontam as a new action in the q2-quality-control plugin by early 2018. Stay tuned!