I'm trying to figure out how to approach decontaminating my data, and I'm not sure how to do so.
I've got 71 samples: some are stool samples and the others are blood samples. In addition to those, I've got two kit controls (one for the stool samples and one for the blood samples), plus positive and negative PCR controls.
Is it possible to decontaminate this data set as a whole, given the different biological sample types (stool/blood)?
The bottom line is that I want to decontaminate these samples according to the kit controls and get a data set to continue with.
This is a great question! How many negative controls do you have for the stool and blood samples? How many positive controls do you have, and what's inside of them?
I ask because this has been discussed for the last couple of years, and we don't have a perfect solution that works every time. It depends on both the controls you have, and what kind of contaminants you are hoping to remove.
I think you are off to a good start. Having both a kit control and a known positive control is great, as it lets you set a baseline for what's going on inside your runs. The problem is still pretty hard...
I've been thinking about what Nick said, all the way back in 2017:
Like, you mention,
Sure, you could get the ASVs from your negative controls, then remove them from all other samples using this plugin. But there's a catch!
The Illumina platform suffers from index-hopping between samples, and the most abundant amplicons are the most likely to 'hop' into other samples because of a mismatched barcode. This means that a perfectly empty negative control would still get some reads assigned to it by the Illumina sequencer, and essentially be a tiny subset of all your real amplicons. Removing these real amplicons is not a good idea.
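To make the catch concrete, here is a minimal sketch in plain Python. It is not the plugin mentioned above, and the sample names, ASV names, and read counts are all made up for illustration. It assumes a negative control that picked up a few reads of the dominant gut ASV through index-hopping, then shows what happens if you drop every ASV seen in that control.

```python
# Toy feature table: sample -> {ASV: read count}. All values are hypothetical.
table = {
    "stool_1": {"ASV_gut": 9000, "ASV_rare": 120, "ASV_reagent": 3},
    "stool_2": {"ASV_gut": 8500, "ASV_rare": 0, "ASV_reagent": 5},
    # The 12 ASV_gut reads stand in for index-hopping from the real samples.
    "kit_negative": {"ASV_gut": 12, "ASV_rare": 0, "ASV_reagent": 40},
}
controls = {"kit_negative"}


def asvs_seen_in_controls(table, controls):
    """Return every ASV with at least one read in any negative control."""
    return {
        asv
        for sample in controls
        for asv, count in table[sample].items()
        if count > 0
    }


def drop_asvs(table, controls, to_drop):
    """Return the non-control samples with the given ASVs removed."""
    return {
        sample: {asv: n for asv, n in counts.items() if asv not in to_drop}
        for sample, counts in table.items()
        if sample not in controls
    }


flagged = asvs_seen_in_controls(table, controls)
filtered = drop_asvs(table, controls, flagged)

# Compare total reads in the real samples before and after the naive removal.
reads_before = sum(sum(table[s].values()) for s in filtered)
reads_after = sum(sum(counts.values()) for counts in filtered.values())
fraction_lost = 1 - reads_after / reads_before

print(sorted(flagged))
print(f"{fraction_lost:.1%} of sample reads removed")
```

In this toy case the dominant ASV gets flagged along with the reagent contaminant, so nearly all of the real reads disappear. That is the failure mode described above.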
The issue is that technical / artificial variation and biological / real variation all look the same to the sequencer. Sometimes there are recognizable patterns of technical variation that you can identify as noise and then remove. However, the real biology can be noisy too.
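As one example of such a pattern, a contaminant often makes up a much larger share of the reads in a negative control than it does in the real samples, while an amplicon that merely hopped in shows the opposite. Here is a rough sketch of that heuristic in plain Python. It is only an illustration, not the method of any particular plugin: the data are made up, and the `min_ratio` cutoff of 10 is an arbitrary assumption you would need to tune.

```python
# Toy feature table: sample -> {ASV: read count}. All values are hypothetical.
table = {
    "stool_1": {"ASV_gut": 9000, "ASV_rare": 120, "ASV_reagent": 3},
    "stool_2": {"ASV_gut": 8500, "ASV_rare": 0, "ASV_reagent": 5},
    "kit_negative": {"ASV_gut": 12, "ASV_rare": 0, "ASV_reagent": 40},
}
controls = {"kit_negative"}


def relative_abundance(counts):
    """Convert read counts for one sample into fractions of that sample."""
    total = sum(counts.values())
    return {asv: (n / total if total else 0.0) for asv, n in counts.items()}


def mean_relative_abundance(table, samples):
    """Average each ASV's relative abundance over the given samples."""
    profiles = [relative_abundance(table[s]) for s in samples]
    asvs = {asv for profile in profiles for asv in profile}
    return {
        asv: sum(p.get(asv, 0.0) for p in profiles) / len(profiles)
        for asv in asvs
    }


def enriched_in_controls(table, controls, min_ratio=10.0):
    """Flag ASVs whose mean share in controls is at least min_ratio times
    their mean share in real samples (min_ratio is an arbitrary assumption)."""
    samples = [s for s in table if s not in controls]
    in_controls = mean_relative_abundance(table, sorted(controls))
    in_samples = mean_relative_abundance(table, samples)
    return {
        asv
        for asv, frac in in_controls.items()
        if frac > 0 and frac >= min_ratio * in_samples.get(asv, 0.0)
    }


likely_contaminants = enriched_in_controls(table, controls)
print(sorted(likely_contaminants))
```

With these toy numbers only the reagent ASV is flagged, and the dominant gut ASV is kept even though a few of its reads show up in the control. Real data will be messier, especially for low-biomass samples like blood, where contaminants can make up a large share of the real samples too.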
So, what do you think is the source causing your contamination? How can you tell it apart from real biological variation?