In the past, I have filtered out sequences from my samples that were found in controls (PCR blank control), using the amazing support offered here in the forum (MOCK sample and PCR blanks). For this approach, the same control was related to a set of samples, making it easier to remove potential contaminants.
I just received another dataset with several FASTQ files, which are actually bronchoalveolar lavage (BAL) samples. For this study, a channel wash from the scope (saline) was collected before each BAL procedure started. So each channel wash control relates to only one BAL sample.
So, I’d like to filter out the sequences in each pair of samples (channel wash --> BAL sample).
Just a quick word of caution here: this process isn't without its own caveats, and you risk filtering out relevant biological signal. There is a lot of discussion from @Nicholas_Bokulich sprinkled throughout the forum on this topic, where he goes into a bit more detail about some of these considerations.
Okay, with that caveat out of the way...
I can't think of a particularly easy way to filter, because it sounds like this is a paired-sample approach (please correct me if I misunderstood!).
I think it would look like this:
1. For each sample (BAL set and channel wash set), filter the feature table & representative sequences to just that particular sample. You will wind up with one feature table and one rep-seqs file per sample (2 * n samples).
2. For each sample pair:
   a) Filter your sequences to exclude the features found in the wash sample.
   b) Filter your feature table to only include the features left in the filtered sequences.
3. Recombine all of your rep-seqs & all of your feature tables.
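The steps above can be sketched outside of QIIME 2 with pandas. This is only a rough illustration, not a QIIME 2 feature: it assumes the feature table has already been exported to a DataFrame (rows are samples, columns are features), and the function name `remove_paired_contaminants`, the sample IDs, and the feature IDs are all made up for the example. Rather than splitting and re-merging files, it zeroes out, for each BAL sample, every feature seen in its own channel wash, which should give the same end result.

```python
import pandas as pd


def remove_paired_contaminants(table, pairs):
    """Remove features from each BAL sample that appear in its paired wash.

    table: DataFrame of counts, rows = sample IDs, columns = feature IDs.
    pairs: dict mapping each BAL sample ID to its channel wash sample ID
           (hypothetical structure, built by hand or from metadata).
    """
    # Keep only the BAL samples in the output table.
    filtered = table.loc[list(pairs.keys())].copy()
    for bal_id, wash_id in pairs.items():
        # Features observed at least once in this sample's own wash.
        contaminants = table.columns[table.loc[wash_id] > 0]
        # Drop them from the paired BAL sample only.
        filtered.loc[bal_id, contaminants] = 0
    # Discard features that are no longer observed in any BAL sample.
    return filtered.loc[:, filtered.sum(axis=0) > 0]


# Toy data: two BAL samples, each with its own channel wash.
table = pd.DataFrame(
    [[10, 0, 5, 3],
     [2, 0, 0, 0],
     [0, 7, 4, 1],
     [0, 0, 4, 0]],
    index=["BAL1", "wash1", "BAL2", "wash2"],
    columns=["featA", "featB", "featC", "featD"],
)
pairs = {"BAL1": "wash1", "BAL2": "wash2"}

result = remove_paired_contaminants(table, pairs)
print(result)
```

Note that `featC` is removed from BAL2 (it is in wash2) but kept in BAL1 (it is not in wash1), which is the point of the paired approach. The representative sequences would then be filtered to the feature IDs left in `result.columns`.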
Thanks for your feedback. Our idea is to do a paired-sample approach, as we are considering that any sequences found in the channel wash for a particular sample should be considered potential contaminants. To do this, I’d have to do the same steps for each sample (one by one). I’m just wondering if it would be possible to use a script inside QIIME 2 to do these filtering steps and merge the reads automatically.
Matt already mentioned this, but while removing contamination using controls sounds easy, it's really hard to do well and has unintended consequences. Be careful!
Or maybe you could treat your controls like a treatment group. In this method, you keep all your samples in your cohort, but when you perform statistical testing, you pass Treatment (channel wash control vs real sample) along with your main biological categories. When you get the results, you will be able to compare the effect size of biological differences like pH or timepoint to the artificial differences (channel wash control vs real sample). Plugins like qiime longitudinal should elegantly support this.
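As a rough sketch of what the metadata for this could look like, the snippet below builds a small sample metadata table with pandas. The column names `sample_type` and `pair_id` and the sample IDs are made up for illustration; only the `sample-id` header follows the QIIME 2 metadata convention. The biological columns (pH, timepoint, and so on) would be added alongside these.

```python
import pandas as pd

# Hypothetical pairing: each BAL sample and its own channel wash.
pairs = {"BAL1": "wash1", "BAL2": "wash2"}

rows = []
for i, (bal_id, wash_id) in enumerate(pairs.items(), start=1):
    # Both members of a pair share a pair_id; sample_type tells them apart.
    rows.append({"sample-id": bal_id, "sample_type": "BAL", "pair_id": f"pair{i}"})
    rows.append({"sample-id": wash_id, "sample_type": "channel_wash", "pair_id": f"pair{i}"})

metadata = pd.DataFrame(rows).set_index("sample-id")

# Tab-separated text, which could be saved as a metadata file.
metadata_tsv = metadata.to_csv(sep="\t")
print(metadata_tsv)
```

The `sample_type` column can then be passed to statistical tests as the control-vs-sample category, and `pair_id` keeps track of which wash belongs to which BAL sample.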
Let me know what you think! I'm also curious how @ebolyen would make use of matched negative controls.
I don't think I have any particularly great advice on the topic. Your suggestion of tracking via metadata sounds like a good way to know if there is a problem. It may also be worth checking out this preprint: https://www.biorxiv.org/content/early/2017/11/17/221499
It uses control samples and frequencies to identify contamination, but it's not available in QIIME 2 (yet!).