Decontamination of 16S data in qiime2

Dear all,

I would like to ask about the decontamination process in qiime2. Apologies if I have missed a similar discussion. As a QC step during library prep, we usually include a negative/blank control (taken through the same DNA extraction process as the samples), which we then spike with a known concentration of a bacterium from a species we don’t expect to find in our biological samples (e.g. cyanobacteria). Once we get the data, we remove contaminants from the biological samples by generating an OTU table for the spiked controls separately from the OTU table for the biological samples. We then subtract the ‘contaminant’ OTU reads found in the spiked control from the matching taxa in the biological samples. If the spiked controls were run in duplicate (as I have tried to illustrate below), the average number of ‘contaminant’ reads is subtracted from the biological sample. If a biological-sample OTU that matches a contaminant OTU has fewer reads than the average contaminant read count, that OTU is removed from the biological sample entirely.
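For readers following along, the subtraction rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this post, not the in-house pipeline's actual code. The function names, the use of plain dictionaries keyed by OTU ID, and the treatment of an OTU missing from one control replicate as zero reads are all assumptions.

```python
def mean_control_counts(control_replicates):
    """Average reads per OTU across spiked-control replicates.

    Assumption: an OTU absent from a replicate counts as 0 reads there.
    """
    n = len(control_replicates)
    if n == 0:
        return {}
    otus = set().union(*control_replicates)  # union of the dict keys
    return {otu: sum(rep.get(otu, 0) for rep in control_replicates) / n
            for otu in otus}


def subtract_controls(sample_counts, control_replicates):
    """Apply the subtraction rule to one biological sample.

    - OTU not seen in the controls: kept unchanged.
    - OTU with fewer reads than the control mean: removed entirely.
    - Otherwise: the control mean is subtracted from its read count.
    """
    background = mean_control_counts(control_replicates)
    cleaned = {}
    for otu, reads in sample_counts.items():
        mean_reads = background.get(otu, 0)
        if reads < mean_reads:
            continue  # fewer reads than the average contaminant count
        cleaned[otu] = reads - mean_reads
    return cleaned


# Hypothetical counts: two control replicates and one biological sample.
controls = [{"otu_a": 10, "otu_b": 4}, {"otu_a": 20}]
sample = {"otu_a": 100, "otu_b": 1, "otu_c": 50}
print(subtract_controls(sample, controls))
```

In this made-up example, otu_a loses the control mean of 15 reads, otu_b is dropped because its single read is below the control mean of 2, and otu_c is untouched because it never appears in the controls.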

This has been working well for us when we run our data using an in-house 16S Nextflow pipeline generating OTU tables.

I would like to know whether this kind of process can be applied in qiime2 with sequence variants. Any other way to achieve this in qiime2 would also be really helpful.

Thank you.

Regards,

Hi @Flutomia,
See topic for a discussion of various decontamination methods:

This is a tricky topic and there are no great solutions. For what you describe, it sounds like you could use qiime feature-table filter-features or qiime taxa filter-table to remove specific features. However, I would not advise this approach. See that discussion thread for more detail, but basically there are many reasons why sequences will appear in your negative control, and not all of them are reagent contamination. If they come from cross-contamination from your real samples, “index hopping”, or “cross-talk”, then they represent sequences that are genuine, true signal in your other samples. Removing them across the board would be bad! Even subtracting a specified number of reads is a bad idea, because that assumes that sequence count is closely correlated with biomass. It is not. So I would advise against that approach.
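One way to see the point about read counts and biomass is a toy calculation. The sketch below assumes, purely for illustration, that reads are drawn in proportion to template abundance and that every library is sequenced to the same depth. The function name and all numbers are hypothetical.

```python
def expected_contaminant_reads(other_templates, contaminant_templates, depth):
    """Expected contaminant reads under proportional sampling.

    Toy assumption: each read is drawn in proportion to template abundance,
    so the contaminant's share of reads equals its share of templates.
    """
    total = other_templates + contaminant_templates
    return depth * contaminant_templates / total


DEPTH = 10_000       # reads per library (hypothetical)
CONTAMINANT = 100    # the same absolute contamination in every tube

# Spiked blank: 900 spike-in templates plus the contaminant.
blank = expected_contaminant_reads(900, CONTAMINANT, DEPTH)
# High-biomass sample: 99,900 genuine templates plus the contaminant.
high_biomass = expected_contaminant_reads(99_900, CONTAMINANT, DEPTH)
# Low-biomass sample: 900 genuine templates plus the contaminant.
low_biomass = expected_contaminant_reads(900, CONTAMINANT, DEPTH)

print(blank, high_biomass, low_biomass)  # 1000.0 10.0 1000.0
```

With identical absolute contamination, the blank shows 1000 contaminant reads while the high-biomass sample shows only 10. Subtracting the blank's count from that sample would remove far more than the contamination contributed, including any genuine reads for the same feature.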

See that discussion… there are some other possible solutions out there. And if you find other software that addresses this problem, please feel free to contribute to that discussion!

I hope that helps!

Thank you very much! Yes, it did help.
