Discussion: methods for removing contaminants and cross-talk


(Lewis) #1

Just to add further to this discussion: I’ve previously used decontam prior to it being incorporated into QIIME 2, and whilst I’m a huge fan of the principles it’s built around, it didn’t work fantastically for my samples. One caveat: I only attempted the prevalence method, with 2 negative controls (one kit negative and one sequencing negative) for around 20 samples, which is likely too few controls to have any real statistical power. I’m not sure what the ideal number of negative controls for the prevalence-based method would be. When playing around with the threshold parameter I couldn’t find a happy medium: I was either removing features that appeared to me to be potentially genuine, or keeping features that looked questionable to my eye. Going forward I would be interested to see how the plugin performs based on quantification data, and will hopefully be testing this in the coming weeks on new data.

In these uncertain times, I do like the suggestion of a sort of ‘lite’ version of contaminant removal, whereby you can show you’ve addressed the issue by removing a small group of clear contaminants, i.e. those described in the Salter paper on extraction kits. However, despite this being the safest option when addressing potential contaminants, it can also feel careless to leave in the taxon with 10 reads in sample A that had 50,000 reads in the negative control and is likely a result of Illumina cross-talk. Here I feel the only option is a commonsensical approach, although I note this isn’t reproducible. Perhaps we are now nearing the stage where a very large round table is required to produce published consensus guidelines on how to deal with these anomalies.
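For what it’s worth, even the commonsensical cross-talk filter can be written down so that it is at least reproducible. Here is a minimal, illustrative Python sketch (the per-feature 0.1% cutoff is an arbitrary number chosen for this example, not a published recommendation):

```python
# Crude cross-talk filter: zero out counts that fall below a small
# fraction of the feature's largest count across all samples, on the
# assumption that such trace counts are plausible index bleed-through.
# The 0.001 (0.1%) fraction is an arbitrary illustration.

def filter_crosstalk(table, frac=0.001):
    """table: {feature_id: {sample_id: count}}. Returns a filtered copy."""
    filtered = {}
    for feature, counts in table.items():
        cutoff = max(counts.values()) * frac
        filtered[feature] = {s: (c if c >= cutoff else 0)
                             for s, c in counts.items()}
    return filtered

# The scenario above: 50,000 reads in the negative control, 10 in sample A.
table = {"taxon1": {"neg_control": 50000, "sampleA": 10, "sampleB": 2000}}
print(filter_crosstalk(table))
# sampleA's 10 reads fall below the 50-read cutoff and are zeroed;
# sampleB's 2000 reads survive.
```

A script like this is trivially shareable alongside a manuscript, which addresses at least the reproducibility half of the complaint.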

So long as we are actively discussing the issues, working toward reproducible solutions and not butchering our data in the meantime, that’s all we can do! And in what seems like quite a progressive time for the field, I take considerable comfort in that.

Lewis


(Nicholas Bokulich) #2

Thanks for the report @lewisNU! I have not tried this method yet, or heard any reports on the forum from others who have given independent tests of it, so it is very good to hear your results.

Just to be clear: decontam is not yet incorporated in QIIME 2 and is not associated with the core :qiime2: development team. I am excited to try this method out and incorporate it into :qiime2: when I can — and am very excited to hear about what others are doing for contaminant detection and removal.

We don’t have a physical table, but this forum could be a great place to get that conversation started (it seems to be already, based on this conversation and earlier threads).

It could be really useful (in a new topic) to start this conversation and have folks share papers describing methods that they use for quality control. Ideally, we can form some consensus of community standards to recommend to others. Or, at the very least, we could work to incorporate a few of the top methods for QC/contaminant removal in :qiime2: so that we at least have better options that suit what our users need.


(Colin Brislawn) #3

Great idea!

When thinking about contamination removal, I want to start with known sources first, because our understanding of their mechanism gives us a starting point when we try to reverse their effect. So I’m nominating cross-talk as our first target for contaminant removal.

Of course Robert Edgar already has a method for this: software & paper PDF

I can contribute a large data set of positive controls, of a variety of microbes, grown in monoculture, over a period of years. This is an ideal data set for addressing how many external reads get added into a clonal sample by Illumina cross-talk.

Colin


I like this idea! Now, how could we build a benchmark that demonstrates that our common sense is doing a good job finding and removing contaminants? Because you know reviewer 3 has no common sense at all :wink:


(Nicholas Bokulich) #4

@colinbrislawn @lewisNU hope you don’t mind I decided to split this to a new topic, rather than leave this discussion buried in a user support thread. Let the round table begin.

Here’s another paper on silencing cross-talk: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5097354/

Have not used, but I like the simplicity.

Any thoughts? Any more tools to recommend?


(Colin Brislawn) #5

Here’s some more:

Paper PDF: Index Switching Causes “Spreading-Of-Signal” Among Multiplexed Samples In Illumina HiSeq 4000 DNA Sequencing
Paper PDF: An adaptive decorrelation method removes Illumina DNA base-calling errors caused by crosstalk between adjacent clusters

Illumina has a whitepaper on crosstalk on their own platform! PDF :stuck_out_tongue_winking_eye: :twisted_rightwards_arrows:

Colin


(Natasha Griffin) #6

I recently used decontam on a heavily contaminant-influenced dataset (mostly very low-biomass water samples). The prevalence method worked very well for me at the default threshold, removing the obvious contaminants and only a handful of sequences that I was unsure about. I had 16 negative controls (a mixture of filter blanks, extraction blanks, and PCR blanks) against 234 samples. Based on how it worked for me, I definitely prefer using prevalence-based decontam to removing suspected contaminant taxa myself.

I will note that the frequency method didn’t work particularly well for my samples, possibly because many actual samples had DNA concentrations below detection limit. Maybe people with higher-biomass samples will have better results using frequency-based decontam.


#7

Hi everyone,
If there are 10 blank control samples, 3 of which have a particular OTU, and 100 actual samples, 9 of which have that same OTU, decontam will almost surely classify it as a contaminant under the prevalence method, because 3/10 is greater than 9/100, and will remove it automatically if run with defaults.

Could this kind of pattern not be the result of cross-talk from a real OTU, rather than a contaminant? What are your thoughts?

I tend to think that, with decontam, it may still be useful to examine your data manually and check whether prevalence is actually high or not.
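To make the 3/10-vs-9/100 intuition above concrete: the 2×2 presence/absence comparison can be checked with a one-sided Fisher’s exact test. decontam’s internal prevalence score is computed differently, so this is only an illustrative stdlib-only sketch of the principle:

```python
from math import comb

def prevalence_p(blanks_pos, blanks_tot, samples_pos, samples_tot):
    """One-sided Fisher's exact test: the probability that at least
    blanks_pos of the blanks contain the feature by chance, given that
    blanks_pos + samples_pos of all libraries contain it."""
    K = blanks_pos + samples_pos      # libraries containing the feature
    N = blanks_tot + samples_tot      # all libraries
    n = blanks_tot                    # blank libraries ("draws")
    return sum(comb(K, k) * comb(N - K, n - k)
               for k in range(blanks_pos, min(K, n) + 1)) / comb(N, n)

# The scenario above: OTU present in 3/10 blanks vs 9/100 real samples.
p = prevalence_p(3, 10, 9, 100)
print(round(p, 3))  # ≈ 0.077: under a 0.1 cutoff, so it would be flagged
```

So the observation in the post is exactly right: at the commonly cited 0.1 threshold this OTU is called a contaminant, even though the same pattern is consistent with cross-talk from an abundant real OTU.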


(Colin Brislawn) #8

Hi!

That definitely sounds like cross-talk to me! The one twist is that if this OTU ‘crossed’ into 3/10 negative controls, then it’s pretty likely that it also crossed into other real samples, maybe even more than 9/100.

I’ve yet to review ‘decontam’ but I know this problem is very hard so I’m pretty skeptical of existing solutions.

Colin


#9

I was actually reviewing decontam on a real dataset, and I have observed your skepticism here over time!
Many believe, quite rightly, that the dominance or “prevalence” of an OTU in controls is evidence for removal. The dataset I was reviewing just highlighted for me that the prevalence cutoff should perhaps be decided per dataset.

So in the case of decontam, I have seen an OTU with 8% prevalence in blanks versus 3% in actual samples get removed (p below 0.1). Yet if the OTU were the result of cross-talk, then a real OTU is lost. Probably of interest in some kinds of data.


(Lewis) #10

An interesting read from the Fierer lab http://fiererlab.org/2018/08/15/garbage-in-garbage-out-wrestling-with-contamination-in-microbial-sequencing-projects/


(Ben Callahan) #11

As the developer of decontam, this is a super-useful thread to get some feedback on its use! A couple of comments:

Yes, that is to be expected. The discriminating power of the prevalence method is limited by the number of negative controls sequenced. With just 1 negative control it’s not even worth using, and just 2 is still too few to be very effective. In the revised decontam preprint we offer guidance of 5–6 sequenced negative controls, although things only improve further with more.
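One way to see the effect of control count on power, again using a one-sided Fisher’s exact test as a stdlib-only stand-in for decontam’s internal score (which is computed differently): take a contaminant present in every blank and in 30 of 100 true samples, and ask what the best-case p-value is for each number of blanks. The 30/100 figure is an arbitrary illustration.

```python
from math import comb

def best_case_p(n_blanks, samples_pos=30, samples_tot=100):
    """Best-case one-sided Fisher p for a contaminant found in ALL
    n_blanks blanks and in samples_pos of samples_tot true samples:
    the chance that every blank contains the feature by luck alone."""
    K = n_blanks + samples_pos        # libraries containing the feature
    N = n_blanks + samples_tot        # all libraries
    return comb(K, n_blanks) / comb(N, n_blanks)

for n in (1, 2, 5, 10):
    print(n, best_case_p(n))
# With 1 blank, p can never drop below ~0.31; with 2 blanks it barely
# clears 0.1 (~0.096); with 5 or more blanks there is real room to
# discriminate.
```

Under these assumptions, a single control cannot flag anything at a 0.1 cutoff, and two controls can only just manage it in the best case, which matches the 5–6 control guidance above.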

That’s good to hear! Especially because we would expect the prevalence method to work well with that many negative controls.

Also very useful; I had not thought of this situation, but it makes sense. If most of the samples are below the detection limit, there is basically no concentration information for most of them, so the frequency method won’t work as well. Note to self: add this situation to our guidance somewhere.

Yes, decontam will flag such OTUs (if supported by a high enough control-sample count). In principle I think it should, as the dominant cross-talk mechanisms are not expected to show up preferentially in negative controls.

decontam does not deal with cross-talk! The most convincing method I’ve seen on cross-talk so far is this: https://www.nature.com/articles/nmeth.4666


#12

Hi! Do you think the method applied in the paper, which focused on HiSeq and NextSeq cross-talk, could be appropriately applied to targeted microbiome sequencing, which is often done on a MiSeq? Thank you.


(Ben Callahan) #13

Yes, although their method is based on a dual indexing scheme, so is not applicable to single-indexed libraries.


#14

@benjjneb thanks for the replies. I was trying to understand why both the frequency and prevalence methods in decontam were not removing some ‘clear-cut contaminants’ when I examined some data processed in QIIME 2. Is it because they are only in one blank sample? Surely other OTUs were already removed from this data after passing through decontam.
Probably best to post on the GitHub?
But here is a shot.


(Ben Callahan) #15

Happy to help, but I think as decontam is not in Q2 (yet) that it’s best to move this discussion over to decontam’s issues forum: https://github.com/benjjneb/decontam/issues


(benjamin w.) #16

Is there a manuscript I can review? Thank you. Ben


(Ben Callahan) #17

For decontam? Yes, a preprint: Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data


#18

If I may ask, how do you know this did not work for your samples? Were no p-values generated, or how did you determine this? I am trying to understand.


(Nicholas Bokulich) #19

Cross-referencing another post here. @Mechah provided some great discussion of his lab’s protocols for contaminant removal from low-biomass samples in this topic:


(Zach Burcham) #20

Hi @natavicula,

I’m currently messing around with decontam based on this introduction:
https://benjjneb.github.io/decontam/vignettes/decontam_intro.html#identifying-contaminants-in-marker-gene-and-metagenomics-data

I also used the prevalence method, which detected a few contaminant sequences in my batches, but my question is: how did you go forward after identifying the contaminants? The intro just kind of stops after the identification, and I can’t seem to find the best way to remove them (unless the idea is that I just need to delete the rep-seqs that are contaminants). What did you do to remove your contamination after decontam identified it?

Thanks,
Zach
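To make the removal step Zach asks about concrete: once decontam has flagged features, removal amounts to dropping those feature IDs from the feature table (and the matching rep-seqs). A minimal Python sketch with hypothetical names, assuming the TRUE/FALSE calls from decontam’s isContaminant() have been exported per feature ID (this is not decontam’s own API, which is an R package):

```python
# Remove features flagged as contaminants from a feature table.
# 'is_contaminant' maps feature_id -> True/False, e.g. exported from
# the $contaminant column of decontam's isContaminant() output.
# Function and variable names here are hypothetical illustrations.

def drop_contaminants(table, is_contaminant):
    """table: {sample_id: {feature_id: count}}. Returns a copy with
    contaminant features removed from every sample."""
    return {sample: {f: c for f, c in feats.items()
                     if not is_contaminant.get(f, False)}
            for sample, feats in table.items()}

table = {"s1": {"featA": 120, "featB": 5},
         "s2": {"featA": 80, "featB": 0}}
calls = {"featA": False, "featB": True}   # featB flagged as contaminant
print(drop_contaminants(table, calls))
```

In a QIIME 2 workflow, the equivalent step would be feature-based filtering of the feature table and representative sequences using the list of flagged IDs; the sketch above just shows the underlying operation.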