I am currently working with Illumina 2x250bp V4 region amplicon data.
My DADA2 trimming and truncation parameters were as follows:
First_truncation:
--p-trim-left-f 10
--p-trim-left-r 10
--p-trunc-len-f 240
--p-trunc-len-r 215
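For reference, these flags were passed to qiime dada2 denoise-paired, roughly as follows (the artifact names here are placeholders, not my actual files):

# First_truncation run
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trim-left-f 10 \
  --p-trim-left-r 10 \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 215 \
  --o-table table-first-trunc.qza \
  --o-representative-sequences rep-seqs-first-trunc.qza \
  --o-denoising-stats stats-first-trunc.qza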
As a result, First_truncation had 3114 features and second_truncation had 2905 features, so the shorter truncation produced a higher feature count, which is consistent with previous forum discussion on this matter.
But my question is about negative-control filtering. First_truncation retained 30% of the negative-control reads, whereas second_truncation retained only 20%. These negative controls are blanks, and they are supposed to contain mostly non-real reads. Thus, I thought maybe I should use second_truncation for the downstream analysis. Would anyone provide some insight into this matter and offer guidance? I would greatly appreciate it. Thank you. I have attached a picture of my interactive quality plot.
I'm not sure there is any rhyme or reason to this:
Blanks will include reagent contamination, cross-contamination, index jumpers, and other nonsense like that, but those are mostly real reads, so there's no reason to assume they will be any noisier than reads from real samples. So I would not use the negative controls for optimizing dada2 parameters in this situation.
But let's see what others have to say: @colinbrislawn and others may have some good insight on this.
Just a thought: how many of the negative-control reads are filtered out prior to denoising? Is it possible that with the second (longer) truncation you're simply discarding more of those reads up front (hence the lower retention), and that with fewer of those reads around, the error model is more confident about discarding some of the rest as non-real reads? You might also benefit from doing a positive filter to see what portion of your negative-control reads are real reads.
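(If you kept the denoising-stats output from dada2, tabulating it will show, per sample, how many reads passed the initial quality filter versus how many survived denoising, merging, and chimera removal; the file name below is just a placeholder.)

qiime metadata tabulate \
  --m-input-file stats-first-trunc.qza \
  --o-visualization stats-first-trunc.qzv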
But I agree with @Nicholas_Bokulich: I wouldn't rely on your negative controls for optimizing dada2 parameters; there are too many unknowns...
They are supposed to have no reads, but clearly they do, so we need to understand where these reads are coming from to choose our next steps.
Based on people's ideas from that last thread and other threads, what do you think you should do? What do you think is the best way to handle the real reads that should not be in a negative control?
I've been quite fixated on the best way to filter potential contaminants based on negative controls in my low-biomass samples, and of course came across the thread you linked earlier this year. The debate is very interesting. Of all the offered solutions (e.g. decontam, subtraction, complete removal of groups appearing in controls), my feeling is that the method by which you deal with contaminants has to be based on the characteristics of the dataset in question (e.g. how many controls you have, what features you would actually expect to see in the environment you're investigating, etc.). Personally, I don't think an ideal, all-encompassing method exists yet.
Currently, my approach is to pull the feature table and taxonomy data out of QIIME 2 straight after the QC and taxonomic-assignment steps, and follow this with a thorough manual per-sample screening of the feature table against the negative controls, using a set of chosen cut-off parameters (of which I've tried numerous, e.g. if the number of reads for a single taxon is more than 2X higher in a control than in a sample, that feature is removed). This method is far from ideal, but given the lack of consensus it feels fit for purpose. The main problem with this method is that once you've manipulated the feature table, you can no longer recreate or use a phylogenetic tree as far as I'm aware (if anyone knows how, that would be fantastic), ruling out phylogenetic analyses, however useful they may be.
Anyone who has gone down the rabbit hole of how to deal with 'potential' contaminant removal in microbial ecology studies must have felt the same anguish as I have. I dream of a world in the not-so-distant future where budding microbial ecologists (and the old, withered ones) remove contaminants with ease, as opposed to spending their time considering (agonising over) the right method to choose, or, worse yet, ignoring the issue entirely (gasp).
I agree. And given that separating signal from noise is a standing problem in engineering, we might never have a perfect solution.
Let me remind you why a perfect solution is unreachable: we are trying to measure the real biological differences while ignoring the fake technical differences, but all of this is inside the same data stream. The signal and the noise are mixed. Species dispersion looks like sample contamination.
While the uncertainty can provoke anguish, there are things we can do well, so we should find solace there. 1) We can make sure to publish all our samples in a public database so others can use them. 2) We can answer people's questions and introduce new members to the community. 3) We can use standard, reproducible analysis methods so others can follow in our footsteps and improve on our work.
I want to highlight the power of reproducibility, especially in the face of uncertainty.
So... QIIME 2 now includes a quality-control plugin. Is it perfect? No. But you can try it on your samples, and the resulting .qza and .qzv files will include exactly the command you ran, so I can follow in your footsteps and improve on your method in the future.
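For example, if you import a fasta of suspected contaminant sequences as FeatureData[Sequence], something along these lines will split your representative sequences into hits and misses against it, and then drop the hits from your table; every output carries its full provenance (the file names and the 97% identity threshold are just placeholders):

# flag rep seqs that match the suspected contaminants
qiime quality-control exclude-seqs \
  --i-query-sequences rep-seqs.qza \
  --i-reference-sequences suspected-contaminants.qza \
  --p-method vsearch \
  --p-perc-identity 0.97 \
  --o-sequence-hits contaminant-hits.qza \
  --o-sequence-misses screened-rep-seqs.qza

# remove the flagged features from the feature table
qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file contaminant-hits.qza \
  --p-exclude-ids \
  --o-filtered-table screened-table.qza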
I can't say it's perfect, but I can say it's reproducible.
Uh oh... Does "manual" mean, by hand, in Excel?
How do you know you didn't make a mistake in Excel? Everyone makes mistakes in Excel!
Even worse, if your method is really good, how can I try it out on my own samples? If it were an R script or a qiime plugin, you could easily share it with others, use it for benchmarking, or improve it in the future.
It might not be the perfect solution for contaminant removal, but if it's reproducible, we can share it and improve it together!
We don't have a perfect solution for contaminant removal, but we do have many solutions for reproducible research. Before we worry about problems we can't solve, we should solve all the problems we can.
Colin
P.S. Pace yourself. Research takes time, and we are all figuring this out together.
Thanks for your help and replies! I appreciate this active discussion. It seems to me there is a general consensus around 3 solutions to this issue:
1. Removing all features that appear in the negative controls
2. Subtracting a constant from the entire feature table
3. Keeping relevant OTUs that appear to be cross-contamination, but discarding the obvious contaminants that are only present in the negative controls.
It seems like the 1st solution is the least ideal, since it removes the most abundant features. I believe the 2nd one is great, but I am not sure how to do it. A QIIME 2 artifact is a .qza file; I save the .qzv and then export the features to an Excel sheet through the QIIME 2 viewer. But in order to implement the 2nd solution, I would have to manipulate the .qza file itself, which does not seem plausible. If anyone knows a way of doing this, please let me know.
Thus, in my situation, I thought the 3rd approach might be the best. I will try to use the SourceTracker2 plugin on all my blanks and real samples and try to classify the sources.
Those options were discussed on separate threads, but I would not say there was "consensus" on all three being acceptable. In fact, I believe there was consensus on at least one being a bad idea. @colinbrislawn sums this up perfectly in pointing out that there is not a perfect solution; this is an open area of research and we all dream of a perfect future. We can, however, determine what is not a good solution in the near term.
The 1st option is a bad idea: as you note, it will often remove the most abundant features. This is because many of the features found in the negatives can in fact be cross-contaminants from true samples. So don't go this route.
A better solution when you have negative controls (I think the best I've seen so far) is to use the decontam R package. This will require exporting, decontam'ing, and then reimporting, so it is a pain, but it is much better than the other current solutions that have been discussed on this forum.
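Roughly, the round trip looks like this (run decontam, e.g. its isContaminant function with the prevalence method, in R on the exported table; all file names here are placeholders):

# export the feature table out of the .qza
qiime tools export \
  --input-path table.qza \
  --output-path exported-table

# convert the BIOM table to TSV for use in R
biom convert \
  -i exported-table/feature-table.biom \
  -o exported-table/feature-table.tsv \
  --to-tsv

# ...run decontam in R on feature-table.tsv, drop the flagged features,
# and write the result back out as feature-table-decontam.tsv...

# convert back to BIOM (HDF5) and reimport into QIIME 2
biom convert \
  -i exported-table/feature-table-decontam.tsv \
  -o feature-table-decontam.biom \
  --table-type "OTU table" \
  --to-hdf5

qiime tools import \
  --input-path feature-table-decontam.biom \
  --type 'FeatureTable[Frequency]' \
  --input-format BIOMV210Format \
  --output-path table-decontam.qza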
(Full disclosure: I have not personally tested or benchmarked this tool, but from a conceptual standpoint it makes the most sense. I am excited about it and have discussed with the developer, @benjjneb, getting it added as a :qiime2: plugin, so that may be the better future you dream of, at least for :qiime2: users.)
Yikes! The 2nd option is also not a good idea, because there is no reason to assume that contaminants will be "added" to all samples evenly, nor that sequencing depth will be even across samples. Besides, again, many of these "contaminants" in the negative controls will be cross-contaminants. I believe @mortonjt discussed this at greater length somewhere on the forum, but I cannot find it right now. Don't do this, either.
The 3rd option is the one that may have been agreed upon by "consensus" in another forum post. If you can look at your negative controls and rule out certain features as cross-contaminants, and determine with certainty that others are reagent contaminants, for example, then you can follow @colinbrislawn's advice above to filter out those specific known contaminants. But this requires very good knowledge of what features should and should not be in your samples, so it is fraught with risk and uncertainty.
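If you do go that route, keep it reproducible: put the feature IDs you have decided are contaminants in a metadata file (first column = feature ID) and filter them out of the table, e.g. (file names are placeholders):

qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file known-contaminant-ids.tsv \
  --p-exclude-ids \
  --o-filtered-table table-no-known-contaminants.qza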
Using sourcetracker in this way would require finding some well-curated reference datasets to determine different contaminant sources, including the sample types that you are analyzing, reagent contaminants, etc. The problem is that the features attributed to specific sources could still be valid observations, not the sort of contaminants that you should be filtering out.
E.g., if we detect fecal-associated features in sea water or soil, are these contaminants that we want to remove? Probably not. If we detect human skin-associated features on an indoor surface, are these contaminants that we want to remove? Still probably not.
Unless sourcetracker attributes these to an obvious source of contamination, e.g., known reagent contaminants, I would be wary about actually using it to filter all samples. I like sourcetracker and have used it a lot, but more for tracking putative microbial sources than for quality control.
But let's see what others have to say; maybe other forum users have used source tracker in this way and can give us some good guidance on proceeding.
One reason I like it is that it does not affect your biological samples in any way.
If these OTUs "are only present in negative controls," then they will not influence your alpha diversity values or your beta diversity values (if you are using UniFrac or Jaccard distances). In fact, you can produce identical alpha and beta values for your biological samples by not doing anything at all, and you still get to state that you filtered for contaminants in your methods section!
This is not a way to remove the 0.1% sample-to-sample crosstalk from the Illumina MiSeq.
This is not a way to remove microbes introduced by human handling.
This is the perfect way to address annoying reviewers.
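If you do want this filter recorded in your provenance instead of done by hand, it can be as simple as dropping the blank samples and then dropping the features that are now all zeros (this assumes a sample_type column in your metadata; all names are placeholders):

# remove the blank samples from the table
qiime feature-table filter-samples \
  --i-table table.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-where "[sample_type]='blank'" \
  --p-exclude-ids \
  --o-filtered-table table-no-blanks.qza

# drop features that no longer appear in any remaining sample
qiime feature-table filter-features \
  --i-table table-no-blanks.qza \
  --p-min-frequency 1 \
  --o-filtered-table table-no-blanks-no-empty-features.qza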
Colin
Edit: My main concern when filtering is that I will hurt my real samples, and this method is guaranteed not to do that. This method will not introduce bias into your analysis.