DADA2 truncation parameter and negative control filtering

Dear all,

I am currently working with Illumina 2x250bp V4 region amplicon data.

My DADA2 truncation and trimming parameters were the following:
first_truncation:
--p-trim-left-f 10
--p-trim-left-r 10
--p-trunc-len-f 240
--p-trunc-len-r 215

second_truncation:
--p-trim-left-f 10
--p-trim-left-r 10
--p-trunc-len-f 240
--p-trunc-len-r 240
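
For reference, the full command I ran looked roughly like this (a sketch with placeholder artifact names; the two runs differ only in --p-trunc-len-r):

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trim-left-f 10 \
  --p-trim-left-r 10 \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 215 \
  --o-table first-truncation-table.qza \
  --o-representative-sequences first-truncation-rep-seqs.qza \
  --o-denoising-stats first-truncation-stats.qza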

As a result, first_truncation yielded 3114 features and second_truncation yielded 2905 features, so the shorter truncation produced more features, which is consistent with previous forum discussions on this matter.

But my question is about negative control filtering. first_truncation retained 30% of the negative control reads, whereas second_truncation retained only 20%. These negative controls are blanks, so they are supposed to contain mostly non-real reads. Thus, I thought maybe I should be using second_truncation for the downstream analysis. Would anyone provide insight into this matter and give some guidance? I would greatly appreciate it. Thank you. I have attached a picture of my interactive quality plot.

I'm not sure there is any rhyme or reason to this:

Blanks will include reagent contamination, cross-contamination, index hopping, and other nonsense like that, but those are mostly real reads, so there is no reason to assume that they will be noisier than reads from real samples. So I would not use the negative controls for optimizing dada2 parameters in this situation.

But let's see what others have to say: @colinbrislawn and others may have some good insight on this.

I hope that helps!

3 Likes

Just a thought: how much of the negative-control data is filtered out prior to denoising? Is it possible that with the second (longer) truncation you are simply discarding more of those reads at the quality-filtering step (hence less is retained), and with fewer of those reads around, the error model is more confident in discarding some of the remainder as non-real reads? You might also benefit from doing a positive filter to see what portion of your negative-control reads are real reads.
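
If you decide to try that positive filter, one rough way to do it (assuming you already have a taxonomy.qza; the file names here are placeholders, and the p__ include assumes Greengenes/SILVA-style taxonomy strings) would be something like:

# keep only features with at least a phylum-level assignment, drop organelle reads
qiime taxa filter-table \
  --i-table table.qza \
  --i-taxonomy taxonomy.qza \
  --p-include p__ \
  --p-exclude mitochondria,chloroplast \
  --o-filtered-table table-positive-filtered.qza

# summarize and compare the per-sample counts for your blanks before and after
qiime feature-table summarize \
  --i-table table-positive-filtered.qza \
  --m-sample-metadata-file sample-metadata.tsv \
  --o-visualization table-positive-filtered.qzv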

But I agree with @Nicholas_Bokulich, I wouldn't rely on your negative controls for optimizing dada2 parameters; there are too many unknowns…

1 Like

Hello!

There was a big argument/discussion a while back about negative controls and what to do about contamination. What do you think about all of this?


They are supposed to have no reads, but clearly they do, so we need to understand where these reads are coming from to choose our next steps.

Based on people's ideas from that last thread and other threads, what do you think you should do? What do you think is the best way to handle the real reads that should not be in a negative control?

Colin

1 Like

Hi Colin,

I've been quite fixated on the best way to filter potential contaminants based on negative controls in my low-biomass samples, and of course came across the thread you linked earlier this year. The debate is very interesting. Of all the offered solutions (e.g. decontam, subtraction, complete removal of groups appearing in controls), my feeling is that the method by which you deal with contaminants has to be based on the characteristics of the dataset in question, e.g. how many controls you have, what features you would actually expect to see in the environment you're investigating, etc. Personally I don't think an ideal, all-encompassing method exists yet.

Currently, my approach is to pull the feature table and taxonomy data out of QIIME 2 straight after the QC and taxonomic assignment steps, and follow this with a thorough manual per-sample screening of the feature table against the negative controls, using a set of chosen cut-off parameters (of which I've tried numerous, e.g. if a single taxon is present at more than 2x the abundance in a control compared to a sample, that feature is removed). This method is far from ideal, but given the lack of consensus it feels fit for purpose. The main problem is that once you've manipulated the feature table outside of QIIME 2, you can no longer recreate or use a phylogenetic tree as far as I'm aware (if anyone knows how, that would be fantastic), ruling out phylogenetic analyses, however useful they may be.
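
One partial workaround I've been toying with (untested, and the file names below are just placeholders) is to do the screening itself outside QIIME 2, but then apply the final list of feature IDs back to the original artifacts with the filtering plugins, rather than carrying on with the edited spreadsheet. My hope is that the filtered table and rep-seqs then stay as .qza files, so a tree can still be built from the filtered rep-seqs, though I'd welcome correction if that doesn't actually solve the tree problem:

# contaminant-ids.tsv: a single column of feature IDs to remove (header 'feature-id')
qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file contaminant-ids.tsv \
  --p-exclude-ids \
  --o-filtered-table table-screened.qza

qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --m-metadata-file contaminant-ids.tsv \
  --p-exclude-ids \
  --o-filtered-data rep-seqs-screened.qza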

Anyone who has gone down the rabbit hole of how to deal with 'potential' contaminant removal in microbial ecology studies must have felt the same anguish as I have. I dream of a world in the not-so-distant future where budding microbial ecologists (and the old withered ones) remove contaminants with ease, as opposed to spending their time considering (agonising over) the right method to choose, or worse yet, ignoring the issue entirely (gasp).

Best wishes,
Lewis

3 Likes

Hello Lewis,

Thanks for 'qiime-ing' in on this discussion.

I agree. And given that signal-to-noise ratio is a standing problem in engineering, we might never have a perfect solution.

Let me remind you why a perfect solution is unreachable: we are trying to measure the real biological differences while ignoring the technical differences, but all of this is inside the same data stream. The signal and the noise are mixed. Species dispersion looks like sample contamination. :man_shrugging:

While the uncertainty can invoke anguish, there are things we can do well, so we should find solace there. 1) We can make sure to publish all our samples in a public database so others can use them. 2) We can answer people's questions and introduce new members to the community. 3) We can use standard, reproducible analysis methods so others can follow in our footsteps and improve on our work.


I want to highlight the power of reproducibility, especially in the face of uncertainty.

So... QIIME 2 now includes a plugin for doing quality control (q2-quality-control). Is it perfect? No. But you can try it on your samples, and the resulting .qza and .qzv files will include exactly the command you ran, so I can follow in your footsteps and improve on your method in the future.
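
For example (treat this as a sketch with placeholder file names), exclude-seqs screens your representative sequences against a reference of your choosing and splits them into hits and misses, and every output records its own provenance:

qiime quality-control exclude-seqs \
  --i-query-sequences rep-seqs.qza \
  --i-reference-sequences reference-seqs.qza \
  --p-method vsearch \
  --p-perc-identity 0.97 \
  --o-sequence-hits hits.qza \
  --o-sequence-misses misses.qza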

I can't say it's perfect, but I can say it's reproducible.

Uh oh... Does "manual" mean, by hand, in Excel?
How do you know you didn't make a mistake in Excel? Everyone makes mistakes in Excel! :scream_cat:

https://www.washingtonpost.com/news/wonk/wp/2016/08/26/an-alarming-number-of-scientific-papers-contain-excel-errors/?noredirect=on&postshare=4161472211255740&utm_term=.06c928124158

Even worse, if your method is really good, how can I try it out on my own samples? If it were an R script or a qiime plugin, you could easily share it with others, use it for benchmarking, or improve it in the future.

It might not be the perfect solution for contaminant removal, but if it's reproducible, we can share it and improve it together!


We don't have a perfect solution for contaminant removal, but we do have many solutions for reproducible research. Before we worry about problems we can't solve, we should solve all the problems we can.
:recycle: :microscope:

Colin


P.S. Pace yourself. Research takes time, and we are all figuring this out together. :+1:

3 Likes

Dear Colin and all,

Thanks for your help and replies! I appreciate this active discussion. It seems to me that a general consensus has formed around 3 possible solutions to this issue.

  1. Removing all features that appear in the negative controls
  2. Subtracting a constant from the entire feature table
  3. Keeping OTUs that appear to be cross-contamination, but discarding the obvious contaminants that are only present in the negative controls

It seems like the 1st solution is the least ideal, since it removes the most abundant features. I believe the 2nd one is promising, but I am not sure how to do it. The QIIME 2 artifact is a .qza file; I save the .qzv and then export the features to an Excel sheet through QIIME 2 View. But in order to implement the 2nd solution, I would have to manipulate the .qza file itself, which does not seem to be possible. If anyone knows a way of doing this, please let me know.

Thus, in my situation, I thought the 3rd approach may be the best. I will try to run SourceTracker2 on all of my blanks and real samples and try to classify the sources of the reads.

1 Like

Dear Mehrbod,

For first_truncation:

sample-id input filtered denoised merged non-chimeric % retained
NC1 6489 1253 1253 1009 1009 15.55
NC10 12087 4370 4370 3756 3756 31.07
NC11 11596 4245 4245 3721 3721 32.09
NC12 10959 3620 3620 3131 3131 28.57
NC13 8017 2762 2762 2466 2466 30.76
NC2 13047 4224 4224 3607 3607 27.65
NC3 11264 4188 4188 3684 3684 32.71
NC4 7867 2930 2930 2429 2429 30.88
NC5 10304 3314 3314 2934 2934 28.47
NC6 6618 2094 2094 1762 1762 26.62
NC7 10678 3705 3705 3251 3251 30.45
NC8 6863 2484 2484 2280 2280 33.22
NC9 16593 4407 4407 3783 3783 22.80

For second_truncation:

sample-id input filtered denoised merged non-chimeric % retained
NC1 6489 927 927 681 681 10.49
NC10 12087 3112 3112 2629 2629 21.75
NC11 11596 2985 2985 2419 2419 20.86
NC12 10959 2432 2432 1928 1928 17.59
NC13 8017 2080 2080 1772 1772 22.10
NC2 13047 2957 2957 2516 2516 19.28
NC3 11264 3336 3336 2909 2909 25.83
NC4 7867 2165 2165 1801 1801 22.89
NC5 10304 2518 2518 1999 1999 19.40
NC6 6618 1571 1571 1301 1301 19.66
NC7 10678 2805 2805 2374 2374 22.23
NC8 6863 1834 1834 1572 1572 22.91
NC9 16593 2992 2992 2423 2423 14.60

Thanks for your input!

2 Likes

Those options were discussed on separate threads, but I would not say there was "consensus" on all three being acceptable. In fact, I believe there was consensus on at least one being a bad idea. @colinbrislawn sums this up perfectly in pointing out that there is not a perfect solution; this is an open area of research and we all dream of a perfect future. We can, however, determine what is not a good solution in the near term.

Option 1 (removing all features that appear in the negative controls): this is a bad idea. As you note, it will often remove the most abundant features, because many of the features found in the negatives can in fact be cross-contaminants from true samples :disappointed:. So don't go this route.

A better solution when you have negative controls (I think the best I've seen so far) is to use the decontam R package. This will require exporting, running decontam, and then re-importing, so it is a bit of a pain. But it is much better than the other current solutions that have been discussed on this forum.

(full disclosure: I have not personally tested or benchmarked this tool, but from a conceptual standpoint it makes the most sense. I am excited about it and have discussed with the developer @benjjneb getting this added as a :qiime2: plugin, so that may be the better future you dream of, at least for :qiime2: users)
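
To give a rough sense of the round trip (a sketch only, with placeholder file names, and exact flags may differ between QIIME 2 versions; as noted above I have not benchmarked this myself):

# 1. export the feature table out of QIIME 2
qiime tools export \
  --input-path table.qza \
  --output-path exported-table
biom convert \
  -i exported-table/feature-table.biom \
  -o exported-table/feature-table.tsv \
  --to-tsv

# 2. in R: run decontam's isContaminant() on that table
#    (e.g., method = "prevalence", with a logical vector marking which samples are blanks)
#    and write the feature IDs flagged as contaminants to contaminant-ids.tsv
#    (a single column with header 'feature-id')

# 3. remove those features from the original artifact, keeping provenance intact
qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file contaminant-ids.tsv \
  --p-exclude-ids \
  --o-filtered-table table-decontam.qza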

Option 2 (subtracting a constant from the entire feature table): yikes! Also not a good idea, because there is no reason to assume that contaminants will be "added" to all samples evenly, nor that sequencing depth will be even across samples. Besides, again, many of these "contaminants" in the negative controls will be cross-contaminants. I believe @mortonjt discussed this at more length somewhere on the forum but I cannot find it right now. Don't do this, either.

Option 3 (keeping likely cross-contaminants but discarding features only present in negative controls): this is the one that may have been agreed upon by "consensus" in another forum post. If you can look at your negative controls and rule out certain features as cross-contaminants, and determine with confidence that others are reagent contaminants, for example, then you can follow @colinbrislawn's advice above to filter out specific known contaminants. But this requires very good knowledge of which features should or should not be in your samples, so it is fraught with risk and uncertainty.

As for your SourceTracker2 plan: using sourcetracker in this way would require finding some well-curated reference datasets to characterize the different contaminant sources, including the sample types that you are analyzing, reagent contaminants, etc. The problem is that features attributed to specific sources could still be valid observations, not the sort of contaminants that you should be filtering out.

E.g., if we detect fecal-associated features in sea water or soil, are these contaminants that we want to remove? Probably not. If we detect human skin-associated features on an indoor surface, are these contaminants that we want to remove? Still probably not.

Unless sourcetracker attributes these to an obvious source of contamination, e.g., known reagent contaminants, I would be wary about actually using it to filter all samples. I like sourcetracker and have used it a lot, but more for tracking putative microbial sources, not for quality control.

But let's see what others have to say; maybe other forum users have used source tracker in this way and can give us some good guidance on proceeding.

2 Likes

Yeah @Nicholas_Bokulich, @lewisNU's manual approach and almost all of the suggestions I have read since 2014/15 about reducing contaminants to a bare minimum are 'somehow' incorporated into the methodology of decontam. An earlier method was described in this paper: doi.org/10.1186/s40168-015-0083-8

I actually believe the decontam authors looked at all of this when developing the package.

Hey folks,

I really like this idea of discarding only the features that appear exclusively in the negative controls!

One reason I like it is that it does not affect your biological samples in any way.

If these OTUs "are only present in negative controls" then they will not be influencing your alpha diversity values or your beta diversity values (if using UniFrac or Jaccard distances). In fact, you can produce identical alpha and beta values from your biological samples by not doing anything at all, and you get to state that you filtered for contaminants in your methods section!

This is not a way to remove the 0.1% sample-to-sample crosstalk from the Illumina MiSeq.
This is not a way to remove microbes introduced by human handling.

This is the perfect way to address annoying reviewers. :+1:

Colin

Edit: My main concern when filtering is that I will hurt my real samples, and this method is guaranteed not to do that. This method will not introduce bias into your analysis.

3 off-topic replies have been split into a new topic: Discussion: methods for removing contaminants and cross-talk

Please keep replies on-topic in the future.

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.