I am currently working with Illumina 2x250bp V4 region amplicon data.
My DADA2 trimming and truncation parameters were as follows:
First_truncation:
--p-trim-left-f 10
--p-trim-left-r 10
--p-trunc-len-f 240
--p-trunc-len-r 215
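For reference, these flags were passed to qiime dada2 denoise-paired, roughly as follows (the artifact names here are placeholders, not my actual files):

# First_truncation run
qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux-paired-end.qza \
  --p-trim-left-f 10 \
  --p-trim-left-r 10 \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 215 \
  --o-table table-first-trunc.qza \
  --o-representative-sequences rep-seqs-first-trunc.qza \
  --o-denoising-stats stats-first-trunc.qza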
As a result, First_truncation had 3114 features and second_truncation had 2905 features, so the shorter truncation produced a higher feature count, which is consistent with previous forum discussion on this matter.
But my question is about negative-control filtering. First_truncation retained 30% of the negative-control reads, whereas second_truncation retained only 20%. These negative controls are blanks, and they are supposed to contain mostly non-real reads. Thus, I thought maybe I should use second_truncation for the downstream analysis. Would anyone provide some insight into this matter and offer guidance? I would greatly appreciate it. Thank you. I have attached a picture of my interactive quality plot.
I'm not sure there is any rhyme or reason to this:
Blanks will include reagent contamination, cross-contamination, index jumpers, and other nonsense like that, but those are mostly real reads, so there's no reason to assume they will be any noisier than reads from real samples. So I would not use the negative controls for optimizing dada2 parameters in this situation.
But let's see what others have to say: @colinbrislawn and others may have some good insight on this.
Just a thought: how many of the negative-control reads are filtered out prior to denoising? Is it possible that with the second (longer) truncation you're simply discarding more of those reads up front (hence the lower retention), and that with fewer of those reads around, the error model is more confident about discarding some of the rest as non-real reads? You might also benefit from doing a positive filter to see what portion of your negative-control reads are real reads.
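(If you kept the denoising-stats output from dada2, tabulating it will show, per sample, how many reads passed the initial quality filter versus how many survived denoising, merging, and chimera removal; the file name below is just a placeholder.)

qiime metadata tabulate \
  --m-input-file stats-first-trunc.qza \
  --o-visualization stats-first-trunc.qzv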
But I agree with @Nicholas_Bokulich: I wouldn't rely on your negative controls for optimizing dada2 parameters; there are too many unknowns...
They are supposed to have no reads, but clearly they do, so we need to understand where these reads are coming from to choose our next steps.
Based on people's ideas from that last thread and other threads, what do you think you should do? What do you think is the best way to handle the real reads that should not be in a negative control?
I've been quite fixated on the best way to filter potential contaminants based on negative controls in my low-biomass samples, and of course came across the thread you linked earlier this year. The debate is very interesting. Of all the offered solutions (e.g. decontam, subtraction, complete removal of groups appearing in controls), my feeling is that the method by which you deal with contaminants has to be based on the characteristics of the dataset in question (e.g. how many controls you have, what features you would actually expect to see in the environment you're investigating, etc.). Personally, I don't think an ideal, all-encompassing method exists yet.
Currently, my approach is to pull the feature table and taxonomy data out of QIIME 2 straight after the QC and taxonomic-assignment steps, and follow this with a thorough manual per-sample screening of the feature table against the negative controls, using a set of chosen cut-off parameters (of which I've tried numerous, e.g. if the number of reads for a single taxon is more than 2X higher in a control than in a sample, that feature is removed). This method is far from ideal, but given the lack of consensus it feels fit for purpose. The main problem with this method is that once you've manipulated the feature table, you can no longer recreate or use a phylogenetic tree as far as I'm aware (if anyone knows how, that would be fantastic), ruling out phylogenetic analyses, however useful they may be.
Anyone who has gone down the rabbit hole of how to deal with 'potential' contaminant removal in microbial ecology studies must have felt the same anguish as I have. I dream of a world in the not-so-distant future where budding microbial ecologists (and the old, withered ones) remove contaminants with ease, as opposed to spending their time considering (agonising over) the right method to choose, or, worse yet, ignoring the issue entirely (gasp).
I agree. And given that separating signal from noise is a standing problem in engineering, we might never have a perfect solution.
Let me remind you why a perfect solution is unreachable: we are trying to measure the real biological differences while ignoring the fake technical differences, but all of this is inside the same data stream. The signal and the noise are mixed. Species dispersion looks like sample contamination.
While the uncertainty can provoke anguish, there are things we can do well, so we should find solace there. 1) We can make sure to publish all our samples in a public database so others can use them. 2) We can answer people's questions and introduce new members to the community. 3) We can use standard, reproducible analysis methods so others can follow in our footsteps and improve on our work.
I want to highlight the power of reproducibility, especially in the face of uncertainty.
So... QIIME 2 now includes a quality-control plugin. Is it perfect? No. But you can try it on your samples, and the resulting .qza and .qzv files will include exactly the command you ran, so I can follow in your footsteps and improve on your method in the future.
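For example, if you import a fasta of suspected contaminant sequences as FeatureData[Sequence], something along these lines will split your representative sequences into hits and misses against it, and then drop the hits from your table; every output carries its full provenance (the file names and the 97% identity threshold are just placeholders):

# flag rep seqs that match the suspected contaminants
qiime quality-control exclude-seqs \
  --i-query-sequences rep-seqs.qza \
  --i-reference-sequences suspected-contaminants.qza \
  --p-method vsearch \
  --p-perc-identity 0.97 \
  --o-sequence-hits contaminant-hits.qza \
  --o-sequence-misses screened-rep-seqs.qza

# remove the flagged features from the feature table
qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file contaminant-hits.qza \
  --p-exclude-ids \
  --o-filtered-table screened-table.qza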
I can't say it's perfect, but I can say it's reproducible.
Uh oh... Does "manual" mean, by hand, in Excel?
How do you know you didn't make a mistake in Excel? Everyone makes mistakes in Excel!
Even worse, if your method is really good, how can I try it out on my own samples? If it were an R script or a qiime plugin, you could easily share it with others, use it for benchmarking, or improve it in the future.
It might not be the perfect solution for contaminant removal, but if it's reproducible, we can share it and improve it together!
We don't have a perfect solution for contaminant removal, but we do have many solutions for reproducible research. Before we worry about problems we can't solve, we should solve all the problems we can.
Colin
P.S. Pace yourself. Research takes time, and we are all figuring this out together.
Thanks for your help and replies! I appreciate this active discussion. It seems to me there is a general consensus around 3 solutions to this issue:
1. Removing all features that appear in the negative controls
2. Subtracting a constant from the entire feature table
3. Keeping relevant OTUs that appear to be cross-contamination, but discarding the obvious contaminants that are only present in the negative controls.
It seems like the 1st solution is the least ideal, since it removes the most abundant features. I believe the 2nd one is great, but I am not sure how to do it. A QIIME 2 artifact is a .qza file; I save the .qzv and then export the features to an Excel sheet through the QIIME 2 viewer. But in order to implement the 2nd solution, I would have to manipulate the .qza file itself, which does not seem plausible. If anyone knows a way of doing this, please let me know.
Thus, in my situation, I thought the 3rd approach might be the best. I will try to use the SourceTracker2 plugin on all my blanks and real samples and try to classify the sources.
Those options were discussed on separate threads, but I would not say there was "consensus" on all three being acceptable. In fact, I believe there was consensus on at least one being a bad idea. @colinbrislawn sums this up perfectly in pointing out that there is not a perfect solution; this is an open area of research and we all dream of a perfect future. We can, however, determine what is not a good solution in the near term.
The 1st option is a bad idea: as you note, it will often remove the most abundant features. This is because many of the features found in the negatives can in fact be cross-contaminants from true samples. So don't go this route.
A better solution when you have negative controls (I think the best I've seen so far) is to use the decontam R package. This will require exporting, decontam'ing, and then reimporting, so it is a pain, but it is much better than the other current solutions that have been discussed on this forum.
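Roughly, the round trip looks like this (run decontam, e.g. its isContaminant function with the prevalence method, in R on the exported table; all file names here are placeholders):

# export the feature table out of the .qza
qiime tools export \
  --input-path table.qza \
  --output-path exported-table

# convert the BIOM table to TSV for use in R
biom convert \
  -i exported-table/feature-table.biom \
  -o exported-table/feature-table.tsv \
  --to-tsv

# ...run decontam in R on feature-table.tsv, drop the flagged features,
# and write the result back out as feature-table-decontam.tsv...

# convert back to BIOM (HDF5) and reimport into QIIME 2
biom convert \
  -i exported-table/feature-table-decontam.tsv \
  -o feature-table-decontam.biom \
  --table-type "OTU table" \
  --to-hdf5

qiime tools import \
  --input-path feature-table-decontam.biom \
  --type 'FeatureTable[Frequency]' \
  --input-format BIOMV210Format \
  --output-path table-decontam.qza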
(Full disclosure: I have not personally tested or benchmarked this tool, but from a conceptual standpoint it makes the most sense. I am excited about it and have discussed with the developer, @benjjneb, getting it added as a :qiime2: plugin, so that may be the better future you dream of, at least for :qiime2: users.)
Yikes! The 2nd option is also not a good idea, because there is no reason to assume that contaminants will be "added" to all samples evenly, nor that sequencing depth will be even across samples. Besides, again, many of these "contaminants" in the negative controls will be cross-contaminants. I believe @mortonjt discussed this at greater length somewhere on the forum, but I cannot find it right now. Don't do this, either.
The 3rd option is the one that may have been agreed upon by "consensus" in another forum post. If you can look at your negative controls and rule out certain features as cross-contaminants, and determine with certainty that others are reagent contaminants, for example, then you can follow @colinbrislawn's advice above to filter out those specific known contaminants. But this requires very good knowledge of what features should and should not be in your samples, so it is fraught with risk and uncertainty.
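If you do go that route, keep it reproducible: put the feature IDs you have decided are contaminants in a metadata file (first column = feature ID) and filter them out of the table, e.g. (file names are placeholders):

qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file known-contaminant-ids.tsv \
  --p-exclude-ids \
  --o-filtered-table table-no-known-contaminants.qza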
Using sourcetracker in this way would require finding some well-curated reference datasets to determine different contaminant sources, including the sample types that you are analyzing, reagent contaminants, etc. The problem is that the features attributed to specific sources could still be valid observations, not the sort of contaminants that you should be filtering out.
E.g., if we detect fecal-associated features in sea water or soil, are these contaminants that we want to remove? Probably not. If we detect human skin-associated features on an indoor surface, are these contaminants that we want to remove? Still probably not.
Unless sourcetracker attributes these to an obvious source of contamination, e.g., known reagent contaminants, I would be wary about actually using it to filter all samples. I like sourcetracker and have used it a lot, but more for tracking putative microbial sources than for quality control.
But let's see what others have to say; maybe other forum users have used source tracker in this way and can give us some good guidance on proceeding.
One reason I like it is that it does not affect your biological samples in any way.
If these OTUs "are only present in negative controls," then they will not influence your alpha diversity values or your beta diversity values (if you are using UniFrac or Jaccard distances). In fact, you can produce identical alpha and beta values for your biological samples by not doing anything at all, and you still get to state that you filtered for contaminants in your methods section!
This is not a way to remove the 0.1% sample-to-sample crosstalk from the Illumina MiSeq.
This is not a way to remove microbes introduced by human handling.
This is the perfect way to address annoying reviewers.
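If you do want this filter recorded in your provenance instead of done by hand, it can be as simple as dropping the blank samples and then dropping the features that are now all zeros (this assumes a sample_type column in your metadata; all names are placeholders):

# remove the blank samples from the table
qiime feature-table filter-samples \
  --i-table table.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-where "[sample_type]='blank'" \
  --p-exclude-ids \
  --o-filtered-table table-no-blanks.qza

# drop features that no longer appear in any remaining sample
qiime feature-table filter-features \
  --i-table table-no-blanks.qza \
  --p-min-frequency 1 \
  --o-filtered-table table-no-blanks-no-empty-features.qza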
Colin
Edit: My main concern when filtering is that I will hurt my real samples, and this method is guaranteed not to do that. This method will not introduce bias into your analysis.