Thanks again, @benjjneb and @jordenrabasco, for the tool. I had a situation where quite a few features were not removed from the PCR and DNA extraction controls, and I would be thankful for your feedback.
This was based on the prevalence method. Here are the numbers of features before and after decontam for the PCR ("NTC") and DNA extraction ("KITBLANK") blanks, as well as the histogram of the scores. The plot was made with an initial threshold of 0.1; the threshold was later increased to 0.5 for the actual filtering, which is where the before/after numbers come from.
Would the remaining features in the controls really be spillover sequences from the actual samples, since their distribution is closer to that in the actual samples? If not, how should this be interpreted? There is also one extraction control in which none of the sequences (290) were flagged as contaminants. I find it odd that one extraction control would be so heavily contaminated by actual samples that none of its sequences are marked as contaminants or, conversely, that the kit contamination was so evident in all samples that the kit-ome sequences are not recognized as contaminants. This does not happen with any of the other extraction blanks. What would your interpretation be? What could the remaining sequences be, and where do they come from? Thank you.
Unfortunately, I have no DNA input concentration data, so I can't use both methods, which I suppose would give better results.
PS. Along the lines of identifying contaminants based on frequency distributions: do you think it would be worthwhile, or even possible, to try to find biological contamination with your tools? E.g. animal tissue samples stored in seawater, then use seawater samples as controls to remove features more present in or coming from seawater, as opposed to animal tissue?
Update: In addition to removing features with decontam, I later remove the control samples from the abundance table with feature-table filter-samples, and then use the new abundance table to filter the features accordingly with feature-table filter-seqs. I believe this removes all ASVs for which the remaining samples have a count of 0. This removed roughly 100 additional features. If none of these were present in the real samples at all (i.e. they had a count of 0 there), why were they not labelled as contaminants by decontam, given that they occurred only in the control samples?
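To make the filtering step above concrete, here is a minimal Python sketch of the idea, not of the QIIME 2 commands themselves. The function name `drop_controls_and_empty_features`, the dict-of-dicts table layout, and the toy counts are all made up for illustration.

```python
def drop_controls_and_empty_features(table, control_ids):
    """Remove control samples, then drop features left with a total count of 0.

    table: dict mapping feature ID -> dict mapping sample ID -> count
           (a hypothetical stand-in for an abundance table).
    control_ids: set of sample IDs to remove.
    """
    filtered = {}
    for feature_id, counts in table.items():
        # Keep only the non-control samples for this feature.
        kept = {s: c for s, c in counts.items() if s not in control_ids}
        # A feature seen only in controls now sums to 0 and is dropped.
        if sum(kept.values()) > 0:
            filtered[feature_id] = kept
    return filtered


# Toy data: ASV2 occurs only in the control "NTC1".
table = {
    "ASV1": {"S1": 10, "S2": 4, "NTC1": 0},
    "ASV2": {"S1": 0, "S2": 0, "NTC1": 7},
    "ASV3": {"S1": 3, "S2": 0, "NTC1": 2},
}
filtered = drop_controls_and_empty_features(table, {"NTC1"})
print(sorted(filtered))  # ['ASV1', 'ASV3']
```

In this toy table, ASV2 disappears only because its sole non-zero count was in the removed control, which is the kind of feature the question above is about.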
Hi @lxsteiner, thanks for using the plugin!
I cannot tell whether the features in your controls are spillover sequences from your other samples, as I would need to see all of your other ASVs, their prevalences, etc. However, that is a possibility. I am not sure I understand what you mean by the “kitome sequences”. Are you assuming that the high ASV count in your controls comes from the kitome? I agree that you have a large number of ASVs in your controls. How did you resolve your ASVs? What were the prevalences of the ASVs identified in your highly contaminated sample? Do you have any concentration information, either from before the PCR step or prior to library pooling?
As for finding biological contamination with these tools, I think it would be possible to remove those seawater features given the experimental outline you gave. However, check that the reads identified as contaminants are what you would expect from seawater.
Are you saying that features that only appeared in the negative controls weren’t removed? The prevalence method often needs a prevalence of at least two to calculate a score; otherwise it gives an NA score, and the sequence is then assumed to be a true feature. The prevalence method relies on contaminant ASVs being more prevalent in control samples than in experimental samples. Therefore, contaminant ASVs that appeared in only one control may not be identified as contaminants by Decontam.
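As a rough illustration of the point about NA scores, here is a simplified Python sketch. It is not Decontam's actual implementation: it assumes a plain one-sided Fisher's exact test on presence/absence, and the function name `prevalence_score` and the toy data are hypothetical. It returns `None` (standing in for NA) when a feature is present in fewer than two samples.

```python
from math import comb


def prevalence_score(present, is_control):
    """Simplified prevalence score (assumption: plain one-sided Fisher test).

    present: list of bools, True if the feature was detected in that sample.
    is_control: list of bools, True if that sample is a negative control.
    Returns None when no score can be computed, otherwise the probability
    of seeing at least this many control detections by chance.
    """
    n_total = len(present)
    n_present = sum(present)
    n_control = sum(is_control)
    # Too little information: feature seen in < 2 samples, or only one group.
    if n_present < 2 or n_control == 0 or n_control == n_total:
        return None
    in_controls = sum(1 for p, c in zip(present, is_control) if p and c)
    # Hypergeometric upper tail: P(X >= in_controls).
    upper = min(n_control, n_present)
    tail = sum(
        comb(n_control, i) * comb(n_total - n_control, n_present - i)
        for i in range(in_controls, upper + 1)
    )
    return tail / comb(n_total, n_present)


# Toy design: 3 negative controls followed by 5 real samples.
is_control = [True, True, True, False, False, False, False, False]
in_all_controls = [True, True, True, False, False, False, False, False]
in_one_control = [True, False, False, False, False, False, False, False]
in_all_samples = [False, False, False, True, True, True, True, True]

print(prevalence_score(in_all_controls, is_control))  # 1/56, about 0.018
print(prevalence_score(in_one_control, is_control))   # None
print(prevalence_score(in_all_samples, is_control))   # 1.0
```

With a threshold of 0.5, the first feature would be flagged in this sketch, while the feature seen in a single control gets no score at all and so would be kept as a true feature.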
Thank you for the feedback. I realized that the "features" I extracted above actually refer to the read number and not the actual frequency or feature count; I was misled by the similar naming.
I will extract the correct numbers and look into their abundance between real and control samples to answer the points you raised, and report back. Sorry for the confusion in the meantime.