Tutorial for filtering controls

Mechah · May 4, 2017, 4:53pm

Hi,
as different controls (sequencing controls, DNA extraction controls, sample blanks, field blanks, etc.) become more and more common in amplicon studies, I thought it might be helpful to provide a QIIME2 tutorial specifically dealing with filtering sequences/features of controls. I would suggest covering additional filtering options, e.g. subtracting control read counts from the counts in feature tables, or filtering RSVs coming from DADA2 in a similar fashion to the exclude_seqs_by_blast.py script from QIIME 1.9.1.
Let me know your opinions...

jairideout · May 5, 2017, 8:27pm

Hi @Mechah! qiime feature-table filter-features supports filtering out features (i.e. sequences) based on feature/sequence IDs or feature metadata. Check out the index-based filtering and metadata-based filtering sections of the filtering tutorial. The tutorial also covers qiime feature-table filter-samples for sample-based filtering in a similar manner.

The tutorial doesn't explicitly cover filtering of controls because it's intended to provide general filtering strategies. These filtering strategies can be used to perform the types of control filtering you're describing.

Do you still think it's worth having a "control"-based filtering tutorial or does the existing tutorial serve that purpose well enough?

Mechah · May 8, 2017, 7:36am

Hi @jairideout, the filtering tutorial is great for filtering features or sequences based on index or metadata. However, I would also like to have an option for filtering by subtraction. Often we have very low counts of a feature in our controls but very high counts of the same feature in our actual samples. So if I simply filter out every feature that is present in my controls, irrespective of abundance, I think I would introduce a bigger bias in some cases. Therefore I would recommend adding the possibility of subtracting feature tables (samples minus controls) and removing only those features whose counts become zero or negative.
I think this option would be a really helpful addition to the present filtering strategies. Together with a QIIME2 script similar to exclude_seqs_by_blast.py from QIIME 1.9.1, the filtering tutorial would then serve all needs for filtering controls and make processing the data much more convenient.
I would be really happy to see such additional options in a future QIIME2 release...
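
A minimal sketch of the samples-minus-controls subtraction proposed here, in plain Python. The function name and the toy counts are made up for illustration, and this is not an existing QIIME2 feature; it only shows the rule "subtract, then drop features that end up at zero or below".

```python
def subtract_controls(sample_counts, control_counts):
    """Subtract control counts per feature and keep only positive remainders.

    Both arguments are hypothetical {feature_id: count} dicts.
    """
    remaining = {}
    for feature, count in sample_counts.items():
        diff = count - control_counts.get(feature, 0)
        if diff > 0:  # features at zero or below are removed
            remaining[feature] = diff
    return remaining


# Toy data: ASV1 is rare in the control but abundant in the sample,
# ASV2 is more abundant in the control than in the sample.
sample = {"ASV1": 1200, "ASV2": 3, "ASV3": 40}
control = {"ASV1": 5, "ASV2": 7, "ASV4": 2}

filtered = subtract_controls(sample, control)
print(filtered)
```

With these toy numbers ASV1 survives with a slightly reduced count, while ASV2 is removed because the control holds more reads of it than the sample does.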

ebolyen · May 8, 2017, 6:06pm

I don't think you could just subtract outright as each sample is going to have a very different total frequency of OTUs. However, I wonder if you would be able to scale the control's counts to a sample's total frequency. It won't be perfect as there'll still be sequences which the control didn't see, but could have, if it had happened to have more sequencing depth, but that's a problem with a qualitative filter anyhow.
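
One possible reading of "scale the control's counts to a sample's total frequency", sketched in plain Python. The function name and toy numbers are hypothetical, and the choice to scale by the ratio of total counts is an assumption on my part, not an established method.

```python
def subtract_scaled_control(sample_counts, control_counts):
    """Scale the control to the sample's total frequency, then subtract.

    Inputs are hypothetical {feature_id: count} dicts. Features whose
    remainder is zero or negative are dropped.
    """
    sample_total = sum(sample_counts.values())
    control_total = sum(control_counts.values())
    if control_total == 0:
        return dict(sample_counts)  # nothing to subtract

    # Assumption: rescale so the control has the same depth as the sample.
    scale = sample_total / control_total

    remaining = {}
    for feature, count in sample_counts.items():
        expected = control_counts.get(feature, 0) * scale
        diff = count - expected
        if diff > 0:
            remaining[feature] = diff
    return remaining


sample = {"ASV1": 900, "ASV2": 100}
control = {"ASV1": 1, "ASV2": 9}

scaled = subtract_scaled_control(sample, control)
print(scaled)
```

Here the control is scaled up 100-fold, so ASV2 (90% of the control) is removed while ASV1 is only reduced. Features the control never picked up at its own depth are left untouched, which is the limitation mentioned above.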

@mortonjt does a lot of compositional statistics; he would have a better grasp than I do of what could make sense here.

mortonjt · May 9, 2017, 12:50am

Right. Given what @ebolyen suggested - I'd strongly recommend against straight up subtracting the reads. For one, the reads are unevenly distributed across the samples. In addition, since we are most concerned with the proportion of reads, it's not clear how to best account for these confounders.

If we are dealing with typical lab contaminants - the most straightforward approach is to identify them, and just remove the columns associated with that contaminant.
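
As a rough illustration in plain Python, removing identified contaminants amounts to dropping those feature IDs from every sample. The table layout (a dict of per-sample dicts) and all names below are hypothetical.

```python
def drop_features(table, contaminants):
    """Return a copy of the table without the listed contaminant features.

    table: hypothetical {sample_id: {feature_id: count}} mapping.
    contaminants: set of feature IDs identified as contaminants.
    """
    return {
        sample_id: {f: c for f, c in features.items() if f not in contaminants}
        for sample_id, features in table.items()
    }


table = {
    "S1": {"ASV1": 10, "ASV2": 3},
    "S2": {"ASV2": 1, "ASV3": 8},
}
cleaned = drop_features(table, contaminants={"ASV2"})
print(cleaned)
```

In QIIME2 itself the same idea maps onto the ID- or metadata-based filtering with qiime feature-table filter-features mentioned earlier in the thread.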

If you think that you are dealing with some sort of bias that is widely distributed across your samples, check out some of the compositional statistics available in skbio and gneiss. Particularly if you have some sort of prior information about your bias.
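
To give a flavour of what a compositional approach looks like, here is a plain-Python sketch of the centered log-ratio (CLR) transform, one basic building block in this area. It is a stand-in written for illustration, not the skbio or gneiss code; the pseudocount used to avoid log(0) is an assumption and only one of several ways to handle zeros.

```python
import math


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of one sample's counts.

    pseudocount is an assumed, simple way to deal with zero counts.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Each value is the log abundance relative to the sample's geometric mean.
    return [v - mean_log for v in logs]


values = clr([0, 9, 99])
print(values)
```

The transformed values sum to zero within a sample, so they describe each feature relative to the others in that sample and do not depend on sequencing depth in the way raw counts do.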

Mechah · May 9, 2017, 8:13am

@ebolyen and @mortonjt
Thanks for your input!
To clarify: subtracting frequencies would certainly only make sense if you apply it to normalized / rarefied data, so that you work with even sequencing depths.
For built environment studies, typical lab contaminants (DNA extraction kits, sampling devices, PCR controls, sequencing controls etc.) very often resemble the microbial communities found in low-biomass (built) environments. So it's really hard to identify contaminants in such cases.
As you recommended - I'll check out those compositional statistics in skbio and gneiss.
Thanks for the discussion!

Jia · December 27, 2018, 4:02pm

Hi, Mechah

I want to do the same filtering for collection blanks, sample blanks and field blanks as you mentioned last year. Just wondering how you got on with this analysis in the end.

Thanks,

Jia

Mechah · January 10, 2019, 9:00am

Hi Jia,

sorry for my late reply due to the Christmas break…

How to handle controls in our NGS projects is still a matter of debate in our lab. However, from my point of view I would suggest doing the following (I’ll only tackle data handling and won’t mention wet lab methods to get rid of the “kitome” in this post):

First of all you should process as many controls as possible. I would consider negative controls (NTCs, field blanks etc.), positive controls (DNA of mock communities), and DNA extraction controls as a minimum.

Then I would process the data of my biological samples in parallel with these controls. The next step is to check the composition (alpha and beta diversity) of your controls and how they relate to your biological samples.
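
A small plain-Python sketch of one way to check how a control relates to a biological sample: Bray-Curtis dissimilarity on relative abundances. The helper names and toy counts are hypothetical, and in practice you would use the diversity tools in QIIME2 instead of hand-rolled code.

```python
def to_proportions(counts):
    """Convert a hypothetical {feature_id: count} dict to relative abundances."""
    total = sum(counts.values())
    return {f: c / total for f, c in counts.items()} if total else {}


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two {feature_id: abundance} dicts."""
    features = set(a) | set(b)
    numerator = sum(abs(a.get(f, 0) - b.get(f, 0)) for f in features)
    denominator = sum(a.values()) + sum(b.values())
    return numerator / denominator if denominator else 0.0


control = to_proportions({"ASV1": 5, "ASV2": 5})
sample = to_proportions({"ASV1": 500, "ASV3": 500})

distance = bray_curtis(control, sample)
print(distance)
```

A value near 1 means the control and the sample share almost nothing, while a value near 0 means they look alike, which is the situation where filtering decisions get difficult.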

If your controls are very dissimilar from your biological samples then you could use them as a baseline or control in the frame of your whole analysis. You could use tools like LEfSe, MaAsLin, ancom, gneiss etc. to investigate the composition of your controls and maybe define a “kitome” for your study. Your positive controls can be processed with q2-quality-control to understand the quality of your sequencing data. Finally, if there is no overlap between your biological samples and your controls, you do not have to filter them from your data, but you should describe them in your study.
If your controls are similar to your biological samples the hard work starts. We often work in low-biomass environments and therefore have to judge whether a certain ASV (amplicon sequence variant) present in both sample types (biological samples and controls) makes sense from a microbial ecologist's point of view. At the moment I usually use two main methods to filter controls: first the tool decontam, and then subtraction of normalized ASV tables (for instance if we work with skin samples and see an ASV assigned to Staphylococcus aureus or any other typical skin bacteria). Then I compare the analysis of my filtered data with that of the original data. Depending on the results, I usually include both analyses in a manuscript. Sometimes it makes sense to show your controls in relation to your biological samples and sometimes it is better to show data that is based on a filtered dataset.

It is just really important that the readers of your paper understand how and why you filtered your data.

You could also check out the three publications below:

https://www.nature.com/articles/s41564-018-0202-y

https://msystems.asm.org/content/3/3/e00218-17?utm_source=TrendMDmSystems&utm_medium=TrendMDmSystems&utm_campaign=trendmdalljournals_0

Hope I could help and I’m looking forward to any comments!

Cheers, Mechah