Abundance filter for bacterial community analysis

Kat · August 29, 2019, 11:28am

Good day,

After processing my 16S rRNA dataset with QIIME2, I am now exploring and analysing the data. I was about to filter ASVs with low overall relative abundance using the method recommended by Bokulich et al. 2013 (with a threshold c = 0.005%).

I noticed that changing this threshold from 0.005% to 0.0005% or 0.0001% dramatically altered the results (from 569 ASVs retained to 3875 or 9671 ASVs retained). I therefore wanted to ask for advice: I understand that filtering the data is important to avoid spurious sequences, but I would like to avoid losing too much information.
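
To make the threshold concrete, here is a minimal Python sketch of a total-abundance filter of this kind. It is only an illustration, not the QIIME 2 implementation: the function name, the dict-of-lists table layout, and the toy counts are all made up for the example, and it assumes a feature is kept when its summed count is at least c% of all counts in the table.

```python
def filter_by_total_abundance(table, threshold_percent=0.005):
    """Keep features whose summed count is at least threshold_percent
    of the total count in the table.

    table: dict mapping feature ID -> list of per-sample counts
           (hypothetical layout, chosen only for this sketch).
    """
    total = sum(sum(counts) for counts in table.values())
    min_count = total * threshold_percent / 100.0
    return {
        fid: counts
        for fid, counts in table.items()
        if sum(counts) >= min_count
    }


# Toy table with 100,000 reads in total (invented numbers).
toy_table = {
    "ASV1": [50000, 49990],  # 99,990 reads
    "ASV2": [6, 0],          # 6 reads
    "ASV3": [2, 2],          # 4 reads
}

# c = 0.005% of 100,000 reads is 5 reads, so ASV3 is dropped.
kept_strict = filter_by_total_abundance(toy_table, threshold_percent=0.005)

# c = 0.0005% is 0.5 reads, so every feature passes.
kept_loose = filter_by_total_abundance(toy_table, threshold_percent=0.0005)

print(sorted(kept_strict))
print(sorted(kept_loose))
```

Because the cut-off scales with the total read count, lowering c by a factor of ten lowers the minimum count per feature by the same factor, which is why the number of retained ASVs changes so sharply.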

Also, I came across another discussion saying that it is not necessary to filter the data if dada2 is used for denoising: Alpha-diversity after filtering - #2 by Nicholas_Bokulich

Please excuse me if this is a stupid question, but I am not sure I understand why abundance filtering is not necessary after processing data with dada2. As I did exactly that, should I actually not be filtering my data?

Thank you so much for your help!

Best wishes,

Kat

jwdebelius · August 29, 2019, 1:37pm

Hi @Kat,

This is a complex question, and definitely not a stupid one!

I think abundance filtering and thresholding depend a lot on your question and somewhat on your method. Denoising methods (DADA2 and Deblur) both pre-filter their ASV tables and require at least 10 counts for a feature to be retained. This wasn't a requirement in OTU picking methods, and as a result, the filtering got implemented.

For me, personally, I tend not to filter at all before my diversity analyses because I want to capture as much as possible (especially important for metrics or techniques which are sensitive to rare features).

But I like to filter before I do a feature-based analysis, simply because my signal-to-noise ratio improves. So, I tend to be relatively aggressive in my filtering and try to capture features that are prevalent (present in more than 5-10% of samples) and reasonably abundant (I often set this based on a rarefaction cut-off); although, to be honest, I tend to use a composite threshold that isn't implemented here. I do this under the assumption that I'm underpowered to detect things with those features anyway (because one observation in one group does not a difference make).
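
As a rough illustration of combining a prevalence cut-off with an abundance cut-off, a sketch in plain Python might look like the following. This is not the composite threshold mentioned above (which is not spelled out here); the function name, the table layout, the 10% prevalence value, and the minimum total count of 10 are all assumptions picked for the example.

```python
def filter_by_prevalence_and_abundance(table, min_prevalence=0.10,
                                       min_total_count=10):
    """Keep features seen in more than min_prevalence of samples
    and with a summed count of at least min_total_count.

    table: dict mapping feature ID -> list of per-sample counts
           (hypothetical layout; both thresholds are illustrative).
    """
    kept = {}
    for fid, counts in table.items():
        # Fraction of samples in which the feature was observed.
        prevalence = sum(1 for c in counts if c > 0) / len(counts)
        if prevalence > min_prevalence and sum(counts) >= min_total_count:
            kept[fid] = counts
    return kept


# Toy table with 20 samples (invented numbers).
toy_table = {
    "ASV_a": [20] * 10 + [0] * 10,  # 50% prevalence, 200 reads
    "ASV_b": [500] + [0] * 19,      # 5% prevalence, 500 reads
    "ASV_c": [1] * 4 + [0] * 16,    # 20% prevalence, 4 reads
}

kept = filter_by_prevalence_and_abundance(toy_table)
print(list(kept))
```

ASV_b is abundant but shows up in a single sample, and ASV_c is spread across samples but very rare, so only ASV_a survives both criteria. In QIIME 2, single-criterion versions of these filters are available through `qiime feature-table filter-features` (`--p-min-samples`, `--p-min-frequency`).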

Best,
Justine

Nicholas_Bokulich · August 29, 2019, 2:46pm

Just want to chime in here: those thresholds were designed for OTU-clustered data and have not been re-evaluated for denoised sequences. dada2 and deblur developers did their own benchmarks and have their own abundance filtering thresholds built in, so further filtering is not required, but has its uses as @jwdebelius described. The pre-clustering quality filtering described in Bokulich 2013 still has its uses and is actually still used for filtering/trimming prior to deblur, but the post-clustering abundance thresholds are probably too stringent following denoising.

Kat · August 29, 2019, 5:28pm

Hi @jwdebelius and @Nicholas_Bokulich,

Thank you so much for your rapid answers and for your explanations! It is so helpful to have your support and to be able to ask for advice.

I will therefore not filter my data after denoising with dada2 for diversity analyses; it is useful to understand that abundance filtering is already built-in.

@jwdebelius, may I please ask for which kinds of feature-based analysis you would add this additional filtering step?

Thank you very much again!

Best wishes,

Kat

jwdebelius · August 30, 2019, 7:41am

Hi @Kat,

I tend to do filtering before I run something like ANCOM, SCNIC, Gneiss, or take my data out into Phylofactor or PhILR.

Best,
Justine

Kat · August 30, 2019, 8:22am

Hi @jwdebelius,

Great, thank you very much for your answer!

Best wishes,

Kat