How to decide on filtering criteria for rare taxa

fquerdasi · August 16, 2021, 11:17pm

Hi!

I am relatively new to microbiome analysis, and would appreciate any opinions or thoughts on this topic.

I am at the stage where I am filtering rare taxa/OTUs from my dataset before analysis, and am struggling to figure out 1) what is the best criterion for filtering? 2) how strict should I be with filtering, i.e., what is a reasonable percentage of taxa in your dataset to filter out/reasonable final number of taxa to have?

From what I can tell, the decision about which criterion to use is very dependent on the characteristics of one's dataset. My dataset has 470 samples and 1,230 taxa (before filtering). When I keep only OTUs that are non-zero in at least 10% of samples, I retain 233 out of 1,230 taxa. Is 233 taxa a reasonable number on which to perform diversity and biomarker analyses? If not, what would be more reasonable? Does anyone have resources on determining filtering criteria that they've found particularly helpful and wouldn't mind sharing?

Thank you! Any help is much appreciated!
Fran

colinbrislawn · August 17, 2021, 2:13pm

Hello Fran,

While there is evidence to suggest that independent filtering increases detection power for high-throughput experiments, and QIIME 2 provides a plugin to do it, I don't think there's a consensus about filtering.

There's no perfect threshold because this is a trade-off between common and rare taxa, and reviewer three will always want something different.

So why use just one threshold? Why not look for biomarkers in ubiquitous taxa that appear in most samples, and also biomarker taxa that appear only in specific groups of interest?

In diversity analysis, you can also look at rare taxa and common taxa without any filtering, just by using different metrics. Unweighted UniFrac looks at all taxa equally, so it is sensitive to changes in rare taxa, while Weighted UniFrac is weighted by abundance, so it is biased towards changes in common taxa. Comparing differences in Weighted vs Unweighted shows whether more changes are happening in rare or common taxa, no filtering required!
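To make the presence/absence vs abundance-weighted contrast concrete, here is a rough sketch in plain Python. It is not UniFrac: UniFrac needs a phylogenetic tree, so this swaps in two non-phylogenetic analogues, Jaccard distance (presence/absence) and Bray-Curtis dissimilarity (abundance-weighted). The two samples and their taxon names are made up for illustration.

```python
def jaccard_distance(a, b):
    """Presence/absence distance: every observed taxon counts equally."""
    present_a = {t for t, c in a.items() if c > 0}
    present_b = {t for t, c in b.items() if c > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


def bray_curtis(a, b):
    """Abundance-weighted dissimilarity: dominated by common taxa."""
    taxa = set(a) | set(b)
    total = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    if total == 0:
        return 0.0
    diff = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    return diff / total


# Hypothetical samples: nearly identical common taxa, different rare taxa.
s1 = {"common1": 500, "common2": 480, "rare1": 2, "rare2": 1}
s2 = {"common1": 510, "common2": 470, "rare3": 3, "rare4": 1}

jaccard = jaccard_distance(s1, s2)
bray = bray_curtis(s1, s2)
print(f"Jaccard: {jaccard:.3f}, Bray-Curtis: {bray:.3f}")
```

In this toy case the presence/absence metric reports a large difference while the abundance-weighted one reports a very small one, which is the signature of changes concentrated in rare taxa.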

I hope this helps!
Colin

ChrisKeefe · August 17, 2021, 5:41pm

Why do you want to get rid of rare taxa?

As far as I'm aware, there is no inherent value in throwing away biological data. Sometimes people filter data for resource/performance reasons - if you can get away without doing this, you should. Waste not, want not, right? Further, everyone's data is different, so "reasonable percentages" might make your data look more "normal" at the high cost of its uniqueness.

When possible, try to think about your data in biological terms

There is a certain amount of noise inherent to the data we get from sequencing technologies, and we often use tools like denoising and filtering to attempt to reduce noise, so that the data we analyze better represents the actual biological community under study. "Will process X improve the fidelity of my data?" is often a useful razor for deciding how, and how much, to filter.

Like the trade-offs Colin mentions between common and rare taxa, there is an inherent tension between noise reduction and the risk of removing rare organisms. The goal is to remove data that is not meaningful to our study, without introducing new biases. A few common approaches to this idea follow, but it's ultimately up to you to decide whether any of these are appropriate to your work, and to justify your choices.

Filtering out samples with unusually low read counts

If you expect a relatively consistent sampling depth, it might be reasonable to drop samples with fewer than some threshold number of reads, with the rationale that there was a failure of some kind in sampling or processing, and they are likely to poorly represent the community sampled.

If 94 of my 96 samples have over 10k reads, and two have fewer than 1000 reads, I might use `qiime feature-table filter-samples` to drop those two because I think they are compromised.
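The logic behind that kind of depth filter is simple enough to sketch in plain Python. This is only an illustration on a made-up table (sample and feature IDs are hypothetical), not what QIIME 2 runs internally. In QIIME 2 itself, the minimum total count per sample is set with the `--p-min-frequency` option of `filter-samples`.

```python
# Hypothetical toy table: sample ID -> {feature ID -> read count}.
table = {
    "s1": {"otu_a": 9000, "otu_b": 3000},
    "s2": {"otu_a": 400, "otu_b": 100},
    "s3": {"otu_a": 7000, "otu_c": 5000},
}


def filter_samples_by_depth(table, min_reads):
    """Keep samples whose total read count is at least min_reads."""
    return {
        sample: counts
        for sample, counts in table.items()
        if sum(counts.values()) >= min_reads
    }


kept = filter_samples_by_depth(table, min_reads=1000)
print(sorted(kept))  # s2 has only 500 reads in total, so it is dropped
```

Where to set `min_reads` is the judgment call: looking at the distribution of per-sample totals first makes it easier to argue that the dropped samples are failures rather than just shallow.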

Filtering out artifactual features

If you expect some consistency across sampled communities, it might be reasonable to drop features that appear in fewer than some threshold number of samples, with the rationale that those features are probably artifacts.

If I had 100 mouse fecal samples, from a colony of mice raised together in controlled conditions, I might use `qiime feature-table filter-features` to filter out any features that appear in only one sample, because I think it is unlikely that only one mouse, at one period of time, would host that organism.
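As a minimal sketch of what a prevalence filter like this does, assuming a made-up table of three mice (all IDs are hypothetical), the plain-Python version below counts how many samples each feature appears in and drops features below a threshold. In QIIME 2, the corresponding setting is the `--p-min-samples` option of `filter-features`.

```python
# Hypothetical toy table: sample ID -> {feature ID -> read count}.
table = {
    "mouse1": {"otu_a": 120, "otu_b": 30, "otu_x": 2},
    "mouse2": {"otu_a": 90, "otu_b": 0},
    "mouse3": {"otu_a": 150, "otu_b": 45},
}


def filter_features_by_prevalence(table, min_samples):
    """Keep features with a non-zero count in at least min_samples samples."""
    prevalence = {}
    for counts in table.values():
        for feature, count in counts.items():
            if count > 0:
                prevalence[feature] = prevalence.get(feature, 0) + 1
    keep = {f for f, n in prevalence.items() if n >= min_samples}
    return {
        sample: {f: c for f, c in counts.items() if f in keep}
        for sample, counts in table.items()
    }


filtered = filter_features_by_prevalence(table, min_samples=2)
remaining = sorted({f for counts in filtered.values() for f in counts})
print(remaining)  # otu_x appears in only one mouse, so it is removed
```

A fractional criterion works the same way once converted to a sample count: the 10% cut-off on 470 samples from the original question corresponds to `min_samples=47`.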

Filtering out contamination

Host sequences, or sequences from contaminants introduced during sampling, lab work, or sequencing, may impact your ability to draw conclusions about the microbial community you're trying to study. If you're getting good results without removing contamination, many people avoid dealing with it entirely - this is not an easy problem to solve. Try searching this forum for posts on contamination removal if you find you need to remove suspected contaminants, but I think that's beyond our scope here.

Good luck!