Filter by group or entire dataset?

In general, would it be advisable to filter by group if your dataset contains multiple sample types?

For example, when working with data from one sample type, I typically filter out features that are not found in >95% of samples. I’d like to do this same type of filtering, but the dataset I’m working with now is a composite from multiple body sites and environmental samples. All were run on the same lane, so I know that cross-talk might be a problem, and I can handle that by filtering features that have a minimum count/abundance across all samples. However, I’m not exactly sure what the advised filtering protocol is for a dataset with multiple sample types.

My thought right now is:

  1. Filter (present in a minimum # of samples) within each sample type
  2. Filter (minimum # of reads) across all samples

Does this seem reasonable? Any suggestions would be appreciated!
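For concreteness, here is a minimal pandas sketch of that two-step plan. The table layout (features × samples), the `sample_type` mapping, and the thresholds are all illustrative assumptions on my part, not an established protocol:

```python
import pandas as pd

def two_step_filter(table, sample_type, min_prevalence=0.95, min_reads=10):
    """table: features (rows) x samples (columns), raw counts.
    sample_type: dict mapping sample id -> sample type (hypothetical metadata).
    Thresholds are illustrative, not recommendations."""
    # collect sample ids per sample type
    groups = {}
    for sample, stype in sample_type.items():
        groups.setdefault(stype, []).append(sample)

    # Step 1: keep a feature if it is present in >= min_prevalence of the
    # samples within at least one sample type
    keep = pd.Series(False, index=table.index)
    for stype, samples in groups.items():
        prevalence = (table[samples] > 0).mean(axis=1)
        keep |= prevalence >= min_prevalence
    filtered = table.loc[keep]

    # Step 2: drop features below a minimum total count across all samples
    return filtered.loc[filtered.sum(axis=1) >= min_reads]

# made-up toy table: two gut samples, two soil samples
demo = pd.DataFrame(
    {'s1': [5, 0, 1], 's2': [6, 0, 0], 's3': [0, 7, 2], 's4': [0, 8, 3]},
    index=['featA', 'featB', 'featC'])
types = {'s1': 'gut', 's2': 'gut', 's3': 'soil', 's4': 'soil'}
result = two_step_filter(demo, types, min_prevalence=0.95, min_reads=10)
# featC passes the within-soil prevalence step but is dropped by the
# global minimum-read step (total count 6 < 10)
```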

I am also interested in this, so I’m following the thread.

Right now I am filtering by sub-groups within the run, but it’s too time-consuming, because the sub-groups and the number of samples inside them can be different in every run.
Depending on the sample matrix (soil, etc.), my filter throws away features below a threshold of 100 counts or 0.5% of the summed counts across all samples.
Those filters were literally made up by me based on what I see in the results.


Hi @smreyes,

I would start by running a PCoA to check whether there are differences between the sample types globally. You’d want to rarefy your data first, so no filtering is needed (unless you want to exclude singletons or something). Then, if you see a difference, I would split by body site and analyze each separately.
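As a rough illustration of that check, here is a numpy/scipy sketch that rarefies a toy count table, computes Bray-Curtis distances, and runs a classical-MDS PCoA. The counts are made up, and in practice you would use your pipeline’s own rarefaction and ordination tools rather than this hand-rolled version:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)

def rarefy(counts, depth):
    # subsample each sample (row) to the same depth without replacement
    out = np.zeros_like(counts)
    for i, row in enumerate(counts):
        pool = np.repeat(np.arange(len(row)), row)
        picked = rng.choice(pool, size=depth, replace=False)
        out[i] = np.bincount(picked, minlength=len(row))
    return out

def pcoa(dm):
    # classical metric MDS on a square distance matrix
    n = dm.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (dm ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    pos = vals > 0
    return vecs[:, pos] * np.sqrt(vals[pos])

# made-up counts: two "gut-like" and two "soil-like" samples
counts = np.array([[40, 10, 0, 5],
                   [35, 12, 1, 8],
                   [0, 3, 50, 20],
                   [2, 1, 45, 30]])
depth = counts.sum(axis=1).min()
rare = rarefy(counts, depth)
dm = squareform(pdist(rare, metric='braycurtis'))
coords = pcoa(dm)  # samples x axes; plot axis 1 vs 2, colored by sample type
```

If the points cluster by sample type on the first couple of axes, that supports splitting the analysis by site.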


Thanks, @jwdebelius. I have a few clarifying questions:

Are you suggesting that if the PCoA didn’t show a lot of separation, then filtering features that were not present in at least 95% of all samples is a more appropriate filtering strategy than filtering features that were not present in at least 95% of the other samples in the group?

If so, why? I guess I don’t understand why a PCoA of rarefied values (which is unlikely to include rare taxa) would be a good justification for removing rare taxa from all samples or samples by group.

Any additional clarification would be greatly appreciated! Thanks!

Hi @smreyes,

So, I’d start by making a PCoA with your favorite metric on rarefied data to see if there’s separation by sample type. That will help you get a feel for whether you should separate by sample type or not. Given your criteria, something like Jaccard should answer your question.

If you decide to analyze by sample type, I’d work with your data from there and filter for feature-based analyses. (I would still encourage alpha and beta diversity on the unfiltered, rarefied data, to check and make sure you actually have differences in the forest before you go staring at trees.)

Second, something I meant to mention earlier: I think your filtering criterion is very stringent, possibly too stringent. Human microbiome data tends to be sparse for a lot of reasons. The end result is that in a population of 100 adults, I might find my most prevalent OTU/ASV in only 80 of them, with, I think, a power-law decrease after that. So, I’d recommend relaxing your filtering criteria, especially in humans. I tend to have a lot of success filtering to 5–10% prevalence (contingent on sample size, complexity, and model).

My rule of thumb for checking that a filter is okay is to run a Procrustes analysis comparing a rarefied distance matrix of the original data to a rarefied distance matrix of the filtered data. Your correlation with Bray-Curtis distance should be high (you want the Mantel test for this; I target about 90% or higher), and you should have decent correlation with any other metrics of interest. I like UniFrac.
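A rough sketch of that distance-matrix comparison, using a simple permutation Mantel test in numpy/scipy. The toy tables are made up, and this hand-rolled test is only for illustration; libraries like scikit-bio ship ready-made Mantel and Procrustes implementations:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def mantel(dm1, dm2, permutations=999, seed=0):
    # Pearson correlation between the upper triangles of two square
    # distance matrices, with a permutation p-value
    rng = np.random.default_rng(seed)
    iu = np.triu_indices_from(dm1, k=1)
    x, y = dm1[iu], dm2[iu]
    r_obs = np.corrcoef(x, y)[0, 1]
    count = 0
    n = dm1.shape[0]
    for _ in range(permutations):
        p = rng.permutation(n)
        r = np.corrcoef(dm1[np.ix_(p, p)][iu], y)[0, 1]
        if abs(r) >= abs(r_obs):
            count += 1
    return r_obs, (count + 1) / (permutations + 1)

# made-up example: same four samples, with one rare feature (last column)
# removed by filtering
orig = np.array([[40, 10, 5, 1],
                 [35, 12, 8, 0],
                 [0, 3, 20, 2],
                 [2, 1, 30, 1]])
filt = orig[:, :3]
dm_o = squareform(pdist(orig, metric='braycurtis'))
dm_f = squareform(pdist(filt, metric='braycurtis'))
r, p = mantel(dm_o, dm_f)
# r near 1 suggests the filter preserved the overall distance structure
```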

So, as a quick summary, I would filter by site, but relax my filtering criteria a lot.

Finally, I apologize for mistakes, I’m on my phone and autocorrect isn’t my friend today.