Filtering out samples: diversity vs differential abundance


I'd like to know what you usually do when you need to filter out (rarefy) some samples with low read counts before diversity analysis, and then want to perform differential abundance analysis (ANCOM, gneiss, LEfSe, or whatever). Do you keep all samples for differential abundance, or run it only on the same samples used for diversity? Is that methodologically OK?

Thanks a lot!

Hi @KirKara,
Good question! First, to clarify: filtering and rarefying are two different things and are not mutually exclusive. When you rarefy, you set a minimum read threshold n. Any sample with fewer than n reads is discarded, and every remaining sample with more than n reads is randomly subsampled to n without replacement. Filtering, in the sense of removing low-frequency features or under-surveyed samples, can happen before or after rarefying, or not at all.
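To make the rarefying step above concrete, here is a minimal sketch in Python with NumPy. It is not how QIIME 2 implements it; the function name `rarefy`, the sample IDs, and the counts are all made up for illustration.

```python
import numpy as np

def rarefy(table, depth, seed=0):
    """Hypothetical helper: table maps sample ID -> list of per-feature read counts."""
    rng = np.random.default_rng(seed)
    rarefied = {}
    for sample_id, counts in table.items():
        counts = np.asarray(counts, dtype=np.int64)
        if counts.sum() < depth:
            # Fewer than `depth` reads: the sample is discarded entirely.
            continue
        # Draw exactly `depth` reads without replacement from this sample.
        rarefied[sample_id] = rng.multivariate_hypergeometric(counts, depth)
    return rarefied

# Made-up counts for three samples and three features.
table = {
    "lung_A": [700, 400, 100],     # 1,200 reads: kept and subsampled
    "lung_B": [5000, 3000, 2000],  # 10,000 reads: kept and subsampled
    "lung_C": [400, 50, 0],        # 450 reads: dropped
}
result = rarefy(table, depth=1000)
for sample_id, counts in result.items():
    print(sample_id, counts, counts.sum())
```

Every sample that survives ends up with exactly 1,000 reads, and `lung_C` is gone, which is why the choice of n decides both how many samples and how many reads you keep.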
All diversity analyses of microbiome data need to account for uneven sampling depth. Rarefying is one method that has historically been used because of its convenience. Other normalization methods are also quite common (DESeq2 and edgeR, for example); these do not rarefy but transform your data instead. And the most recent class of tools simply uses relative abundance data, which means you don't need to rarefy or transform your data separately (for example DEICODE, ANCOM, gneiss, breakaway, corncob, etc.).
That being said, just because you don't need to rarefy or transform your data for these tools doesn't mean a sample with 1,000 reads is of equal quality to one with 10,000. The latter is still much more reliable, and the former still comes with quite a bit of bias and uncertainty. Further, filtering to remove rare features is recommended, perhaps even needed, for some of these tools (for example ANCOM, gneiss) but not others (q2-breakaway). So at the end of the day you need to make sure that the method you are using deals with unequal sampling depth in some way, and that you tailor your data to each tool's assumptions and requirements.
Currently there are no normalization methods implemented in qiime2 (but hopefully there will be soon). You can rarefy your data using the feature-table rarefy command, and when using the core-metrics command you are required to provide a minimum read threshold via the --p-sampling-depth parameter. Hope that helps!

Many thanks @Mehrbod_Estaki, what a fantastic explanation! My problem is in fact similar to your example. We have a few samples with around 1,000 reads each, and we don't know how to handle them. We always filter out very low abundance features (<7 reads across all samples), but samples with a low number of reads are a much harder problem. We don't expect a lot of reads here, since the samples came from healthy human lung biopsies. We will check whether we obtain the same differential abundance results after eliminating some of those samples.
Thanks a lot!
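For readers following along, the low-abundance feature filter described above (drop features with fewer than 7 reads summed across all samples) could be sketched like this in Python. The helper name `filter_rare_features`, the feature IDs, and the counts are hypothetical; in a real workflow you would apply the equivalent filter to your own feature table.

```python
import numpy as np

def filter_rare_features(counts, feature_ids, min_total=7):
    """Keep features whose total reads across all samples is at least min_total.

    counts: 2-D array, rows = samples, columns = features (hypothetical layout).
    """
    counts = np.asarray(counts)
    keep = counts.sum(axis=0) >= min_total
    kept_ids = [fid for fid, k in zip(feature_ids, keep) if k]
    return counts[:, keep], kept_ids

# Made-up table: 3 samples x 3 features, with column totals 23, 7 and 3.
counts = np.array([
    [10, 2, 0],
    [5, 2, 3],
    [8, 3, 0],
])
feature_ids = ["ASV_1", "ASV_2", "ASV_3"]

filtered, kept_ids = filter_rare_features(counts, feature_ids, min_total=7)
print(kept_ids)
print(filtered)
```

Note that this filter only removes features. It leaves every sample in the table, so it does nothing about the low-depth samples themselves.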

Hi @KirKara,
You are totally right that this is a very challenging issue, but I do think you are on the right path. Understanding the biology behind your question is very important, and I think your approach is sound, since it is extremely difficult to develop guidelines and general rules that work in every case. For example, 1,000 reads from a source with very low diversity are a lot more reliable than 1,000 reads from a high-diversity source, so it's possible that 1,000 reads do actually capture the overall diversity of your healthy lung samples and you won't need to remove those samples. I think your eliminate-and-validate approach is a great idea! Good luck.
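As a rough illustration of the low- versus high-diversity point, here is a toy simulation in Python. It assumes an idealized community in which every feature is equally abundant, which real samples are not, and the feature counts (20 and 5,000) are arbitrary, so treat it only as a sketch of the intuition.

```python
import numpy as np

def fraction_observed(n_features, n_reads=1000, seed=0):
    """Fraction of a community's features seen at least once in n_reads reads."""
    rng = np.random.default_rng(seed)
    # Assumption: perfectly even community (all features equally abundant).
    probs = np.full(n_features, 1.0 / n_features)
    reads = rng.multinomial(n_reads, probs)
    return np.count_nonzero(reads) / n_features

low_diversity = fraction_observed(n_features=20)
high_diversity = fraction_observed(n_features=5000)
print(f"low-diversity source:  {low_diversity:.2f} of features observed")
print(f"high-diversity source: {high_diversity:.2f} of features observed")
```

With 1,000 reads, the 20-feature community is covered essentially completely, while the 5,000-feature community cannot be covered by more than a fifth, since each read can reveal at most one new feature.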

