Differential abundance testing is a critical task in microbiome studies that is complicated by the sparsity of data matrices. DS-FDR can achieve higher statistical power to detect significant findings in sparse and noisy microbiome data compared to the commonly used Benjamini-Hochberg procedure and other FDR-controlling procedures.
including a list of differentially abundant taxa, test statistics, and raw p-values
$ qiime tools view dsfdr.qzv
Interpret the results
The first column of the results lists all the taxa in your data. The second column, with values of FALSE/TRUE, indicates whether the corresponding taxon was found to differ significantly between your groups of interest; TRUE means statistically significant. The third column gives the test statistic for each taxon, which can serve as a proxy for effect size. The fourth column gives the raw p-value for each taxon, that is, the p-value before FDR correction. Corrected p-values are not available for the DS-FDR method, because it uses test statistics rather than p-values to estimate the false discovery rate.
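To work with the table programmatically, the column layout described above can be read with a short script. This is a minimal sketch, not part of q2-dsfdr: it assumes the table has been exported as comma-separated text with one header row and the four columns in the order described. The function name significant_taxa, the header names, and the example rows are invented for illustration.

```python
import csv
import io


def significant_taxa(table_text, delimiter=","):
    """Return (taxon, statistic, raw_p) for rows flagged TRUE, largest |statistic| first.

    Assumes one header row and columns in this order: taxon, TRUE/FALSE flag,
    test statistic, raw p-value.
    """
    reader = csv.reader(io.StringIO(table_text), delimiter=delimiter)
    next(reader, None)  # skip the header row (assumed to be present)
    hits = []
    for row in reader:
        if len(row) < 4:
            continue  # ignore blank or malformed lines
        taxon, flag, statistic, raw_p = row[:4]
        if flag.strip().upper() == "TRUE":
            hits.append((taxon, float(statistic), float(raw_p)))
    # Rank by magnitude of the test statistic, used here as an effect-size proxy.
    hits.sort(key=lambda hit: abs(hit[1]), reverse=True)
    return hits


# Invented example rows; real taxon IDs, headers, and values will differ.
example = """taxon,reject,statistic,pvalue
OTU_1,TRUE,2.5,0.001
OTU_2,FALSE,-0.3,0.62
OTU_3,TRUE,-4.1,0.0005
"""
hits = significant_taxa(example)
for taxon, statistic, raw_p in hits:
    print(taxon, statistic, raw_p)
```

Sorting by the absolute value of the statistic is only one possible ordering; the raw p-values are carried along for reference and are not used to decide significance.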
Source Code
If you want to look deeper into the DS-FDR method, the Python source code is available here.
Thanks for this community tutorial, and for the q2-dsfdr tutorial @serenejiang! I noticed while testing this out that the resulting artifact (haddad.dsfdr.qza) is of semantic type SampleData[AlphaDiversity]. This is misleading as this isn’t alpha diversity data, and the differential abundance test is based on features not samples, so these are feature data (not sample data). This could therefore end up not being usable by researchers wanting to integrate this with other QIIME 2 feature data.
Would you be willing to update your plugin to define a new semantic type for this data? That could be something like FeatureData[DS-FDR-differentially-abundant], but other folks who are more familiar with this method might have a better idea. Also, are there stats that would be useful for users to have access to? As far as I can tell, this only outputs a boolean value for each feature - it would be helpful to expand on this if possible, to give users some more information to use in interpreting the results of the run.
@gregcaporaso, very good point. I believe @ebolyen and I have discussed this in the past in the context of metadata.
Something that we may want to be cautious about is creating a whole bunch of semantic types that later become difficult to manage. Imagine if we had to create semantic types for every possible sample metadata category?
It may make sense to have this as a column of feature metadata. This will make it really nice to incorporate into the upcoming biplots and tree visualizations, since we could literally just pass in a whole table of metadata for the features.
No strong feelings, but just something to keep in mind for future development.
Thanks for making those changes @serenejiang, I think these make the plugin more useful and usable.
I have a few other suggestions for improvements, but these aren't urgent (just think of these as suggestions for how this can be improved in the future).
If you make two minor changes to your CSV file format, the resulting file would be viewable and usable as QIIME 2 feature metadata, which would allow these results to be integrated into the upcoming biplots and tree visualizations, as @mortonjt suggested. I recommend making this change now, as it's really simple and it will help users integrate dsfdr into their workflows (which means more users of q2-dsfdr and more citations for your paper). These changes are: replace commas with tabs (i.e., make this tab-separated text instead of comma-separated text); and include feature-id as the header for the first column. I've attached a modified version of this file that includes those changes as an example (dsfdr.tsv (138.3 KB)). I also ran qiime metadata tabulate on this file, which illustrates one way that it could be viewed as QIIME 2 feature metadata. The command I ran was:
$ qiime metadata tabulate --m-input-file dsfdr.tsv --o-visualization dsfdr-as-feature-metadata.qzv
It produced this visualization: (dsfdr-as-feature-metadata.qzv (1.2 MB)).
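For reference, the two format changes could be made with something like the following. This is only a sketch using Python's standard csv module, not code from q2-dsfdr: it assumes the existing file has a header row, and the function name csv_to_feature_metadata, the header names, and the example rows are made up for illustration.

```python
import csv
import io


def csv_to_feature_metadata(csv_text):
    """Convert comma-separated text to tab-separated text whose first header is 'feature-id'.

    Assumes the first row of the input is a header row.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows or not rows[0]:
        return ""
    rows[0][0] = "feature-id"  # change 2: name the first column feature-id
    out = io.StringIO()
    # change 1: write tabs instead of commas
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


# Invented example content; the real column headers may differ.
example = "taxa,reject,statistic,pvalue\nOTU_1,True,2.5,0.001\nOTU_2,False,-0.3,0.62\n"
converted = csv_to_feature_metadata(example)
print(converted)
```

In the plugin itself it would probably be simpler to write the file as tab-separated text in the first place rather than converting afterwards.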
You don't have your citation associated with this visualizer or plugin:
$ qiime dsfdr --citations
No citations found.
$ qiime dsfdr permutation-fdr --citations
No citations found.
See here and here for an example of how to add a citation to your plugin. Without this information, users will have a harder time knowing how to cite your work.
I would recommend displaying your output table in the visualization that you generate, rather than just providing a download link. This isn't essential, but I imagine that users will like to see that.
Does the sign of the test statistic indicate which group a feature is more abundant in? I can see from this output that there are many differentially abundant features, but I don't know which group they are more abundant in. It would be good to discuss how the user can get that information in your "Interpret the results" text. I personally like how the ANCOM visualization provides a five-number summary of the abundance of each feature in each group. This also isn't essential, but it might save you time in providing user support for q2-dsfdr.
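To make the five-number summary idea concrete, here is a rough sketch of a per-group summary for a single feature. It is not how ANCOM or q2-dsfdr compute anything: the function names, the dict-based inputs, and the toy abundances are all assumptions for illustration, and the quartiles use the "inclusive" method from Python's statistics module (other quartile conventions give slightly different values).

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of at least two numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


def summarize_by_group(abundances, sample_groups):
    """Summarize one feature's abundances separately for each group.

    abundances: dict mapping sample ID -> abundance of the feature
    sample_groups: dict mapping sample ID -> group label
    """
    by_group = {}
    for sample_id, value in abundances.items():
        by_group.setdefault(sample_groups[sample_id], []).append(value)
    return {group: five_number_summary(vals) for group, vals in by_group.items()}


# Toy data for one hypothetical feature across six samples.
abundances = {"s1": 0, "s2": 2, "s3": 4, "s4": 10, "s5": 20, "s6": 30}
sample_groups = {"s1": "control", "s2": "control", "s3": "control",
                 "s4": "treated", "s5": "treated", "s6": "treated"}
summary = summarize_by_group(abundances, sample_groups)
for group, stats in summary.items():
    print(group, stats)
```

Comparing the medians (or the full summaries) across groups would tell the user which group a significant feature is more abundant in, independent of how the sign of the test statistic is defined.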