Filtering method for FeatureData[Differential]

dwt · June 30, 2020, 5:12pm

Hi all,
As far as I can see, there are no methods for filtering FeatureData[Differential] artifacts, like those created by Songbird. I ran into this with Qurro: if you filter the table to include only your test-set samples, Qurro gives an error like:
"Of the 417 ranked features, 11 were not present in the input BIOM table"
because filtering to create the test-set table removes features that become all zero, rather than keeping them with zero counts.
I'd also suggest an option to switch that behavior off in the table filtering, because it can be unwanted in cases like this.
Devin

fedarko · July 3, 2020, 10:40pm

Hi Devin,

Thanks for the suggestions! (And sorry for the delay in responding to you, looks like I missed this.)

When developing Qurro, I always assumed that the table a user passed to Songbird would be the same as what they passed to Qurro. However, I can see now how wanting to test just your testing samples in Qurro could lead to these problems.

There are a few things here. First off, it sounds like your goal here is just seeing how your testing samples look in Qurro by themselves, as compared to both your training and testing samples? If this is the case, I think the ideal solution would be setting up something in Qurro that'd allow you to dynamically select which samples to include in the visualization. This would mean you wouldn't have to bother with manually filtering the table, etc. I've made note of this in a TODO here, although I probably won't be able to implement this for some time.

So in the meantime, although I don't think there's anything explicitly set up for filtering FeatureData[Differential]s, we can do this in Python using QIIME 2's artifact API. If you prefer R to Python, it is probably possible to do something similar with qiime2R... but I don't know R, so here I'm doing it in Python:

# NOTE: this should be run from within your QIIME 2 conda environment,
# otherwise the following two imports will probably fail
from qiime2 import Artifact
import pandas as pd

# Load differentials
diffs = Artifact.load("your-differentials.qza")
diffs_df = diffs.view(pd.DataFrame)

# Load feature table
tbl = Artifact.load("your-table.qza")
tbl_df = tbl.view(pd.DataFrame)

# In order to get the table DataFrame to have feature IDs as its
# "indices" (i.e. rows), we have to transpose it. (Or we could
# transpose the differentials and use axis="columns" below)
tbl_df_t = tbl_df.T

# Filter to just the shared feature IDs between the table and differentials
f_tbl_df_t, f_diffs_df = tbl_df_t.align(diffs_df, axis="index", join="inner")

# If you want, you can check f_diffs_df.shape here to verify that
# filtering was done properly

# Now, let's re-import the filtered differentials as a Q2 artifact and
# save it to a QZA file. Note that this will break provenance info :(
# (First, though, we need to change the index name to prevent
# Q2 from yelling at us)
f_diffs_df.index.name = "Feature ID"
f_diffs = Artifact.import_data("FeatureData[Differential]", f_diffs_df)
f_diffs.save("filtered-differentials.qza")
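
If you want to see what the align() step does before running it on real data, here is a minimal sketch on toy data. It only needs pandas; the feature IDs, sample IDs, and values are made up, and the two toy DataFrames are just stand-ins for the transposed table and the differentials.

```python
import pandas as pd

# Toy stand-in for the transposed feature table
# (features as rows, samples as columns). All IDs and values are made up.
tbl_df_t = pd.DataFrame(
    {"s1": [5, 0, 2], "s2": [1, 3, 0]},
    index=["f1", "f2", "f3"],
)

# Toy stand-in for the differentials: "f4" is ranked but absent from the table.
diffs_df = pd.DataFrame(
    {"Intercept": [0.1, -0.4, 0.7, 1.2]},
    index=["f1", "f2", "f3", "f4"],
)

# An inner join on the row index keeps only feature IDs found in both frames.
f_tbl_df_t, f_diffs_df = tbl_df_t.align(diffs_df, axis="index", join="inner")

# Features that were in the differentials but got filtered out
dropped = diffs_df.index.difference(f_diffs_df.index)
print(f_diffs_df.shape)
print(list(dropped))
```

On real data, the number of dropped features should match the count in Qurro's error message (11 in this case).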

You should be able to pass filtered-differentials.qza to Qurro, and from there things should work out. However, please note that I'm not sure that doing this sort of operation "normally" is a good idea -- this makes sense to me for the purposes of looking at just testing data, but if the table has been filtered in other ways (e.g. an entire category of samples was removed) then it might be better to just rerun Songbird at that point. (Also, this will cause the rank plot shown in Qurro to be different, since it won't have all of the ranking information computed by Songbird. However, for your data, this will just mean 11 features are missing, so probably not a huge deal.) I'll tag @mortonjt here in case he has any thoughts on that...

Just to check, you're filtering your table using qiime feature-table filter-samples or something similar, right? It looks like there's already an open issue for that plugin on adding an optional parameter to disable zero-frequency feature filtering, but it's a bit old -- I added some extra context about this to that thread.
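
For anyone following along, here is a rough sketch of why filtering samples can end up dropping features. This just imitates the behavior in pandas on a made-up table; it is not the plugin's actual implementation, and the sample and feature IDs are hypothetical.

```python
import pandas as pd

# Toy table with samples as rows and features as columns
# (the orientation you get when viewing a table as a DataFrame).
table = pd.DataFrame(
    {"f1": [4, 0, 2], "f2": [0, 0, 7], "f3": [1, 5, 0]},
    index=["train1", "test1", "train2"],
)

# Keep only the testing sample(s)
test_only = table.loc[["test1"]]

# Imitate zero-frequency feature filtering: drop features whose
# total count across the remaining samples is zero
test_filtered = test_only.loc[:, test_only.sum(axis=0) > 0]

# Features that disappeared as a side effect of the sample filtering
lost = table.columns.difference(test_filtered.columns)
print(list(lost))
```

Any feature in the "lost" list would still be present in the differentials, which is where the mismatch comes from.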

Hope this helps!

dwt · July 6, 2020, 12:54pm

Thanks Marcus,
Dynamically selecting samples would definitely be nice, especially if we can select by metadata, since we usually have a training/test column in the metadata.
I also have some misgivings about filtering differentials. When I worked around this issue, I used the artifact API to restore the columns removed by qiime feature-table filter-samples:

from qiime2 import Artifact
import pandas as pd

# Both tables load with samples as rows and features as columns
testing = Artifact.load("testing-table.qza").view(pd.DataFrame)
original = Artifact.load("table.qza").view(pd.DataFrame)

# Add back each feature that filtering dropped, as an all-zero column
for sv in set(original.columns) - set(testing.columns):
    testing[sv] = 0

Artifact.import_data("FeatureTable[Frequency]", testing).save("fixed-testing.qza")
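
The same zero-filling can probably also be done in one step with pandas' reindex(). Here is a minimal sketch on toy data (made-up IDs and counts, no QIIME 2 needed), assuming the original table's columns are a superset of the testing table's columns:

```python
import pandas as pd

# Toy stand-ins for the two tables (samples as rows, features as columns)
original = pd.DataFrame(
    {"f1": [4, 0], "f2": [0, 7], "f3": [1, 5]},
    index=["train1", "test1"],
)
testing = pd.DataFrame({"f3": [5]}, index=["test1"])

# Reindex the columns to the original feature set; features missing
# from the testing table are added back and filled with 0
fixed = testing.reindex(columns=original.columns, fill_value=0)
print(fixed)
```

One side effect is that the columns end up in the same order as in the original table, which the loop version doesn't guarantee.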