Filter feature table phylogenetically

Stefan · June 6, 2018, 12:32pm

The SEPP function of the q2-fragment-insertion plugin takes a set of representative sequences and a reference phylogeny as inputs and returns another phylogenetic tree with additional tips. Representative sequences are typically the features of a Deblur / DADA2 feature-table. A small percentage of representative sequences is sometimes rejected by SEPP because their sequence similarity to the reference is too low. Users therefore need a function that filters their feature-table so that it only includes the features SEPP integrated into the phylogenetic tree (otherwise downstream analyses like diversity calculation will fail).

Programming such a function isn’t that hard, but I do have some design questions that I would like to discuss with you:

1. Since this is quite a general function, would it be better suited to the feature-table plugin?
2. Any suggestions for the function name? We have filter-features in the feature-table plugin. How about filter-features-phylogeny?
3. The user might be interested in the ratio of reads belonging to the rejected features. How should the function report this? I see several options:
   a) completely ignore it and leave it to the user to compare the original and the filtered table
   b) return two tables, one for the kept features and one for the removed features. This would give maximal insight into the filtered reads per sample per feature
   c) report summaries on stdout, like the absolute number and ratio of lost reads per sample
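
To make options b) and c) concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the feature table is modeled as a nested dict instead of a real FeatureTable artifact, and the names `split_by_tips` and `lost_read_ratio` are hypothetical, not part of any QIIME 2 plugin.

```python
from typing import Dict, Set, Tuple

# Assumption: feature ID -> {sample ID -> read count}; a stand-in for a real table.
Table = Dict[str, Dict[str, int]]


def split_by_tips(table: Table, tips: Set[str]) -> Tuple[Table, Table]:
    """Option b): return (kept, removed) tables based on the tree's tip IDs."""
    kept = {f: counts for f, counts in table.items() if f in tips}
    removed = {f: counts for f, counts in table.items() if f not in tips}
    return kept, removed


def lost_read_ratio(table: Table, removed: Table) -> Dict[str, float]:
    """Option c): fraction of each sample's reads that sit in removed features."""
    totals: Dict[str, int] = {}
    lost: Dict[str, int] = {}
    for counts in table.values():
        for sample, n in counts.items():
            totals[sample] = totals.get(sample, 0) + n
    for counts in removed.values():
        for sample, n in counts.items():
            lost[sample] = lost.get(sample, 0) + n
    # Samples without any reads get a ratio of 0.0 to avoid dividing by zero.
    return {s: (lost.get(s, 0) / t if t else 0.0) for s, t in totals.items()}


# Hypothetical example data: feature "f2" was rejected and is not a tip.
table = {
    "f1": {"s1": 90, "s2": 40},
    "f2": {"s1": 10, "s2": 0},
    "f3": {"s1": 0, "s2": 10},
}
tips = {"f1", "f3"}

kept, removed = split_by_tips(table, tips)
ratios = lost_read_ratio(table, removed)
for sample in sorted(ratios):
    print(f"{sample}: {ratios[sample]:.1%} of reads lost")
```

With two returned tables, the summary of option c) can always be derived afterwards, so b) is the more general of the two.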

I am happy about any comments, suggestions, or critique!

thermokarst · June 7, 2018, 1:57pm

Hey there @Stefan! The need for this kind of filtering makes sense! I wonder if we can approach this via a different route though - instead of adding a new Action to some plugin, what if we take advantage of an existing action - feature-table filter-features? This filtering method currently allows users to filter based on metadata IDs, which is pretty general purpose. What if we defined a new transformer that allows us to view Phylogeny[Rooted | Unrooted] artifacts as QIIME 2 Metadata? We could transform the phylogenetic tree into, say, a list of feature IDs, and if there was any other information worth including (like length), we could stash that in the metadata, too. Then, we would get a few things for free: basically anywhere that metadata (or feature metadata) is accepted, a phylogenetic tree could be passed in, too! Thoughts? I might be barking up the wrong tree here, or overlooking something, so feel free to correct me!

cc @ebolyen - do you have any thoughts on this?

Stefan · June 8, 2018, 6:32am

Hi @thermokarst, as a long-term Haskell user, I am a big fan of strong typing and type semantics. I feel that “converting” a Phylogeny type into a Metadata type would be a harsh violation of this concept.

Also, forcing the user to call two functions seems to be too inconvenient.

I would expose another function to the user that takes the feature-table and the phylogeny as input, collects tips from the phylogeny and internally calls the feature-table filter-features function. But that is only my opinion, and maybe I am not considering all the long-term maintenance aims of QIIME 2.

thermokarst · June 8, 2018, 12:40pm

Us too! That is why QIIME 2 has a Semantic Type system (while extending that idea into things like Formats and transformers)!

I suspect I haven't done a good job explaining my proposal of identifier-based filtering, or how it would work under the hood, so to speak, because I think we are actually on the same page when it comes to some of the mechanics of how this could work.

Right now, qiime feature-table filter-features has an optional parameter for metadata-based filtering:

  --m-metadata-file MULTIPLE PATH
                                  Metadata file or artifact viewable as
                                  metadata. This option may be supplied
                                  multiple times to merge metadata. Feature
                                  metadata used with `where` parameter when
                                  selecting features to retain, or with
                                  `exclude_ids` when selecting features to
                                  discard.  [optional]

So, you can use a traditional Metadata TSV file here, or you can provide an "artifact viewable as metadata". How the first option (TSV-style) works is pretty clear, I think, but the second is a little more interesting to me. Artifacts viewable as metadata retain their semantic type, but through the transformation system they are viewed by the filter-features method as Metadata! Nothing in the user's original data has been converted or modified.

Here is what that looks like right now:

Filtering with a traditional TSV metadata file:

qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file feature-metadata.tsv \
  --o-filtered-table filtered-table.qza

Filtering with a FeatureData[Taxonomy] artifact (this is currently supported, because the format that represents this type is viewable as metadata):

qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file taxonomy.qza \
  --o-filtered-table filtered-table.qza

This is still only a one-step command for the user; there is no need for them to "convert" their taxonomy data beforehand - the type system, transformer system, and formats all know how to work together with the view API to make this happen! In the plugin, the registered method's signature asks for qiime2.Metadata, so it receives a consistent object every time.
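
As a rough mental model of that mechanism, here is a toy sketch in plain Python. To be clear about the assumptions: none of these classes or functions are the real QIIME 2 implementation; `TaxonomyFormat`, `Metadata`, `register_transformer`, `view`, and `filter_features` are simplified stand-ins that only mimic the idea of looking up a registered transformer by its type annotations.

```python
from typing import Callable, Dict, List, Tuple

# Toy stand-ins (assumptions), NOT the real QIIME 2 classes.
class TaxonomyFormat:
    def __init__(self, rows: Dict[str, str]):
        self.rows = rows  # feature ID -> taxon string


class Metadata:
    def __init__(self, ids: List[str]):
        self.ids = ids


# Registry keyed by (source type, target type).
_TRANSFORMERS: Dict[Tuple[type, type], Callable] = {}


def register_transformer(func: Callable) -> Callable:
    """Register func under the types named in its annotations."""
    hints = dict(func.__annotations__)
    target = hints.pop("return")
    (source,) = hints.values()  # exactly one input parameter is expected
    _TRANSFORMERS[(source, target)] = func
    return func


@register_transformer
def _taxonomy_to_metadata(data: TaxonomyFormat) -> Metadata:
    return Metadata(sorted(data.rows))


def view(obj, target: type):
    """Return obj as `target`, using a registered transformer if needed."""
    if isinstance(obj, target):
        return obj
    return _TRANSFORMERS[(type(obj), target)](obj)


def filter_features(table: Dict[str, int], metadata: Metadata) -> Dict[str, int]:
    """The method only ever sees Metadata, whatever the user passed in."""
    keep = set(metadata.ids)
    return {f: n for f, n in table.items() if f in keep}


taxonomy = TaxonomyFormat({"f1": "k__Bacteria", "f3": "k__Archaea"})
table = {"f1": 5, "f2": 7, "f3": 1}
filtered = filter_features(table, view(taxonomy, Metadata))
print(filtered)
```

The point of the sketch is that `taxonomy` itself is left untouched: the filter receives a Metadata view of it, and supporting a new input type only means registering one more transformer.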

So, suppose we defined a transformer for converting a phylogeny format to Metadata:

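import pandas as pd
import qiime2

@plugin.register_transformer
def _1(data: NewickFormat) -> qiime2.Metadata:
    # Placeholder helper: load the tree and collect per-tip information
    data = _util_to_load_and_convert_tree_to_table(data)
    df = pd.DataFrame(data)
    # The df index would be the tip IDs (named e.g. 'id', as Metadata requires)
    return qiime2.Metadata(df)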

The transformer above would basically do what you proposed above (collect the tips from the phylogeny), plus whatever else might make sense generally.
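
For illustration, here is a minimal sketch of what a helper like the placeholder `_util_to_load_and_convert_tree_to_table` might compute. It rests on several assumptions: it works on a Newick string instead of a NewickFormat file, it only handles plain, unquoted labels without comments, and the name `tips_to_table` is hypothetical. A real transformer would more likely rely on a proper tree parser such as scikit-bio's TreeNode.

```python
import re
from typing import Dict, Optional

# A tip label directly follows "(" or ","; internal node labels follow ")".
# Assumption: labels are unquoted and contain no Newick comments.
_TIP = re.compile(r"[(,]\s*([^(),:;\s]+)(?::([0-9.eE+-]+))?")


def tips_to_table(newick: str) -> Dict[str, Optional[float]]:
    """Map each tip name to its branch length (None if no length is given)."""
    table: Dict[str, Optional[float]] = {}
    for name, length in _TIP.findall(newick):
        table[name] = float(length) if length else None
    return table


# Hypothetical example tree with two internal nodes ("n1" and "root").
tree = "((a:0.1,b:0.2)n1:0.3,c:0.4)root;"
tip_table = tips_to_table(tree)
print(sorted(tip_table))
```

In this sketch internal nodes are skipped, so only tips would end up as rows of the resulting metadata.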

Then, any user interested in filtering their feature table based on the IDs present in a phylogenetic tree (Phylogeny[Rooted | Unrooted]) could run the following:

qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file tree.qza \
  --o-filtered-table filtered-table.qza

So that would be a one-stop shop for them: they would get to retain the tree, untouched, but the filter-features method would be able to grab the IDs out of the tree in a consistent manner. Plus, Phylogeny[Rooted | Unrooted] artifacts would now generally be viewable as metadata, which means that other methods that can consume metadata for their work (often utilizing IDs for coordination) can now take advantage of this! This also means that there is only one place in the code that is responsible for creating a dataframe of tip IDs, rather than implementing it in individual methods. Transformers are global in the QIIME 2 ecosystem.

I hope I have made my proposal a bit more clear, but if not, @ebolyen can probably help answer any more questions or concerns! Thanks for entertaining this discussion!

Stefan · June 8, 2018, 1:01pm

Thanks for this longer explanation. Now it makes sense, and I no longer have a bad feeling about your approach. In general, you are saying that we don’t need to worry about the filtering method, only about the transformation that makes a Phylogeny viewable as metadata. Still, I have two open questions:

1. Shouldn’t we then add this transformer to the phylogeny plugin?
2. Thinking about the two-dimensional character of the transformed data, I assume we want to have features as rows. Does that include internal nodes? What happens if qiime feature-table filter-features gets features as input which do not exist in the --i-table? What other information, besides the node names, should we add as additional columns?

ebolyen · June 8, 2018, 4:34pm

Hi @Stefan and @thermokarst!

I think that was a good explanation!

We tend to centralize formats/types/generally useful transformers in the q2-types plugin to make inter-plugin dependencies easier to deal with. Everyone can depend on just q2-types instead of having a tangle of dependencies (mostly...).

This is where I'm also not 100% certain about the idea: things like FeatureData[...] and SampleData[...] by their nature tend to be compatible with a tabular representation, but a table for a Phylogeny[...] does seem a little unnatural.

It seems like the key idea is really the IDs, in which case it kind of hearkens back to some earlier ideas of having an "Index" of sorts which could be used generically (just like a table of data). In this case, the index would be the tips of the tree.

thermokarst · June 14, 2018, 1:02pm

Thanks @ebolyen!

Yep, totally agree - I just wanted to make sure we explored this option!

Lichen · October 18, 2018, 1:25am

Hello,

I’ve run into this issue and just wanted to check in to see if there was anything released to perform this function.

Thanks in advance.

Stefan · October 18, 2018, 8:05am

The latest version of the q2-fragment-insertion plugin should ship with a function qiime fragment-insertion filter-features. Try qiime fragment-insertion filter-features --help to get information about how to use it, and let me know if it is useful or where / how I should modify it.
Thanks,
Stefan

Lichen · October 19, 2018, 3:03am

Hi Stefan,

Thanks for your note and the information. This is most useful and I don’t yet have suggestions for modifications.

Best wishes,

Justin