Considerations and downstream implications of defining new output SemanticTypes

cduvallet · January 29, 2018, 10:17pm

I have a specific question related to developing the percentile normalization plugin, but I think it also opens a more general conversation about downstream compatibility considerations when developing plugins.

What are the consequences of defining new semantic types as outputs to custom plugins? For example, in our plugin the output data is a feature table with OTUs in samples converted to percentiles of their respective distribution in controls (more info in our preprint). So technically, the output should not be a FeatureTable[RelativeFrequency] and should instead be something like a FeatureTable[PercentileNormalized]. However, I'm wondering what the downstream implications of that are.

For example, we'll want users to be able to use these feature tables in most downstream applications (e.g. differential abundance testing, PCA plotting, etc). Will creating a new output type require many changes to downstream functions to allow for compatibility? Are there any significant drawbacks to having this output be a generic FeatureTable[RelativeFrequency] even though that's not technically what it contains?

Thanks!

ebolyen · January 30, 2018, 4:18pm

Hi @cduvallet!

These are awesome questions!

The implication is that your output will be guarded from entering methods which do not understand that particular type. This is usually a really good thing.

Creating a new output type will require downstream changes, but only in the most trivial way. Something that currently accepts FeatureTable[RelativeFrequency] would now accept FeatureTable[RelativeFrequency | PercentileAbundance] (if that made sense for that particular action).

This is essentially what we expect to happen over time. As new types emerge, it isn't always certain that a given type makes sense for a given action (even if the underlying representation is compatible, say a fasta file or a biom table). So on one hand, we want to stop misuse of data, and on the other, we don't want to make it so inconvenient to adapt to new techniques that the ecosystem starts to use overly-broad types (making the types meaningless). We hope that having to add something like "| PercentileAbundance" meets that middle-ground of specific, but simple to update.
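As a rough illustration of the guarding described above, here is a toy model in plain Python. It is not QIIME 2's type system: `ToyAction`, `accepts`, `widened`, and the action names are all hypothetical, and variants are modelled as plain strings.

```python
# Toy model of semantic-type guarding (NOT the QIIME 2 implementation).
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyAction:
    """A hypothetical action that accepts FeatureTable[<one of variants>]."""
    name: str
    variants: frozenset

    def accepts(self, variant: str) -> bool:
        return variant in self.variants

    def widened(self, new_variant: str) -> "ToyAction":
        # Similar in spirit to editing the annotation to
        # FeatureTable[RelativeFrequency | PercentileAbundance].
        return ToyAction(self.name, self.variants | {new_variant})


plot_pca = ToyAction("plot_pca", frozenset({"RelativeFrequency"}))
rarefy = ToyAction("rarefy", frozenset({"Frequency"}))

# Before any annotation is updated, the new type is rejected everywhere.
before = [a.name for a in (plot_pca, rarefy) if a.accepts("PercentileAbundance")]

# Only the action where the new type makes sense gets widened.
plot_pca = plot_pca.widened("PercentileAbundance")
after = [a.name for a in (plot_pca, rarefy) if a.accepts("PercentileAbundance")]

print(before, after)
```

The point of the sketch is that the change is a one-line edit per action, and actions nobody touches keep rejecting the new type by default.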

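If you kept the output as a generic FeatureTable[RelativeFrequency], you would lose the ability to discriminate between percentile-normalized tables and ordinary relative-frequency tables. Whether that is a drawback depends on whether the distinction is important. There is also another trick we can use depending on how PercentileAbundance relates to RelativeFrequency.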

If every instance of a PercentileAbundance can be considered a RelativeFrequency, but not every table that is RelativeFrequency can be used as PercentileAbundance (i.e. the set of tables that are PercentileAbundance is a strict subset of tables that are RelativeFrequency), then you can use this notation:

                                            # any string/name is fine
FeatureTable[RelativeFrequency] % Properties("percentile_normalized")

This would allow every method which accepts FeatureTable[RelativeFrequency] to use your output, while still making it possible to tell percentile-normalized tables apart from those which were not percentile normalized.
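To make the subset relationship concrete, here is a toy model in plain Python. It is only a sketch of the idea, not how QIIME 2 implements `Properties`: `ToyType` and `is_subtype_of` are hypothetical names, and the type expression is stored as a plain string.

```python
# Toy model of property-refined types (NOT the QIIME 2 implementation).
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyType:
    """A hypothetical type: a base name plus a set of property strings."""
    base: str
    properties: frozenset = frozenset()

    def is_subtype_of(self, required: "ToyType") -> bool:
        # Same base, and every property the action requires is present.
        return (self.base == required.base
                and required.properties <= self.properties)


plain = ToyType("FeatureTable[RelativeFrequency]")
percentile = ToyType("FeatureTable[RelativeFrequency]",
                     frozenset({"percentile_normalized"}))

# An action asking for the plain type accepts both tables.
assert plain.is_subtype_of(plain)
assert percentile.is_subtype_of(plain)

# An action asking for the refined type rejects the plain table.
assert percentile.is_subtype_of(percentile)
assert not plain.is_subtype_of(percentile)
```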

IMPORTANT: This feature is relatively unused at the moment, and the syntax is definitely subject to change. We haven't identified a lot of situations where this is necessary, but I imagine it's useful when you need to "tighten up" the ontology. Any feedback or discussion on this would be awesome!

The following is also possible, but it sounds like it's really something that applies to the table as a whole rather than the observations themselves, so the above makes more sense than this:

FeatureTable[RelativeFrequency % Properties("percentile_normalized")]

Let me know if that makes sense, and thanks for getting this discussion started!

cduvallet · February 2, 2018, 3:57pm

Yes, very helpful response, thanks @ebolyen!

After talking with my collaborator, I think we'll go ahead and make a new type because only a few downstream applications would, in fact, be appropriate for percentile-normalized data.

That said, it seems like users who make new output data types will need to go through the code for all the available core functions and plugins and add "| PercentileNormalized" to each relevant function's input, right? For now, this seems feasible - there aren't that many functionalities or plugins (and like I said, in our particular case there are actually only a very small number of downstream things you should do with percentile-normalized data). As qiime2 grows, do you foresee this becoming an unreasonable burden?

I'm not a software engineer so maybe this is a simpler problem than it seems to me, but I was wondering what your plans are for these kinds of edits downstream.

ebolyen · February 5, 2018, 9:38pm

That sounds perfect!

That definitely remains to be seen, but updating an annotation is certainly simpler than updating functional code. Like you mentioned, it wouldn't make sense for users to run many of the other downstream methods on this data. If QIIME 2 didn't require someone to change something, then users would have to keep track of all of this themselves (like in QIIME 1).

We're hoping developers are in the best position to know what makes sense as input, and we hope that getting happy users, provenance tracking, and lots of different interfaces are compelling enough reasons to justify the cost of dealing with these "ontology shifts/extensions" once in a while. ¯\_(ツ)_/¯

cduvallet · April 12, 2018, 12:09am

Okay, back to this! I'm planning to define a new data type for my plugin output. I think I've gotten most of the way there, but am stuck with an error when I try running the installed plugin.

To double-check my own work (and perhaps be useful to future developers), here's what I did:

In my plugin_setup.py script, I first set up my plugin:

import qiime2.plugin
from qiime2.plugin import SemanticType
from q2_types.feature_table import FeatureTable, BIOMV210DirFmt

import q2_perc_norm  # needed for q2_perc_norm.__version__ below

plugin = qiime2.plugin.Plugin(
    name='perc_norm',
    version=q2_perc_norm.__version__,
    short_description='Plugin for percentile-normalizing case-control data.',
...
)

Then I defined a new type which is a variant of the existing type FeatureTable:

PercentileNormalized = SemanticType('PercentileNormalized',
    variant_of=FeatureTable.field['content'])

Finally, I register this new type:

plugin.register_semantic_type_to_format(FeatureTable[PercentileNormalized],
    artifact_format=BIOMV210DirFmt)

This seems to work - when I install my plugin (python setup.py install), I don't get any errors.

However, when I try to run the plugin on some test data, I get the following error:

claire:~/github/q2-perc-norm/test_data$ qiime perc-norm percentile-normalize --i-table test_otu_table.qza --m-metadata-file test_metadata.txt --m-metadata-column DiseaseState --o-perc-norm-table test_out.percentile_qiime.percnorm_format.qza

Plugin error from perc-norm:

  Name 'PercentileNormalized' is not a defined QIIME type, a plugin may be needed to define it.

What else do I need to do in order to define this new type? Is there anything in base qiime2 that needs to be edited?

Are there other user-developed plugins that define new types that I could use as an example to go off of?

ebolyen · April 13, 2018, 2:53pm

Hi @cduvallet,

There's an extra registration you have to do on the semantic type component:

plugin.register_semantic_types(PercentileNormalized)
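One way to picture why both registrations are needed is the toy registry below. It assumes, purely for illustration, that type names and type-to-format mappings are tracked separately; `ToyPlugin` and its `check` method are hypothetical and do not reflect QIIME 2's actual internals.

```python
# Toy registry (an illustrative assumption, NOT QIIME 2's internals).
class ToyPlugin:
    def __init__(self):
        self.type_names = set()    # filled by register_semantic_types
        self.type_formats = {}     # filled by register_semantic_type_to_format

    def register_semantic_types(self, *names):
        self.type_names.update(names)

    def register_semantic_type_to_format(self, expression, artifact_format):
        self.type_formats[expression] = artifact_format

    def check(self, variant, expression):
        # Both registrations must be present for the type to be usable.
        if variant not in self.type_names:
            raise LookupError(f"Name {variant!r} is not a defined type")
        return self.type_formats[expression]


plugin = ToyPlugin()
plugin.register_semantic_type_to_format(
    "FeatureTable[PercentileNormalized]", "BIOMV210DirFmt")

try:
    plugin.check("PercentileNormalized", "FeatureTable[PercentileNormalized]")
    failed = False
except LookupError:
    failed = True  # the format mapping alone is not enough

# After the extra registration, the lookup succeeds.
plugin.register_semantic_types("PercentileNormalized")
fmt = plugin.check("PercentileNormalized", "FeatureTable[PercentileNormalized]")
print(failed, fmt)
```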

We used to have a few plugins that defined their own types, but those registrations have since moved into q2-types so they can be shared (without getting into an import loop, or making a very complicated import chain).

You can see an example of q2-types registering the FeatureTable stuff here.

cduvallet · April 13, 2018, 3:18pm

Aha, that did it! Looks like everything works now, thanks!

cduvallet · April 18, 2018, 6:19pm

Okay, so now that I've defined my new output type, how should I go about updating the downstream functions to accept FeatureTable[PercentileNormalized] data?

Should I go into the main QIIME 2 codebase and edit the inputs to relevant functions and then do a pull request to integrate the changes, or is there a different/better/preferable way?

ebolyen · April 22, 2018, 3:29pm

Hey @cduvallet,

Sorry for the delayed response on this.

I think these are the steps:

1. Figure out what functionality should work with the PercentileNormalized variant.
2. Move the registration/declaration of PercentileNormalized to q2-types so that every plugin in step 1 can import the type without introducing any inter-plugin dependency chains or cycles. (This is pretty much the only reason q2-types exists: it's just a shared vocabulary between the plugins that use it.)
3. Add FeatureTable[<whatever was there> | PercentileNormalized] to everything in step 1.

cduvallet · August 15, 2018, 5:48pm

Finally coming back to this, hopefully will be able to make these changes before the next release!

Just to check that I do things in a way that doesn't make your life more difficult: I should just go through each of the relevant q2 repos, fork them, make the type declaration edits, and do a PR? Is there a different/easier way than this?

ebolyen · August 15, 2018, 6:01pm

Nope, that sounds like the ideal workflow! Sorry it will be a bit of work on your end, but it will be exciting to see this functionality composed with the rest of the ecosystem!

Let's start with the q2-types registration and go from there.