I’m raising this question to start the discussion about future directions regarding semantic typing.
Part of this stems from inconsistencies that have arisen in the FeatureData[Differential] type used by Songbird + Aldex2. The other part comes from current limitations of the types, which make it difficult to add new features (e.g. hypothesis tests for differential abundance or Bayesian credible intervals).
To start with the current inconsistencies between the Songbird and Aldex2 outputs: Aldex2 outputs a p-value for each microbe (in their case using a robust estimate of the mean as the reference). Songbird doesn’t output p-values, deferring those calculations to Qurro. This by itself causes issues with the FeatureData[Differential] type. Aldex2 can handle multiple categories, and will essentially end up with a table of log-fold changes and p-values for each microbe (@dgiguer, feel free to jump in), so a table of d rows (for d microbes) with at least 2 columns: log-fold changes and p-values for a single covariate. Songbird can handle multiple covariates with an additive structure, so it can handle blocked designs, continuous variables, and so on, leading to a table of d rows (for d microbes) by k covariates (e.g. sex, age, BMI, body_site, environment, whatever…) of log-fold changes.
These outputs are very similar, but are currently incompatible in the FeatureData[Differential] type, mainly because this data type cannot handle higher-order tensors. A more elegant solution would be a data type that can store (microbes) x (covariates) x (sample-statistics) in a 3D tensor.
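To make the shape concrete, here is a minimal NumPy sketch of such a 3D layout. The microbe, covariate and statistic labels and all values are made up for illustration, and `lookup` is a hypothetical helper, not something Songbird, Aldex2 or qiime2 provides.

```python
import numpy as np

# Hypothetical axis labels; none of these come from real Songbird or Aldex2 output.
microbes = ["OTU_1", "OTU_2", "OTU_3", "OTU_4"]
covariates = ["sex", "age", "BMI"]
statistics = ["log_fold_change", "p_value"]

# One dense array of shape (microbes) x (covariates) x (statistics),
# initialised to NaN so that missing statistics are explicit.
tensor = np.full((len(microbes), len(covariates), len(statistics)), np.nan)


def lookup(microbe, covariate, statistic):
    """Return the (i, j, k) position of a labelled entry."""
    return (
        microbes.index(microbe),
        covariates.index(covariate),
        statistics.index(statistic),
    )


# A tool that reports both statistics fills both slices ...
tensor[lookup("OTU_1", "sex", "log_fold_change")] = 1.5
tensor[lookup("OTU_1", "sex", "p_value")] = 0.01
# ... while a tool that only reports log-fold changes leaves p-values as NaN.
tensor[lookup("OTU_2", "age", "log_fold_change")] = -0.3

# Slicing out one statistic recovers the familiar d x k table.
lfc_table = tensor[:, :, statistics.index("log_fold_change")]
```

The point of the sketch is only that both kinds of output fit in the same container: a tool without p-values simply leaves that slice empty.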
This would also handle the emerging use case of Bayesian statistics, where rather than generating a single statistic, the uncertainty is measured from multiple Monte Carlo samples. The data would then have the shape (microbes) x (covariates) x (Monte Carlo samples). As you can imagine, this would open up many more possibilities for producing these data types so that they can be readily consumed by other statistical aggregators or visualizers. They can also be very expensive: Monte Carlo samples can easily eat up gigabytes of memory across dozens of samples.
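As a rough sketch of how a downstream aggregator might consume such a tensor, the snippet below collapses the Monte Carlo axis into a posterior mean and a 95% interval. The draws are simulated with NumPy purely as a stand-in; the sizes and the choice of a percentile-based interval are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, n_draws = 5, 3, 1000  # microbes, covariates, Monte Carlo samples

# Simulated stand-in for posterior draws of log-fold changes.
draws = rng.normal(loc=0.0, scale=1.0, size=(d, k, n_draws))

# Collapse the Monte Carlo axis into summary statistics.
posterior_mean = draws.mean(axis=-1)                       # shape (d, k)
lower, upper = np.percentile(draws, [2.5, 97.5], axis=-1)  # each of shape (d, k)

# Stack the summaries back into (microbes) x (covariates) x (statistics).
summary = np.stack([posterior_mean, lower, upper], axis=-1)
```

Note that the summary has the same three-axis layout as a table of log-fold changes and p-values, so one container could serve both the raw draws and the statistics derived from them.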
And things can, and will, get more complicated once you start throwing in microbe-microbe correlations, multi-omics and time series. Time series methods will require at least 4D tensors with (microbes) x (covariates) x (MC samples) x (time points), and microbe-microbe interactions may need 5D tensors with (microbes) x (microbes) x (covariates) x (MC samples) x (time points).
The existing qiime2 interface has provided an incredible backbone for microbial ecology. However, as more advanced methods get developed, we will need to start thinking about how to adapt the types accordingly.
I think a good topic to discuss is introducing the concept of a FeatureTensor, which can be subclassed similarly to the existing FeatureData. But I think this will require careful discussion, since there are a number of possibilities to consider when representing these FeatureTensor types in memory or on disk, and tensors can become large very quickly. Namely, what are reasonable ways to store dense / sparse tensors (e.g. pytables, sparse coo format, feather, xarray or zarr)?
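To give a feel for the trade-off, here is a small NumPy-only sketch that estimates the dense footprint of a tensor and shows a coordinate (COO-style) layout for a mostly-zero one. The dimension sizes are hypothetical, `dense_nbytes` is a helper written for this example, and the COO layout is hand-rolled rather than the format of any particular library.

```python
import numpy as np


def dense_nbytes(shape, dtype=np.float64):
    """Bytes needed to hold a dense array of the given shape and dtype."""
    return int(np.prod(shape, dtype=np.int64)) * np.dtype(dtype).itemsize


# Hypothetical sizes: 1,000 microbes, 10 covariates, 1,000 MC samples, 20 time points.
shape_3d = (1000, 10, 1000)
shape_4d = (1000, 10, 1000, 20)
print(dense_nbytes(shape_3d) / 1e9, "GB")
print(dense_nbytes(shape_4d) / 1e9, "GB")

# A small, mostly-zero tensor stored in a COO-style layout.
dense = np.zeros((4, 3, 2))
dense[0, 1, 0] = 2.0
dense[3, 2, 1] = -1.0

coords = np.argwhere(dense != 0)   # (nnz, ndim) integer coordinates
values = dense[tuple(coords.T)]    # (nnz,) nonzero values

# Rebuild the dense tensor from coordinates + values.
restored = np.zeros(dense.shape)
restored[tuple(coords.T)] = values
```

Posterior draws are typically dense, so the sparse layout would mainly help with things like sparse microbe-microbe interactions; which format fits which subclass is exactly the kind of thing worth discussing.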
Incorporating statistical / ML methodologies has been difficult in qiime2, largely because it has traditionally required implementing new data types for each specific use case. Solidifying a core data type that can largely encapsulate these outstanding needs is critical to extending this ecosystem, so that statistical outputs produced by computationally intensive methods can be readily consumed by other plugins.
Thoughts? @ebolyen, @thermokarst, @Nicholas_Bokulich, @fedarko, @cmartino, @wasade, @yoshiki, @dgiguer, @jwdebelius, @Mehrbod_Estaki, @SoilRotifer, @gibsramen, @gwarmstrong?