Thoughts on Differential and Tensor types?

I’m raising this question to start the discussion about future directions regarding semantic typing.

Part of this stems from inconsistencies that have arisen from the FeatureData[Differential] type used by Songbird + Aldex2. The other part comes from the current limitations of the types, which make it difficult to add new features (e.g. hypothesis tests for differential abundance or Bayesian credible intervals).

To start with the current inconsistencies between the Songbird / Aldex2 outputs: Aldex2 outputs a p-value for each microbe (in their case using a robust estimate of the mean as the reference). Songbird doesn’t output p-values, deferring those calculations to Qurro. This by itself causes issues with the FeatureData[Differential] type. Aldex2 can handle multiple categories and will essentially end up with a table of log-fold changes and p-values for each microbe (@dgiguer, feel free to jump in) - so a table of d rows (for d microbes) with at least 2 columns: log-fold changes and p-values for a single covariate. Songbird can handle multiple covariates with an additive structure, so it can handle blocked designs, continuous variables, … - leading to a table of d rows (for d microbes) by k covariates (e.g. sex, age, BMI, body_site, environment, whatever…) of log-fold changes.

These outputs are very similar, but are currently incompatible within the FeatureData[Differential] type – mainly because this data type cannot handle higher-order tensors. A more elegant solution would be a data type that can store (microbes) x (covariates) x (sample-statistics) in a 3D tensor.
This would also handle the emerging use case with Bayesian statistics, where rather than a single statistic being generated, the uncertainty can be measured from multiple Monte Carlo samples. So it would be in the shape of (microbes) x (covariates) x (Monte Carlo samples). As you can imagine, this would open up many more possibilities for producing these data types to be readily consumed by other statistical aggregators or visualizers (plus they can be very expensive - Monte Carlo samples can easily eat up gigabytes of memory across dozens of samples).
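To make that concrete, here is a minimal sketch (all names and dimension sizes are made up for illustration) of how such a 3D differential tensor could be represented with xarray's named dimensions:

```python
import numpy as np
import xarray as xr

# Hypothetical sizes: 500 microbes, 3 covariates, 1000 Monte Carlo draws
rng = np.random.default_rng(42)
differentials = xr.DataArray(
    rng.normal(size=(500, 3, 1000)),
    dims=("feature", "covariate", "mc_sample"),
    coords={
        "feature": [f"OTU_{i}" for i in range(500)],
        "covariate": ["sex", "age", "body_site"],
        "mc_sample": np.arange(1000),
    },
    name="log_fold_change",
)

# Summaries collapse cleanly along the Monte Carlo dimension, e.g. posterior
# means and 95% credible intervals per (feature, covariate) pair
posterior_mean = differentials.mean(dim="mc_sample")
credible_interval = differentials.quantile([0.025, 0.975], dim="mc_sample")
```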

And things can, and will, get more complicated once you start throwing in microbe-microbe correlations, multi-omics, and time series - time series methods will require 4D tensors with (microbes) x (covariates) x (MC samples) x (time points) at least, and possibly 5D tensors for microbe-microbe interactions with (microbes) x (microbes) x (covariates) x (MC samples) x (time points).

The existing qiime2 interface has provided an incredible backbone for microbial ecology. However, as more advanced methods get developed, we will need to start thinking about how to adapt the types accordingly.

I think a good topic to consider discussing is introducing the concept of a FeatureTensor, which can be subclassed similarly to the existing FeatureData. But I think this will require careful discussion, since there are a number of possibilities to consider when representing these FeatureTensor types in memory or on disk – tensors can become large very quickly. Namely, what are reasonable ways to store dense / sparse tensors (e.g. PyTables, sparse COO format, Feather, xarray, or Zarr)?
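As one possible point of comparison (not a proposal for the actual format), here is a rough sketch of what dense and sparse on-disk representations could look like with Zarr and sparse COO; the store name and shapes are invented:

```python
import numpy as np
import sparse
import xarray as xr

# Dense case: a chunked Zarr store keeps any single chunk small enough to stream
# (requires the dask and zarr packages to be installed)
dense = xr.DataArray(
    np.zeros((500, 3, 1000)),
    dims=("feature", "covariate", "mc_sample"),
    name="log_fold_change",
)
dense.chunk({"mc_sample": 100}).to_dataset().to_zarr("differentials.zarr", mode="w")

# Sparse case: COO only stores the coordinates and values of non-zero entries
coords = np.array([[0, 1, 2],      # feature indices
                   [0, 0, 1],      # covariate indices
                   [10, 20, 30]])  # mc_sample indices
data = np.array([1.5, -0.7, 2.3])
sparse_tensor = sparse.COO(coords, data, shape=(500, 3, 1000))
```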

Incorporating statistical / ML methodologies has been difficult in qiime2, largely because it has traditionally required implementing new data types for each specific use case. Solidifying a core data type that can largely encapsulate these outstanding needs is critical for extending the ecosystem, ensuring that statistical outputs can be produced by computationally intensive methods and readily consumed by other plugins.

Thoughts? @ebolyen, @thermokarst, @Nicholas_Bokulich, @fedarko, @cmartino, @wasade, @yoshiki, @dgiguer, @jwdebelius, @Mehrbod_Estaki, @SoilRotifer, @gibsramen, @gwarmstrong?

9 Likes

Of the solutions here, I like transforms into xarray. I’ve had some issues with Feather stability in the past - when I was using it pretty heavily, there were warnings not to rely on it for permanent storage.

I think my main bias between xarray and pytables is that xarray has dask integration built in. Stan doesn’t play nicely with dask, but it might be worth considering for future parallel processing. I’ve found it more intuitive than other parallel processing libraries I’ve tried, and so that integration is attractive to me.
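For illustration, a minimal sketch of the kind of dask-backed workflow that integration enables (the Zarr store and variable name here are just assumptions carried over from the examples above):

```python
import xarray as xr

# Opening with chunks= gives dask-backed lazy arrays; nothing is loaded until compute()
ds = xr.open_zarr("differentials.zarr", chunks={"mc_sample": 100})

# Reductions build a task graph that dask can execute in parallel
posterior_mean = ds["log_fold_change"].mean(dim="mc_sample").compute()
```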

I think the one challenge with all of these is finding a way to make sure people who want to get at a text file or linked excel file can do that in some way. I can guarantee that people will want to make their own plots or just see their data and I think it’s worth planning that specification into the data.

edit: One current dask complaint: I was trying to run Stan and Dask at the same time and the jobs did not seem to get along well. Stan has its own internal system of parallelization, so there might need to be more care with the environment. Or it could just be that you shouldn’t try to run a bunch of computationally intensive tasks on your laptop while on a zoom video call for four hours :woman_shrugging:

Best,
Justine

5 Likes

Just a comment: the idea of 4D or even 5D tensors reminds me a bit of what the developers of the q2-micom plugin wrote in their tutorial: https://github.com/micom-dev/q2-micom/blob/master/docs/README.md

“MICOM models all biochemical reactions in all taxa, which means that the optimization problem MICOM solves included hundreds of thousands of variables. There are only a few numerical solvers that can solve quadratic programming problems of that scale. Right now, we support CPLEX or Gurobi”

I’m definitely not deep enough into all the statistics of Songbird and Aldex2, or differential and tensor types etc. so simply ignore my comment if numerical solvers won’t be of any help here :wink:

5 Likes

Good point. @cdiener thoughts?

1 Like

I think solving large, poorly conditioned numerical systems is out of scope for Qiime, but some of the output types from q2-micom are indeed kind of tensors (partially sparse, though).

For instance, the fluxes are (sample) x (taxon) x (flux for each reaction). Currently, this is stored in a long DataFrame (each row is sample - taxon - reaction - flux) and saved as CSV, which is definitely not optimal in terms of storage. This is the reason the plugin currently only outputs transport fluxes and not all of them. This would become even worse with the intervention simulations (not yet in the plugin), where you get (sample) x (taxon) x (modification) x (flux for each reaction). So I would use those types if available.
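To illustrate the difference, a rough sketch (the column names and values are assumptions based on the description above) of moving the long-format fluxes into a labeled multi-dimensional array:

```python
import pandas as pd
import xarray as xr

# Long format: one row per (sample, taxon, reaction) combination, as in the CSV output
fluxes_long = pd.DataFrame({
    "sample": ["S1", "S1", "S2", "S2"],
    "taxon": ["Bacteroides", "Prevotella", "Bacteroides", "Prevotella"],
    "reaction": ["EX_glc", "EX_glc", "EX_glc", "EX_glc"],
    "flux": [0.1, -0.3, 0.2, -0.1],
})

# Pivot to a (sample) x (taxon) x (reaction) array; missing combinations become NaN
fluxes = (
    fluxes_long
    .set_index(["sample", "taxon", "reaction"])["flux"]
    .to_xarray()
)
```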

Additional challenges may lie in finding a good abstraction that works for most use cases and finding a file format that is compatible with the Qiime requirement that it “is accessible long-term (e.g. 20 years)” since that is not the case for most serialized formats or HDF5.

2 Likes

Good discussion @mortonjt! In Gemelli, we represent the N-dimensional tensors as numpy arrays, do all of the tensor operations in memory, and write only the results, which are matrices. Sparsity could certainly be an area of improvement in this use case - but folding and unfolding are hard enough with dense arrays.
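For reference, a small sketch of the kind of unfolding (mode-n matricization) being described, in plain numpy (the shapes are made up):

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization: move `mode` to the front, then flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def fold(matrix, mode, shape):
    """Inverse of `unfold` for a tensor with the given original shape."""
    full_shape = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(matrix.reshape(full_shape), 0, mode)

tensor = np.arange(2 * 3 * 4).reshape(2, 3, 4)
unfolded = unfold(tensor, mode=1)  # shape (3, 8)
assert np.array_equal(fold(unfolded, 1, tensor.shape), tensor)
```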

2 Likes

I like @jwdebelius’ idea of using xarray, though admittedly I’m more familiar with it than with the others mentioned. One of my thoughts is that dimension order could be a concern here - for 2D data structures it’s easy and often intuitive to enforce a standard (sample x feature for FeatureTable, feature x covariate for FeatureData[Differential], etc.), but when we start introducing more dimensions I imagine this could cause some frustration when saving and loading FeatureTensor objects.

At the abstract level, I could foresee a headache figuring out which dimension is the feature dimension in some N-dimensional tensor if the dimensions aren't named, as they are in xarray. Named dimensions could alleviate some of these issues.
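For example, with named dimensions downstream code never has to know (or guess) the axis order; a minimal sketch with illustrative names:

```python
import numpy as np
import xarray as xr

tensor = xr.DataArray(
    np.zeros((500, 3, 1000)),
    dims=("feature", "covariate", "mc_sample"),
    coords={"covariate": ["sex", "age", "body_site"]},
)

# Both operations work the same regardless of how the axes happen to be ordered
per_feature = tensor.mean(dim=["covariate", "mc_sample"])
sex_effect = tensor.sel(covariate="sex")
```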

3 Likes

@gibsramen, good point – perhaps for the FeatureData[Differential] type we should have some minimal enforcement of what the dimension names should be, similar to how sampleid, sample-id, featureid, and feature-id are handled in the qiime metadata / feature metadata, or how BIOM has the sample and observation names.
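As a rough sketch of what that minimal enforcement could look like (the required names and error message are just placeholders):

```python
REQUIRED_DIMS = {"feature", "covariate"}  # hypothetical naming convention

def validate_dims(tensor):
    """Check that a differential tensor declares the expected dimension names."""
    missing = REQUIRED_DIMS - set(tensor.dims)
    if missing:
        raise ValueError(f"Differential tensor is missing required dimensions: {sorted(missing)}")
```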

1 Like