Using non-integer data with beta diversity

I'm working with @vrbana on some metabolomics data. These data are floating point values, and it does not make sense to express them as integers. When using qiime diversity beta, we were getting tracebacks indicating that NaNs were in the resulting distance matrix. Nothing unusual stood out in the input FeatureTable (e.g., no empty samples, no empty observations, no duplicate IDs, no extreme values, no negatives, etc.). On closer inspection, we realized that q2_diversity.beta casts the FeatureTable data to int. This implicit cast truncates floating point values and, in the case of @vrbana's table, zeroed out three samples entirely, producing NaN on output.

A minimal reproducible example is below, using the skbio diversity method directly.

We thought it would make sense to open this up on the forum instead of filing a bug report, because it's not clear whether this is a bug: many of these diversity metrics were originally defined over count data. However, some of these metrics have been quite valuable on data that do not make sense to represent as counts. Given this, should QIIME2 support non-integer data for the diversity metrics?

In [21]: import numpy as np

In [22]: import skbio

In [23]: a = np.array([[0.1, 0.2, 3], [0.2, 0.3, 0.4], [0.5, 0.6, 0.7]])

In [24]: a.astype(int)
Out[24]:
array([[0, 0, 3],
       [0, 0, 0],
       [0, 0, 0]])

In [25]: skbio.diversity.beta_diversity('braycurtis', a, validate=False).data
Out[25]:
array([[ 0.        ,  0.66666667,  0.60784314],
       [ 0.66666667,  0.        ,  0.33333333],
       [ 0.60784314,  0.33333333,  0.        ]])

In [26]: skbio.diversity.beta_diversity('braycurtis', a.astype(int), validate=False).data

DistanceMatrixError Traceback (most recent call last)
in ()
----> 1 skbio.diversity.beta_diversity('braycurtis', a.astype(int), validate=False).data

~/miniconda3/envs/qiime2-2018.4/lib/python3.5/site-packages/skbio/diversity/_driver.py in beta_diversity(metric, counts, ids, validate, pairwise_func, **kwargs)
372
373 distances = pairwise_func(counts, metric=metric, **kwargs)
--> 374 return DistanceMatrix(distances, ids)

~/miniconda3/envs/qiime2-2018.4/lib/python3.5/site-packages/skbio/stats/distance/_base.py in __init__(self, data, ids)
105 ids = tuple(ids)
106
--> 107 self._validate(data, ids)
108
109 self._data = data

~/miniconda3/envs/qiime2-2018.4/lib/python3.5/site-packages/skbio/stats/distance/_base.py in _validate(self, data, ids)
870 if (data.T != data).any():
871 raise DistanceMatrixError(
--> 872 "Data must be symmetric and cannot contain NaNs.")
873
874 if np.trace(data) != 0:

DistanceMatrixError: Data must be symmetric and cannot contain NaNs.
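
For anyone following along, here is a minimal sketch of why the truncated table ends up with NaN. It is plain Python, not scikit-bio's implementation, and `bray_curtis` is a hypothetical helper written only for illustration. Bray-Curtis divides the summed absolute differences by the summed totals, so two samples that are both all zeros after the cast give 0/0.

```python
import math

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum(|u_i - v_i|) / sum(u_i + v_i)."""
    numerator = sum(abs(x - y) for x, y in zip(u, v))
    denominator = sum(x + y for x, y in zip(u, v))
    # NumPy evaluates 0.0 / 0.0 as nan; plain Python raises ZeroDivisionError,
    # so return nan explicitly to mirror the NaN reported in the traceback.
    if denominator == 0:
        return math.nan
    return numerator / denominator

a = [[0.1, 0.2, 3], [0.2, 0.3, 0.4], [0.5, 0.6, 0.7]]
# int() truncates toward zero, the same effect as a.astype(int) on these values
truncated = [[int(x) for x in row] for row in a]

float_d12 = bray_curtis(a[1], a[2])                  # well defined on the floats
int_d01 = bray_curtis(truncated[0], truncated[1])    # one all-zero sample: still defined
int_d12 = bray_curtis(truncated[1], truncated[2])    # two all-zero samples: 0/0
print(float_d12, int_d01, int_d12)
```

Only the pair of samples that are both zeroed out is undefined; a single zeroed sample compared against a non-empty one still gives a finite distance.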

Hey @wasade and @vrbana!

Gross!

I think so! But I also think methods like beta_diversity are perhaps trying to accomplish a little too much to handle every possible case. The trouble is really that the input's type depends on your metric.

For my two cents, I think what we might want to see is a method called braycurtis that accepts something like FeatureTable[Frequency | Continuous].

beta_diversity could stick around expecting Frequency, but you'd be able to do more specific things with the more specific methods.

This would also work well for qualitative metrics which could accept PresenceAbsence tables.

We may want to at least warn here. We could probably test within the code and warn/comment? Something like:

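    import warnings
    import numpy as np

    # Spot-check the first 100 stored values of the sparse table
    test_values = table.matrix_data.data[:100]
    # astype(int) truncates, so any fractional value fails this comparison
    safe_as_count = np.allclose(test_values, test_values.astype(int))
    if not safe_as_count:
        warnings.warn("Your data do not appear to be counts")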
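
A self-contained variant of the check above, as a sketch only: `looks_like_counts`, `warn_if_not_counts`, and the `sample_size` parameter are hypothetical names, and a plain list stands in for `table.matrix_data.data`. The tolerances are chosen to roughly resemble the `np.allclose` defaults. One design point it illustrates: comparing against the truncated value (what `astype(int)` does) would reject something like 2.9999999, whereas comparing against the rounded value tolerates small floating point noise.

```python
import math
import warnings

def looks_like_counts(values, sample_size=100, rel_tol=1e-5, abs_tol=1e-8):
    """Return True if the first `sample_size` values are all (nearly) whole numbers."""
    sample = values[:sample_size]
    # round() rather than int(): int() truncates, so 2.9999999 would become 2
    return all(
        math.isclose(v, round(v), rel_tol=rel_tol, abs_tol=abs_tol)
        for v in sample
    )

def warn_if_not_counts(values):
    """Warn, but do not fail, when the sampled values are not count-like."""
    safe_as_count = looks_like_counts(values)
    if not safe_as_count:
        warnings.warn("Your data do not appear to be counts")
    return safe_as_count

count_like = looks_like_counts([1.0, 4.0, 2.9999999])  # near-integers pass
float_like = looks_like_counts([0.1, 0.2, 3.0])        # fractional values fail
print(count_like, float_like)
```

Because only a sample of the values is inspected, a table whose first 100 stored values happen to be whole numbers would still pass, so this is a cheap heuristic rather than a guarantee.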

I think this makes sense and completely agree w/ PresenceAbsence too.

Going a little further though, to run floating point values through beta_diversity right now, we need to pass validate=False; otherwise, we trigger an exception in _validate_counts_vector.

@ebolyen, let me know if you'd like this to shift to a GitHub issue. I think we need to discuss this within q2-diversity as well as on scikit-bio's tracker.

Best,
Daniel

Maybe, but it sounds like your artifact was the wrong semantic type to start with (i.e., it wasn't really FeatureTable[Frequency], not that there's a better option yet), so on some level the fact that it acts so poorly isn't exactly its fault. Right now I think our assumption is that the semantic type acts as the guard for a method. This is certainly true for the primitive types.

It sounds like scikit-bio may need some updates then. Only sort of related, @thermokarst and I were thinking about dissecting the dependency version caps to see if we can get the entire stack back up to latest versions of everything, so we could probably coordinate those changes with this and get a reasonable release out of it!

Yup, I am good with that, although this thread is also probably useful to coordinate between the two trackers.

Having more concrete diversity metric methods also ties into providing better citations for those things, so I think we have the potential to kill a few birds with one stone.
