I'm working with @vrbana on some metabolomics data. These data are floating point values, and it does not make sense to express them as integers. When using qiime diversity beta, we were getting tracebacks indicating that NaNs were in the resulting distance matrix. Nothing unusual stood out in the input FeatureTable (e.g., no empty samples, no empty observations, no duplicate IDs, no extreme values, no negatives, etc.). On closer inspection, we realized that q2_diversity.beta casts the FeatureTable data to int. This implicit cast truncates floating point values and, in the case of @vrbana's table, zeroed out three samples entirely, producing NaN on output.
A minimal reproducible example is below, using the skbio diversity method directly.
We thought it would make sense to raise this on the forum rather than file a bug report, since it's not clear that this is a bug: many of these diversity metrics were originally defined over count data. However, some of these metrics have been quite valuable on data that do not make sense to represent as counts. Given this, should QIIME2 support non-integer data for the diversity metrics?
In [21]: import numpy as np

In [22]: import skbio

In [23]: a = np.array([[0.1, 0.2, 3], [0.2, 0.3, 0.4], [0.5, 0.6, 0.7]])

In [24]: a.astype(int)
Out[24]:
array([[0, 0, 3],
       [0, 0, 0],
       [0, 0, 0]])

In [25]: skbio.diversity.beta_diversity('braycurtis', a, validate=False).data
Out[25]:
array([[ 0.        ,  0.66666667,  0.60784314],
       [ 0.66666667,  0.        ,  0.33333333],
       [ 0.60784314,  0.33333333,  0.        ]])

In [26]: skbio.diversity.beta_diversity('braycurtis', a.astype(int), validate=False).data
DistanceMatrixError                       Traceback (most recent call last)
in ()
----> 1 skbio.diversity.beta_diversity('braycurtis', a.astype(int), validate=False).data

~/miniconda3/envs/qiime2-2018.4/lib/python3.5/site-packages/skbio/diversity/_driver.py in beta_diversity(metric, counts, ids, validate, pairwise_func, **kwargs)
    372
    373     distances = pairwise_func(counts, metric=metric, **kwargs)
--> 374     return DistanceMatrix(distances, ids)

~/miniconda3/envs/qiime2-2018.4/lib/python3.5/site-packages/skbio/stats/distance/_base.py in __init__(self, data, ids)
    105             ids = tuple(ids)
    106
--> 107         self._validate(data, ids)
    108
    109         self._data = data

~/miniconda3/envs/qiime2-2018.4/lib/python3.5/site-packages/skbio/stats/distance/_base.py in _validate(self, data, ids)
    870         if (data.T != data).any():
    871             raise DistanceMatrixError(
--> 872                 "Data must be symmetric and cannot contain NaNs.")
    873
    874         if np.trace(data) != 0:

DistanceMatrixError: Data must be symmetric and cannot contain NaNs.
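
For anyone who hits the same traceback, here is a rough sketch of how one might check a table for this problem before computing distances. It uses only NumPy; the helper name zeroed_by_int_cast is hypothetical (it is not part of QIIME 2 or skbio), and the Bray-Curtis arithmetic at the end is written out by hand from the usual definition rather than taken from either library.

```python
import numpy as np


def zeroed_by_int_cast(table):
    """Flag samples (rows) that contain data as floats but become
    all zeros once the values are truncated to integers.

    Hypothetical helper for illustration only.
    """
    table = np.asarray(table, dtype=float)
    has_data = (table != 0).any(axis=1)
    has_data_after_cast = (table.astype(int) != 0).any(axis=1)
    return has_data & ~has_data_after_cast


# Same toy table as in the session above: rows are samples.
a = np.array([[0.1, 0.2, 3], [0.2, 0.3, 0.4], [0.5, 0.6, 0.7]])
mask = zeroed_by_int_cast(a)
print(mask)

# Bray-Curtis between two all-zero samples is 0 / 0, which is where
# the NaN in the distance matrix comes from.
u, v = a.astype(int)[1], a.astype(int)[2]
with np.errstate(invalid="ignore"):
    bc = np.float64(np.abs(u - v).sum()) / np.float64(np.abs(u + v).sum())
print(bc)
```

In the toy table, the second and third samples are the ones flagged: every value in them is below 1, so truncation leaves nothing behind, and any pairwise Bray-Curtis distance between two such samples has a zero denominator.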