I am attempting to follow along with the 'Differential Abundance Analysis with gneiss" tutorial and adapting for interpretation of my own data set. I am able to execute the steps to calculate balances, but when I reach the regression portion I input the following commands:
Each time I've run this step I receive the following error:
Plugin error from gneiss:
cannot convert float NaN to integer
Debug info has been saved to /tmp/qiime2-q2cli-err-mmy0s8cz.log.
Has anyone else encountered this issue? I am not sure where the NaN float values are being generated, so any advise on where to look would be greatly appreciated!
Hi @Seth - we need a few more pieces of information in order to help you!
What version of QIIME 2 are you using?
Can you please provide the complete error log? Either re-running with --verbose, or attaching the log file saved (e.g. /tmp/qiime2-q2cli-err-mmy0s8cz.log)
If possible, can you please provide your sample metadata file? This could be directly messaged to me, if you didn't want to share publicly on the forum.
Thanks for providing your metadata via a DM! I suspect this failure is related to missing values in the following columns:
Fraction, Reactor, and Growth_Phase_Verb (3 of the 4 columns you specified in --p-formula.
One trick for investigating your metadata is to use the metadata tabulate command, then you get a simple visualization of your metadata, which supports basic filtering and sorting (I used this to sort your columns to see if there were any blanks). Check out the Metadata Tutorial when you get a chance, it is a thrilling read!
Moving forward, you could either add some kind of BLANK sentinel (whatever value makes sense to you), or maybe it makes sense to remove those samples from your analysis? You will need to investigate to do what is "right" for your study.
@mortonjt, do you have anything to say about this? Thanks!
It's nice to hear from you both (particularly after reading many of your posts). I tried again to troubleshoot using your suggestions.
I was surprised to see those empty cells when I ran the tabulate command. When I look at the same tab delimited file in excel or notepad++ I see "NA" in these fields. Although I didn't find it in the documentation, I thought maybe NA is a problem, and sure enough replacing these with 'Blank' does fill the cells in the tabulate output.
After fixing the metadata file I re-ran the model build step with columns that seem to all be filled in the tabulate visualization. Unfortunately I still get a similar error as shown below.
/opt/conda/lib/python3.5/site-packages/gneiss-0.4.1-py3.5.egg/gneiss/regression/_ols.py:193: RuntimeWarning: invalid value encountered in true_divide
/opt/conda/lib/python3.5/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
return (self.a < x) & (x < self.b)
/opt/conda/lib/python3.5/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
return (self.a < x) & (x < self.b)
/opt/conda/lib/python3.5/site-packages/scipy/stats/_distn_infrastructure.py:1818: RuntimeWarning: invalid value encountered in less_equal
cond2 = cond0 & (x <= self.a)
/opt/conda/lib/python3.5/site-packages/gneiss-0.4.1-py3.5.egg/gneiss/regression/_ols.py:192: RuntimeWarning: invalid value encountered in sqrt
/opt/conda/lib/python3.5/site-packages/statsmodels/stats/multitest.py:320: RuntimeWarning: invalid value encountered in less_equal
reject = pvals_sorted <= ecdffactor*alpha
Traceback (most recent call last):
File "/opt/conda/lib/python3.5/site-packages/q2cli/commands.py", line 222, in __call__
results = action(**arguments)
File "<decorator-gen-151>", line 2, in ols_regression
File "/opt/conda/lib/python3.5/site-packages/qiime2/sdk/action.py", line 201, in callable_wrapper
output_types, provenance)
File "/opt/conda/lib/python3.5/site-packages/qiime2/sdk/action.py", line 392, in _callable_executor_
ret_val = callable(output_dir=temp_dir, **view_args)
File "/opt/conda/lib/python3.5/site-packages/q2_gneiss/regression/_regression.py", line 27, in ols_regression
ols_summary(output_dir, res, tree)
File "/opt/conda/lib/python3.5/site-packages/gneiss-0.4.1-py3.5.egg/gneiss/plot/_regression_plot.py", line 291, in ols_summary
hm_p = _heatmap_summary(model.pvalues.T, model.coefficients().T)
File "/opt/conda/lib/python3.5/site-packages/gneiss-0.4.1-py3.5.egg/gneiss/plot/_regression_plot.py", line 186, in _heatmap_summary
ind = int(np.floor((x - _min) / (_max - _min) * (N - 1)))
ValueError: cannot convert float NaN to integer
Plugin error from gneiss:
cannot convert float NaN to integer
See above for debug info.
One related question. The above code is using the full data set, but I had been trying to do this on a filtered subset of samples. I started by filtering my feature table, and thought this might be part of the problem when using an un-filtered metadata file. Should this approach work (assuming I can resolve the current issue)?
Also, what does the balances.qza table look like? Are there a whole bunch of infs? It may be possible that the add-pseudocount command wasn't run before hand, and a bunch of log(0) values were generated.
You can peak at the balances.qza using qiime tools export balances.qza --output-dir balances_dir
I just tried using only Fraction, and I see a very similar error response.
/opt/conda/lib/python3.5/site-packages/gneiss-0.4.1-py3.5.egg/gneiss/regression/_ols.py:193: RuntimeWarning: invalid value encountered in true_divide
/opt/conda/lib/python3.5/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
return (self.a < x) & (x < self.b)
/opt/conda/lib/python3.5/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
return (self.a < x) & (x < self.b)
/opt/conda/lib/python3.5/site-packages/scipy/stats/_distn_infrastructure.py:1818: RuntimeWarning: invalid value encountered in less_equal
cond2 = cond0 & (x <= self.a)
/opt/conda/lib/python3.5/site-packages/statsmodels/stats/multitest.py:320: RuntimeWarning: invalid value encountered in less_equal
reject = pvals_sorted <= ecdffactor*alpha
Traceback (most recent call last):
File "/opt/conda/lib/python3.5/site-packages/q2cli/commands.py", line 222, in __call__
results = action(**arguments)
File "<decorator-gen-151>", line 2, in ols_regression
File "/opt/conda/lib/python3.5/site-packages/qiime2/sdk/action.py", line 201, in callable_wrapper
output_types, provenance)
File "/opt/conda/lib/python3.5/site-packages/qiime2/sdk/action.py", line 392, in _callable_executor_
ret_val = callable(output_dir=temp_dir, **view_args)
File "/opt/conda/lib/python3.5/site-packages/q2_gneiss/regression/_regression.py", line 27, in ols_regression
ols_summary(output_dir, res, tree)
File "/opt/conda/lib/python3.5/site-packages/gneiss-0.4.1-py3.5.egg/gneiss/plot/_regression_plot.py", line 291, in ols_summary
hm_p = _heatmap_summary(model.pvalues.T, model.coefficients().T)
File "/opt/conda/lib/python3.5/site-packages/gneiss-0.4.1-py3.5.egg/gneiss/plot/_regression_plot.py", line 186, in _heatmap_summary
ind = int(np.floor((x - _min) / (_max - _min) * (N - 1)))
ValueError: cannot convert float NaN to integer
Plugin error from gneiss:
cannot convert float NaN to integer
See above for debug info.
When I output my balances.qza into a biom file I do see many 0.0 values, but searching reveals no infs. I did run add-pseudocount on my table at the very beginning, as show in your tutorial write up. Just to be sure, I exported my composition.qza table and found a great deal of 1.0 counts, so it seems that the step must have worked.
Ok - I'm not sure about what your table looks like, so I am not completely sure what is going on here. Is it possible to rename the columns / rows and upload that so we can investigate further?
A couple of more questions
Are there columns (i.e. balances) in the table that sum to zero?
What does the distribution of Fraction look like?
But we have seen some edge cases where excessive number of low abundance species have caused issues with regression before. I haven't personally seen it with OLS before, but it could also be worthwhile to perform filtering to remove some of the low abundance organisms.
The sample names are coded, so I would be fairly comfortable sharing the tables. The balances.qza file is too large to upload in the comment. Is that unusual? I've gone ahead and shared on dropbox (Dropbox).
I summed all of the columns from my balances table, and did not find any zeros there either.
The fraction parameter is a binary, and the samples are distributed nearly evenly between the two categories.
@Seth. Awesome -- I think I have the problem figured out.
Turns out that you have a ton of balances with zero variance.
>>> import qiime2
>>> art = qiime2.Artifact.load('balances.qza')
>>> import pandas as pd
>>> balances = art.view(pd.DataFrame)
>>> np.sum(balances.var(axis=0)==0)
191
Zero variance balances will break any statistical tests you run. To demonstrate, I picked 'y3416', and generated some random covariate x
>>> from statsmodels.regression.linear_model import OLS
>>> y = balances['y3416']
>>> x = np.arange(len(y))
>>> model = OLS(y, x)
>>> model.fit()
>>> r = model.fit()
>>> r.summary()
OLS Regression Results
==============================================================================
Dep. Variable: y3416 R-squared: nan
Model: OLS Adj. R-squared: nan
Method: Least Squares F-statistic: nan
Date: Tue, 03 Oct 2017 Prob (F-statistic): nan
Time: 20:29:44 Log-Likelihood: inf
No. Observations: 410 AIC: -inf
Df Residuals: 409 BIC: -inf
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 0 0 nan nan 0 0
==============================================================================
Omnibus: 1.013 Durbin-Watson: nan
Prob(Omnibus): 0.602 Jarque-Bera (JB): 153.750
Skew: 0.000 Prob(JB): 4.11e-34
Kurtosis: 0.000 Cond. No. 1.00
==============================================================================
What is probably happening is that you are including a ton of features that are just not observed at all, but are still included in the table. We can run the pipeline and compute balances, but the regression results will be nonsensible - since we can't actually compute statistics on organisms that are never present.
Filtering out all of these garbage features should be the solution to this problem -- features should be seen at least once in your study. Be sure to checkout out the tutorial on how to do this!
Hooray, you're a hero Jamie! This was the issue, and now it seems quite clear. I went ahead a ran the data through the remainder of your tutorial steps after filtering the low frequency features out of the input table. The results are quite interesting, and I'm excited to use this pipeline to test some of the questions we are working on here.
This all occurred because I was working from a table of samples filtered from the full data set, leaving behind features only found in some of the unused samples. I am trying to decide if working on the balances of a sub-set of samples is valid, or if I should handle these samples separately all the way from the start of the analysis.
Anyway, thanks again for all your help. I'm sure I will be back once I've broken your algorithms again!
Just to add, the generated feature-table.biom can be observed using:
biom summarize-table -i balances_dir/feature-table.biom -o balances_dir/feature-table_summary.txt