Different results for linear regression based on conda environment

jdjame · June 22, 2020, 7:02pm

Summary:

I have noticed different results of the parameters I get as output when running simple linear regression (with ordinary least squares) on the qiime conda environment and other conda environments. I am curious to know if anyone can give me a reason for this difference

Specifics:

The two environments used were a qiime2-2020.2 environment (python 3.6.7, statsmodels==0.11.1, scikit-learn==0.22.1, and numpy===1.18.1) and a standard base conda environment (python 3.7.4, statsmodels==0.11.1, scikit-learn==0.21.3, and numpy===1.17.2). My issue is that in the qiime environment, both statsmodels.api.OLS and sklearn.linear_model.LinearRegression give different values when fitted on the same input.

Sample code

The following is some code that I use to observe the differenced in both. Here, x.csv is a csv of the design matrix used to fit and and y.csv is a csv of the response variable. The aim is for the code to fit 14 parameters (using ordinary least squares) on 303 samples.

import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd

x= pd.read_csv("x.csv", delimiter=',',encoding='utf-8',index_col=0,low_memory=False)
y=pd.read_csv("y.csv", delimiter=',',encoding='utf-8',index_col=0,low_memory=False)

    
model = sm.OLS(y,x)
res = model.fit()
print(res.params)

model=LinearRegression(fit_intercept=False).fit(x,y)
print(model.coef_)

print((abs(res.params -model.coef_[0]).sum()))

Results

So right off the bat, we get different results when running this script the qiime environment and the base environment. The difference in the two is quite significant (for some parameters, the solutions differ by up to 27 orders of magnitude). When I looked into this further, it seems that all other environments I have run this script on have the same output as the base environment. Furthermore, within these agreeing environments, the sum of the absolute differences between the parameters estimated through sklearn and statsmodels is never greater that 1e-14 which is expected. However, in the qiime environment, this sum of differences is ~0.60.

Question

I would welcome any explanation for the discrepancy between the qiime environment and the other conda environments. I am curious to know if this is an effect of the python version I am using or something completely different.

thermokarst · June 22, 2020, 8:26pm

@jdjame - before diving too deeply into this, could I ask for a favor? Can you check to see if this is still an issue in the latest QIIME 2 staging environment?

https://dev.qiime2.org/latest/quickstart/

We are getting ready to cut a new release this week, your feedback would be really helpful.

Thanks! :qiime2:

jdjame · June 22, 2020, 8:49pm

@thermokarst Thanks for your quick reply! I just set up the latest QIIME 2 environment using the link you attached and got the same results.

thermokarst · June 22, 2020, 8:53pm

Thanks for checking! Okay, I just want to get this straight: in your testing, are the results produced by statsmodels changing (between environments), or just the scikit-learn results? Or both? It looks to me like you're using the same version of statsmodels in all of your tests, so if the results were unchanged there, I wouldn't be surprised. Similarly, all three of these envs have different versions of scikit-learn - differences in those results wouldn't surprise me, either.

jdjame · June 22, 2020, 11:04pm

The both results produced by statsmodels and scikit-learn differ as I change environments. Meaning that the results from statsmodels output in qiime differs from the results from statsmodels in the other environments ( as much as 1e27 times greater in the qiime environment). The same holds for scikit-learn.

In the non-qiime environment, the output of statsmodels and scikit-learn differs by about 2e-15 which is expected. In the qiime environment however, the output differs by about 0.60 which is much more consequential.

I have tested with different non-qiime environments and different versions of statsmodels, numpy and scikit-learn. In all the non-qiime environments, the parameters outputted are fairly similar (same 2e-15 difference ).

I hope this clears up a bit of the confusion.

thermokarst · June 22, 2020, 11:21pm

Thanks @jdjame, that clears it up a little bit, although maybe it would help if you prepared a table or something, illustrating the differences observed, and comparing the different versions of packages/deps in each environment (for example, you haven't listed what version of pandas you're using, but that is a direct dep for statsmodels (I think)).

Either way, I think maybe you should file an issue with the developers of statsmodels:

We aren't directly involved in the development of statsmodels, so we probably aren't the right people to ask, but we will lend a hand if you have something more specific you think we can comment on or address.

Otherwise, all the dependencies specified in a typical QIIME 2 environment are all from anaconda.org. We use 4 channels: defaults, bioconda, conda-forge, and qiime2 - the qiime2 channel just hosts plugins, no third-party dependencies. If you're experiencing an issue, it might be in the source of statsmodels (or skl), or perhaps related to the packaging and distribution (on defaults/bioconda/conda-forge).

Hope that helps - keep us posted!

jdjame · June 23, 2020, 2:23pm

Here is the table:

package ↓ \ env→	qiime	base
python	3.7.4	3.6.7
pandas	0.25.1	0.25.3
scikit-learn	0.22.1	0.21.3
statsmodels	0.11.1	0.11.1
numpy	1.17.2	1.18.1
difference between statsmodels estimate and scikit-lean estimates	.60	2e-15

Please note that here, the base environment's estimates are consistent with other non-qiime environment estimates (same values in different environments with about 2e-14 difference) whereas the qiime estimates are several orders of magnitude greater. I will file an issue with the statsmodels devs and get their input. Thanks!

thermokarst · June 23, 2020, 2:32pm

Thanks @jdjame - that clears things up a bit. One thing worth looking into is the blas/openblas/mkl dance that happens on the defaults and conda-forge channels - you might be comparing apples and oranges here, in terms of numerical libraries. You're probably going to want to build up a "minimum working example" for the statsmodel folks (if you come to them and say "the qiime env is broken" I wouldn't be surprised if they just shrugged and said "okay?"). Keep us posted!

PS - your table above focuses on the difference between two package's results, however you also reported above that you saw differences within-package (but across envs) - maybe you should focus on that problem, first - its probably the smaller (and simpler) one.

thermokarst · June 23, 2020, 2:40pm

Cross-referencing @jdjame's statsmodel issue here:

github.com/statsmodels/statsmodels

Different results for linear regression based on conda environment

opened 01:27PM - 23 Jun 20 UTC

closed 02:52PM - 23 Jun 20 UTC

jdjame

# Summary: I have noticed different results of the parameters I get as output …when running simple linear regression (with ordinary least squares) on the qiime conda environment and other conda environments. I came across this issue when I ran code in a conda environment and got drastically different results than when I ran it in other conda environments. I am curious to know if anyone can give me a reason for this difference # Specifics: The two environments used were a qiime2-2020.2 environment (python 3.6.7, ```statsmodels==0.11.1```, ```scikit-learn==0.22.1```, and ```numpy===1.18.1```) and a standard base conda environment (python 3.7.4, ```statsmodels==0.11.1```, ```scikit-learn==0.21.3```, and ```numpy===1.17.2```). My issue is that in the qiime environment, both statsmodels.api.OLS and sklearn.linear_model.LinearRegression give different values when fitted on the same input. # Sample code The following is some code that I use to observe the differenced in both. Here, *x.csv* is a csv of the design matrix used to fit and and *y.csv* is a csv of the response variable. The aim is for the code to fit 14 parameters (using ordinary least squares) on 303 samples. ``` import statsmodels.api as sm from sklearn.linear_model import LinearRegression import numpy as np import pandas as pd x= pd.read_csv("x.csv", delimiter=',',encoding='utf-8',index_col=0,low_memory=False) y=pd.read_csv("y.csv", delimiter=',',encoding='utf-8',index_col=0,low_memory=False) model = sm.OLS(y,x) res = model.fit() print(res.params) model=LinearRegression(fit_intercept=False).fit(x,y) print(model.coef_) print((abs(res.params -model.coef_[0]).sum())) ``` # Results So right off the bat, we get different results when running this script the qiime environment and the base environment. The difference in the two is quite significant (for some parameters, the solutions differ by up to 27 orders of magnitude). When I looked into this further, it seems that all other environments I have run this script on have the same output as the base environment. Furthermore, within these agreeing environments, the sum of the absolute differences between the parameters estimated through sklearn and statsmodels is never greater that 1e-14 which is expected. However, in the qiime environment, this sum of differences is ~0.60. Recap: | package\env | qiime | base | |-------------------------------------------------------------------|--------|--------| | python | 3.7.4 | 3.6.7 | | pandas | 0.25.1 | 0.25.3 | | scikit-learn | 0.22.1 | 0.21.3 | | statsmodels | 0.11.1 | 0.11.1 | | numpy | 1.17.2 | 1.18.1 | | difference between statsmodels estimate and scikit-lean estimate | .60 | 1e-14 | # Question I would welcome any explanation for the discrepancy between the **qiime** environment and the other conda environments. I am curious to know if this is an effect of the python version I am using or something completely different.

As of the time I am writing this, it sounds like the statsmodels folks are also suspicious about the blas-business.

system · July 24, 2020, 8:43pm

This topic was automatically closed 31 days after the last reply. New replies are no longer allowed.