Summary:
I have noticed that simple linear regression (with ordinary least squares) returns different parameter estimates when run in the qiime conda environment than in other conda environments. I am curious to know if anyone can give me a reason for this difference.
Specifics:
The two environments used were a qiime2-2020.2 environment (python 3.6.7, statsmodels==0.11.1, scikit-learn==0.22.1, and numpy==1.18.1) and a standard base conda environment (python 3.7.4, statsmodels==0.11.1, scikit-learn==0.21.3, and numpy==1.17.2). My issue is that in the qiime environment, statsmodels.api.OLS and sklearn.linear_model.LinearRegression give different values from each other when fitted on the same input.
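For anyone who wants to compare their own setup against the versions listed above, they can be collected with a short script along these lines. This is only a minimal sketch: the helper name `environment_report` is my own (hypothetical), and the package list is just the set of libraries this post touches.

```python
import importlib
import platform


def environment_report(packages=("numpy", "scipy", "pandas", "statsmodels", "sklearn")):
    """Return a dict mapping 'python' and each package name to its version string.

    `environment_report` is a hypothetical helper written for this post,
    not part of any library.
    """
    report = {"python": platform.python_version()}
    for name in packages:
        try:
            module = importlib.import_module(name)
            report[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            # The package is not available in this environment.
            report[name] = "not installed"
    return report


report = environment_report()
for name, version in report.items():
    print(f"{name}: {version}")
```

Running this in each environment gives output that can be compared line by line. In addition, `numpy.show_config()` prints the BLAS/LAPACK build information for the installed numpy, which may differ between environments even when the package versions look similar.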
Sample code
The following is the code I use to observe the difference between the two. Here, x.csv is a csv of the design matrix used for the fit and y.csv is a csv of the response variable. The aim is for the code to fit 14 parameters (using ordinary least squares) on 303 samples.
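import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
# Design matrix (303 samples x 14 parameters) and response variable.
x = pd.read_csv("x.csv", delimiter=",", encoding="utf-8", index_col=0, low_memory=False)
y = pd.read_csv("y.csv", delimiter=",", encoding="utf-8", index_col=0, low_memory=False)
# Ordinary least squares with statsmodels (sm.OLS does not add an intercept on its own).
sm_model = sm.OLS(y, x)
res = sm_model.fit()
print(res.params)
# The same fit with scikit-learn, also without an intercept.
sk_model = LinearRegression(fit_intercept=False).fit(x, y)
print(sk_model.coef_)
# Sum of absolute differences between the two sets of coefficients.
print(abs(res.params - sk_model.coef_[0]).sum())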
Results
So right off the bat, we get different results when running this script in the qiime environment and in the base environment. The difference between the two is large: for some parameters, the solutions differ by up to 27 orders of magnitude. When I looked into this further, every other environment I have run this script in gave the same output as the base environment. Furthermore, within these agreeing environments, the sum of the absolute differences between the parameters estimated through sklearn and statsmodels is never greater than 1e-14, which is expected. However, in the qiime environment, this sum of differences is ~0.60.
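One check that could be run in each environment is whether the design matrix is ill-conditioned or rank-deficient. This is an assumption on my part, not something I have confirmed for my data. In that case the least-squares solution is not pinned down numerically, and different solvers or BLAS/LAPACK builds might return different coefficients. The sketch below uses synthetic stand-in data rather than my x.csv and y.csv. The helper names `design_diagnostics` and `solver_gap` are hypothetical, and the two NumPy routines it compares are stand-ins, not necessarily what statsmodels or scikit-learn call internally.

```python
import numpy as np


def design_diagnostics(X):
    """Return shape, numerical rank, and 2-norm condition number of X."""
    X = np.asarray(X, dtype=float)
    s = np.linalg.svd(X, compute_uv=False)  # singular values, descending
    cond = s[0] / s[-1] if s[-1] > 0 else np.inf
    return {
        "shape": X.shape,
        "rank": int(np.linalg.matrix_rank(X)),
        "condition_number": float(cond),
    }


def solver_gap(X, y):
    """Sum of absolute differences between two least-squares solutions."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    beta_pinv = np.linalg.pinv(X) @ y                   # pseudo-inverse solution
    beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]   # LAPACK least squares
    return float(np.abs(beta_pinv - beta_lstsq).sum())


# Synthetic stand-in data (NOT the x.csv / y.csv from this post).
rng = np.random.default_rng(0)
X_good = rng.normal(size=(303, 14))
y_demo = X_good @ np.arange(1.0, 15.0) + rng.normal(scale=0.1, size=303)

# Make the last column an exact linear combination of two other columns.
X_bad = X_good.copy()
X_bad[:, 13] = X_bad[:, 0] + X_bad[:, 1]

diag_good = design_diagnostics(X_good)
diag_bad = design_diagnostics(X_bad)
gap_good = solver_gap(X_good, y_demo)

print(diag_good)
print(diag_bad)
print(gap_good, solver_gap(X_bad, y_demo))
```

With real data, `design_diagnostics(x.values)` would report whether the rank is below 14 or the condition number is very large. If it is, the coefficients themselves would be expected to be sensitive to the numerical backend, even though the fitted values may still agree.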
Question
I would welcome any explanation for the discrepancy between the qiime environment and the other conda environments. I am curious to know if this is an effect of the python version I am using or something completely different.