Friday, September 24, 2010

Regression with Student-t Errors

To fit a regression that assumes the errors follow a Student-t distribution instead of a Normal, the class to use in Statsmodels is TLinearModel. In R, the same thing is accomplished with the tlm function from the "hett" library. Below we share R and Python sample code that fits the same data with both functions and reports the results.

The hett library can be installed in R with install.packages("hett"). TLinearModel is part of a development branch of Statsmodels, which you can get here:

https://code.launchpad.net/~scipystats/statsmodels/devel
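
For orientation, the model being fit is y = X*b + e, with e following a scaled Student-t distribution. Below is a minimal sketch of the corresponding negative log-likelihood; the parameter layout [coefficients, df, scale] matches the start_params vector used in the script, but this is only an illustration of the model, not TLinearModel's actual implementation.

import numpy as np
from scipy import stats

def t_negloglike(params, endog, exog):
    # params = [beta..., df, scale]; errors are scale * t(df)
    beta, df, scale = params[:-2], params[-2], params[-1]
    resid = endog - np.dot(exog, beta)
    return -stats.t.logpdf(resid, df, loc=0, scale=scale).sum()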

test_tlm.py

import numpy as np
from scipy import stats, special, optimize
import scikits.statsmodels as sm
from scikits.statsmodels.miscmodels import TLinearModel

nobs = 60    # number of observations in mm.dat
nvars = 2    # number of regression coefficients (slope + constant)
df = 3       # rough guess for the degrees of freedom (replaced below by a kurtosis-based estimate)

# columns 3 and 4 of mm.dat hold the m.marietta and CRSP returns
mm = np.loadtxt("mm.dat", skiprows=1, usecols=(3, 4))

data_exog = mm[:,1]   # CRSP (regressor)
data_endog = mm[:,0]  # m.marietta (response)

data_exog = sm.add_constant(data_exog)

# OLS fit provides starting values for the t model
res_ols = sm.OLS(data_endog, data_exog).fit()

# the excess kurtosis of a t distribution is 6/(df - 4), so inverting it
# gives a rough starting value for the degrees of freedom
kurt = stats.kurtosis(res_ols.resid)
df_fromkurt = 6./kurt + 4

modp = TLinearModel(data_endog, data_exog)

# parameter vector layout: [beta..., df, scale]
start_value = 0.1*np.ones(data_exog.shape[1]+2)

start_value[:nvars] = res_ols.params
start_value[-2] = df_fromkurt
start_value[-1] = np.sqrt(res_ols.scale)
modp.start_params = start_value

# fix the degrees of freedom at 5; entries left as NaN stay free and are
# estimated, and the mask drops the fixed entry from the start values
fixdf = np.nan * np.zeros(modp.start_params.shape)
fixdf[-2] = 5

modp.fixed_params = fixdf
modp.fixed_paramsmask = np.isnan(fixdf)
modp.start_params = modp.start_params[modp.fixed_paramsmask]

resp = modp.fit(start_params = modp.start_params, disp=1, method='nm',
                maxfun=10000, maxiter=5000)

print '---------------------------'
print resp.bse
print resp.params
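
The script above fixes the degrees of freedom at 5 through fixed_params. To estimate the degrees of freedom as well, as the R fit with estDof = TRUE does, the fixing step can simply be skipped. A rough sketch, reusing the objects defined above and assuming TLinearModel estimates every entry of start_params when fixed_params is not set:

# leave fixed_params unset so df and scale are estimated along with the betas
modp2 = TLinearModel(data_endog, data_exog)
resp2 = modp2.fit(start_params=start_value, disp=1, method='nm',
                  maxfun=10000, maxiter=5000)
print resp2.params   # [beta..., df, scale]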


test_tlm.R

library("hett")
data(mm, package = "hett")
attach(mm)
# the second formula (~ CRSP) adds a heteroscedasticity correction for the scale
# tfit2 <- tlm(m.marietta ~ CRSP, ~ CRSP, data = mm, start = list(dof = 3), estDof = TRUE)
tfit2 <- tlm(m.marietta ~ CRSP, data = mm, start = list(dof = 3), estDof = TRUE)
summary(tfit2)

#(Intercept) -0.006437   0.008096  -0.795     0.43    
#CRSP         1.206853   0.205935   5.860 2.31e-07 ***

mm.dat

"date" "am.can" "m.marietta" "CRSP"
"1" "Jan.1982" -0.0596 -0.1365 -0.03
"2" "Feb.1982" -0.17 -0.0769 -0.0584
"3" "Mar.1982" 0.0276 -0.0575 -0.0181
"4" "Apr.1982" 0.0058 0.0526 0.0306
"5" "May1982" -0.0106 -0.0449 -0.0397
"6" "June1982" 0.045 -0.0859 -0.0295
"7" "Jul.1982" -0.0243 -0.0742 -0.0316
"8" "Aug.1982" 0.1135 0.6879 0.1176
"9" "Sept.1982" -0.0331 -0.077 0.0075
"10" "Oct.1982" 0.0468 0.085 0.1098
"11" "Nov.1982" -0.0223 0.003 0.0408
"12" "Dec.1982" -0.0026 0.0754 0.0095
"13" "Jan.1983" 0.0166 -0.0412 0.0301
"14" "Feb.1983" 0.0343 -0.089 0.0221
"15" "Mar.1983" 0.0443 0.2319 0.0269
"16" "Apr.1983" 0.1477 0.1087 0.0655
"17" "May1983" 0.1728 0.0375 -0.003
"18" "June1983" -0.0372 0.0958 0.0325
"19" "July1983" -0.0451 0.0174 -0.0374
"20" "Aug.1983" -0.0257 -0.0724 0.0049
"21" "Sept.1983" 0.0509 0.075 0.0105
"22" "Oct.1983" 0.0035 -0.0588 -0.0257
"23" "Nov.1983" 0.1334 -0.062 0.0186
"24" "Dec.1983" -0.0458 -0.0378 -0.0155
"25" "Jan.1984" 0.1199 0.0169 -0.0165
"26" "Feb.1284" -0.0766 -0.0799 -0.044
"27" "Mar.1984" -0.0511 -0.0147 0.0094
"28" "Apr.1984" -0.0194 0.0106 -0.0028
"29" "May1984" -0.0687 -0.0421 -0.0591
"30" "June1984" 0.0928 -0.0036 0.0158
"31" "July1984" -0.0704 0.0876 -0.0238
"32" "Aug.1984" 0.0905 0.1025 0.1031
"33" "Sept.1984" 0.0232 -0.0499 -0.0065
"34" "Oct.1984" -0.0054 0.1953 -0.0067
"35" "Nov.1984" 0.0082 -0.0714 -0.0167
"36" "Dec.1984" 0.0242 0.0469 0.0188
"37" "Jan.1985" 0.0153 0.1311 0.0733
"38" "Feb.1985" 0.0016 0.0461 0.0105
"39" "Mar.1985" 0.028 -0.0328 -0.007
"40" "Apr.1985" 0.0088 -0.0096 -0.0099
"41" "May1985" 0.0734 0.1272 0.0521
"42" "June1985" 0.0315 -0.0077 0.0117
"43" "July1985" -0.0276 0.0165 -0.0099
"44" "Aug.1985" 0.0162 -0.015 -0.0102
"45" "Sept.1985" -0.0975 -0.1479 -0.0428
"46" "Oct.1985" 0.0563 -0.0065 0.0376
"47" "Nov.1985" 0.1368 0.039 0.0628
"48" "Dec.1985" -0.069 0.0223 0.0391
"49" "Jan.1986" 0.1044 -0.069 2e-04
"50" "Feb.1986" 0.1636 0.1338 0.0688
"51" "Mar.1986" -0.019 0.1458 0.0486
"52" "Apr.1986" -0.0746 0.0063 -0.0174
"53" "May1986" 0.0433 0.0692 0.046
"54" "June1986" 0.0306 -0.0239 0.01
"55" "July1986" 0.0636 -0.0568 -0.0594
"56" "Aug.1986" 0.0917 0.0814 0.068
"57" "Sept.1986" -0.0796 -0.0889 -0.0839
"58" "Oct.1986" 0.0778 -0.0887 0.0481
"59" "Nov.1986" -0.0353 0.1037 0.0136
"60" "Dec.1986" -0.0137 -0.1163 -0.0322

Tuesday, September 14, 2010

Statsmodels - Ordinary Least Squares

Here is a simple example of performing ordinary least squares with Scikits Statsmodels. An extra column of 1's was added to the X matrix to estimate the intercept of the regression line; the sm.add_constant() call could also be used to add this column.
import numpy as np
import scikits.statsmodels as sm

y = [11,14,19,26]

X = [[1,1],[2,1],[3,1],[4,1]]

olsmod = sm.OLS(y, X)
olsres = olsmod.fit()

print olsres.params
print olsres.bse

# predict using new data points
ypred = olsmod.predict([[5,1],[6,1]])
print ypred
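
As mentioned above, sm.add_constant() can build the same design matrix without writing the 1's by hand. A minimal sketch of the equivalent fit:

import numpy as np
import scikits.statsmodels as sm

x = np.array([1, 2, 3, 4])
X2 = sm.add_constant(x)   # appends the column of 1's
res2 = sm.OLS([11, 14, 19, 26], X2).fit()
print res2.params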

Friday, September 10, 2010

GLM and Python, scikits.statsmodels

The Python version of the code for ARM 6.2 uses the GLM() call from the scikits.statsmodels package. AFAIK there is no equivalent of R's factor() in this package (a manual workaround is sketched after the code below). Also, in order to get an intercept as in R, we need to add an extra column of 1's. Other than that, the modified R calls (without factors) and the Python GLM calls match exactly.
import numpy
import scikits.statsmodels as sm
from scipy import stats

data = numpy.loadtxt("frisk_with_noise.dat", skiprows=7)

# design matrix built as 3 rows (regressors) by n observations; transposed before the fit
X = numpy.zeros((3,len(data[:,0])))
print X.shape

# avoid log(0): replace zero arrest counts with 1 before taking logs
arrests = data[:,2]
arrests[arrests == 0] = 1
arrests = numpy.log(arrests)

# replace zero stop counts with a small positive value
stops = data[:,0]
stops[stops==0.0] = .0001

X[0,:] = arrests                       # log(arrests)
X[1,:] = data[:,4]                     # eth
X[2,:] = numpy.ones(len(data[:,0]))    # column of 1's for the intercept

glm = sm.GLM(stops, X.T, family=sm.family.Poisson())
res = glm.fit()
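
Since there is no factor() equivalent here, one way to get the same effect is to build the 0/1 dummy columns by hand with numpy and stack them into the design matrix. Below is a rough sketch continuing from the script above; it assumes eth (column 4) holds a small set of integer codes, and make_dummies is just an illustrative helper, not part of scikits.statsmodels.

def make_dummies(codes):
    # one indicator column per level, dropping the first level as the
    # reference category (what R's factor() contrasts do by default)
    levels = numpy.unique(codes)
    return numpy.column_stack([(codes == lev).astype(float) for lev in levels[1:]])

eth_dummies = make_dummies(data[:,4])
X_f = numpy.column_stack([arrests, eth_dummies, numpy.ones(len(stops))])
glm_f = sm.GLM(stops, X_f, family=sm.family.Poisson())
res_f = glm_f.fit()
print res_f.params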