Wednesday, November 3, 2010

Math is like music, statistics is like literature

An excellent link from Dr. Gelman's blog. Statistician Dick De Veaux says:

We haven’t evolved to be statisticians. Our students who think statistics is an unnatural subject are right. This isn’t how humans think naturally. But it is how humans think rationally. And it is how scientists think. This is the way we must think if we are to make progress in understanding how the world works and, for that matter, how we ourselves work.

Monday, November 1, 2010

Multilevel Modeling on UK School Data

I came across a dataset on UK schools that has scores, student / school gender information. After some digging on the Net, I found a sample Bugs model for this data, so I decided to give it a try. The data file was mentioned here http://dss.princeton.edu/training/Multilevel101.pdf and is at http://dss.princeton.edu/training/schools.dta.

I converted this data Python readable format and saved it as testing.dat. Sample Bugs model was at http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-software/reviewwinbugs.pdf

Other useful info: http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-software/reviewaml.pdf

The example code is written in Python / JAGS. The model can certainly be improved. Keyword dflat() does not seem to be supported anymore, so dnorm(0.0,1.0E-6) was used instead when necessary.

Data and recent code

import rpy2.robjects as R
import rpy2.rinterface as rinterface
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri
from rpy2.robjects.vectors import FloatVector, StrVector
import numpy as np
r2jags = importr('R2jags')

schools = np.loadtxt("testing.dat",  skiprows=1)

n=len(schools)
J = np.unique(np.max(schools[:,0]))[0]
y = schools[:,2]
school = schools[:,0]
standlrt = schools[:,3] # reading score
gender = schools[:,4] # student gender
school_gender = schools[:,5] # school gender, 1 mixed, 2 boys, 3 girls, 

boysch = np.zeros(n)
boysch[school_gender == '2'] = 1

girlsch = np.zeros(n)
girlsch[school_gender == '3'] = 1

R.r.assign('n',n)
R.r.assign('J',J)
R.r.assign('y',y)
R.r.assign('school',school)
R.r.assign('standlrt',standlrt)
R.r.assign('gender',gender)
R.r.assign('boysch',boysch)
R.r.assign('girlsch',girlsch)

jags_data = StrVector(("n", "J", "y", "school", "standlrt", "gender", "boysch", "girlsch"))
jags_params = StrVector(("a", "beta","tau.y", "tau.a"))

jags_inits = R.r('''function (){
  list (a=rnorm(J),mu=rnorm(1),sigma.y=runif(1), sigma.a=runif(1))
  }''')

jags_schools = r2jags.jags(data = jags_data, inits = jags_inits,
                           parameters_to_save = jags_params,
                           n_iter = 10000,
                           n_chains = 3, n_thin = 100,
                           model_file = "testing.bug")

print jags_schools



Wednesday, October 13, 2010

N Maximum Elemens using Python

This method will return N topmost (maximum) elements in a list.
def n_max(list, n):
y = range(len(list))
i = sorted(y, key=list.__getitem__)
res = i[len(list)-n:len(list)]
res.reverse()
return res

Friday, September 24, 2010

Regression with Student-t Errors

In order to fit a regression that assumes error is distributed as Student-t instead of Normal, Statsmodels' TLinearModel is the class to use. On R same thing is accomplished using the "hett" library tlm function. We share here R and Python sample code that uses same data to report results using both functions.

Hett library can be installed on R by using install.packages("hett"). TLinearModel is part of a development branch on Statsmodels, you can get it here:

https://code.launchpad.net/~scipystats/statsmodels/devel

test_tlm.py

import numpy as np
from scipy import stats, special, optimize
import scikits.statsmodels as sm
from scikits.statsmodels.miscmodels import TLinearModel

nobs = 60
nvars = 2
df = 3

mm = np.loadtxt("mm.dat",  skiprows=1, usecols = (3,4))

data_exog = mm[:,1]
data_endog = mm[:,0]

data_exog = sm.add_constant(data_exog)

res_ols = sm.OLS(data_endog, data_exog).fit()

kurt = stats.kurtosis(res_ols.resid)
df_fromkurt = 6./kurt + 4

modp = TLinearModel(data_endog, data_exog)

start_value = 0.1*np.ones(data_exog.shape[1]+2)

start_value[:nvars] = res_ols.params
start_value[-2] = df_fromkurt 
start_value[-1] = np.sqrt(res_ols.scale) #0.5
modp.start_params = start_value

fixdf = np.nan * np.zeros(modp.start_params.shape)
fixdf[-2] = 5

modp.fixed_params = fixdf
modp.fixed_paramsmask = np.isnan(fixdf)
modp.start_params = modp.start_params[modp.fixed_paramsmask]

resp = modp.fit(start_params = modp.start_params, disp=1, method='nm',
                maxfun=10000, maxiter=5000)

print '---------------------------'
print resp.bse
print resp.params


test_tlm.R

library("hett")
data(mm, package = "hett")
attach(mm)
# apparently second ~ CRSP is for skasdicity correction
# tfit2 <- crsp="" data="mm," estdof="TRUE)</font" m.marietta="" start="list(dof" tlm="">
tfit2 <- crsp="" data="mm," estdof="TRUE)</font" m.marietta="" start="list(dof" tlm="">
summary(tfit2)

#(Intercept) -0.006437   0.008096  -0.795     0.43    
#CRSP         1.206853   0.205935   5.860 2.31e-07 ***

mm.dat

"date" "am.can" "m.marietta" "CRSP"
"1" "Jan.1982" -0.0596 -0.1365 -0.03
"2" "Feb.1982" -0.17 -0.0769 -0.0584
"3" "Mar.1982" 0.0276 -0.0575 -0.0181
"4" "Apr.1982" 0.0058 0.0526 0.0306
"5" "May1982" -0.0106 -0.0449 -0.0397
"6" "June1982" 0.045 -0.0859 -0.0295
"7" "Jul.1982" -0.0243 -0.0742 -0.0316
"8" "Aug.1982" 0.1135 0.6879 0.1176
"9" "Sept.1982" -0.0331 -0.077 0.0075
"10" "Oct.1982" 0.0468 0.085 0.1098
"11" "Nov.1982" -0.0223 0.003 0.0408
"12" "Dec.1982" -0.0026 0.0754 0.0095
"13" "Jan.1983" 0.0166 -0.0412 0.0301
"14" "Feb.1983" 0.0343 -0.089 0.0221
"15" "Mar.1983" 0.0443 0.2319 0.0269
"16" "Apr.1983" 0.1477 0.1087 0.0655
"17" "May1983" 0.1728 0.0375 -0.003
"18" "June1983" -0.0372 0.0958 0.0325
"19" "July1983" -0.0451 0.0174 -0.0374
"20" "Aug.1983" -0.0257 -0.0724 0.0049
"21" "Sept.1983" 0.0509 0.075 0.0105
"22" "Oct.1983" 0.0035 -0.0588 -0.0257
"23" "Nov.1983" 0.1334 -0.062 0.0186
"24" "Dec.1983" -0.0458 -0.0378 -0.0155
"25" "Jan.1984" 0.1199 0.0169 -0.0165
"26" "Feb.1284" -0.0766 -0.0799 -0.044
"27" "Mar.1984" -0.0511 -0.0147 0.0094
"28" "Apr.1984" -0.0194 0.0106 -0.0028
"29" "May1984" -0.0687 -0.0421 -0.0591
"30" "June1984" 0.0928 -0.0036 0.0158
"31" "July1984" -0.0704 0.0876 -0.0238
"32" "Aug.1984" 0.0905 0.1025 0.1031
"33" "Sept.1984" 0.0232 -0.0499 -0.0065
"34" "Oct.1984" -0.0054 0.1953 -0.0067
"35" "Nov.1984" 0.0082 -0.0714 -0.0167
"36" "Dec.1984" 0.0242 0.0469 0.0188
"37" "Jan.1985" 0.0153 0.1311 0.0733
"38" "Feb.1985" 0.0016 0.0461 0.0105
"39" "Mar.1985" 0.028 -0.0328 -0.007
"40" "Apr.1985" 0.0088 -0.0096 -0.0099
"41" "May1985" 0.0734 0.1272 0.0521
"42" "June1985" 0.0315 -0.0077 0.0117
"43" "July1985" -0.0276 0.0165 -0.0099
"44" "Aug.1985" 0.0162 -0.015 -0.0102
"45" "Sept.1985" -0.0975 -0.1479 -0.0428
"46" "Oct.1985" 0.0563 -0.0065 0.0376
"47" "Nov.1985" 0.1368 0.039 0.0628
"48" "Dec.1985" -0.069 0.0223 0.0391
"49" "Jan.1986" 0.1044 -0.069 2e-04
"50" "Feb.1986" 0.1636 0.1338 0.0688
"51" "Mar.1986" -0.019 0.1458 0.0486
"52" "Apr.1986" -0.0746 0.0063 -0.0174
"53" "May1986" 0.0433 0.0692 0.046
"54" "June1986" 0.0306 -0.0239 0.01
"55" "July1986" 0.0636 -0.0568 -0.0594
"56" "Aug.1986" 0.0917 0.0814 0.068
"57" "Sept.1986" -0.0796 -0.0889 -0.0839
"58" "Oct.1986" 0.0778 -0.0887 0.0481
"59" "Nov.1986" -0.0353 0.1037 0.0136
"60" "Dec.1986" -0.0137 -0.1163 -0.0322