We haven’t evolved to be statisticians. Our students who think statistics is an unnatural subject are right. This isn’t how humans think naturally. But it is how humans think rationally. And it is how scientists think. This is the way we must think if we are to make progress in understanding how the world works and, for that matter, how we ourselves work.
Wednesday, November 3, 2010
Math is like music, statistics is like literature
An excellent link from Dr. Gelman's blog. Statistician Dick De Veaux says:
Monday, November 1, 2010
Multilevel Modeling on UK School Data
I came across a dataset on UK schools that has scores, student / school gender information. After some digging on the Net, I found a sample Bugs model for this data, so I decided to give it a try. The data file was mentioned here http://dss.princeton.edu/training/Multilevel101.pdf and is at http://dss.princeton.edu/training/schools.dta.
I converted this data Python readable format and saved it as testing.dat. Sample Bugs model was at http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-software/reviewwinbugs.pdf
Other useful info: http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-software/reviewaml.pdf
The example code is written in Python / JAGS. The model can certainly be improved. Keyword dflat() does not seem to be supported anymore, so dnorm(0.0,1.0E-6) was used instead when necessary.
Data and recent code
import rpy2.robjects as R
import rpy2.rinterface as rinterface
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri
from rpy2.robjects.vectors import FloatVector, StrVector
import numpy as np
r2jags = importr('R2jags')
schools = np.loadtxt("testing.dat", skiprows=1)
n=len(schools)
J = np.unique(np.max(schools[:,0]))[0]
y = schools[:,2]
school = schools[:,0]
standlrt = schools[:,3] # reading score
gender = schools[:,4] # student gender
school_gender = schools[:,5] # school gender, 1 mixed, 2 boys, 3 girls,
boysch = np.zeros(n)
boysch[school_gender == '2'] = 1
girlsch = np.zeros(n)
girlsch[school_gender == '3'] = 1
R.r.assign('n',n)
R.r.assign('J',J)
R.r.assign('y',y)
R.r.assign('school',school)
R.r.assign('standlrt',standlrt)
R.r.assign('gender',gender)
R.r.assign('boysch',boysch)
R.r.assign('girlsch',girlsch)
jags_data = StrVector(("n", "J", "y", "school", "standlrt", "gender", "boysch", "girlsch"))
jags_params = StrVector(("a", "beta","tau.y", "tau.a"))
jags_inits = R.r('''function (){
list (a=rnorm(J),mu=rnorm(1),sigma.y=runif(1), sigma.a=runif(1))
}''')
jags_schools = r2jags.jags(data = jags_data, inits = jags_inits,
parameters_to_save = jags_params,
n_iter = 10000,
n_chains = 3, n_thin = 100,
model_file = "testing.bug")
print jags_schools
I converted this data Python readable format and saved it as testing.dat. Sample Bugs model was at http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-software/reviewwinbugs.pdf
Other useful info: http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-software/reviewaml.pdf
The example code is written in Python / JAGS. The model can certainly be improved. Keyword dflat() does not seem to be supported anymore, so dnorm(0.0,1.0E-6) was used instead when necessary.
Data and recent code
import rpy2.robjects as R
import rpy2.rinterface as rinterface
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri
from rpy2.robjects.vectors import FloatVector, StrVector
import numpy as np
r2jags = importr('R2jags')
schools = np.loadtxt("testing.dat", skiprows=1)
n=len(schools)
J = np.unique(np.max(schools[:,0]))[0]
y = schools[:,2]
school = schools[:,0]
standlrt = schools[:,3] # reading score
gender = schools[:,4] # student gender
school_gender = schools[:,5] # school gender, 1 mixed, 2 boys, 3 girls,
boysch = np.zeros(n)
boysch[school_gender == '2'] = 1
girlsch = np.zeros(n)
girlsch[school_gender == '3'] = 1
R.r.assign('n',n)
R.r.assign('J',J)
R.r.assign('y',y)
R.r.assign('school',school)
R.r.assign('standlrt',standlrt)
R.r.assign('gender',gender)
R.r.assign('boysch',boysch)
R.r.assign('girlsch',girlsch)
jags_data = StrVector(("n", "J", "y", "school", "standlrt", "gender", "boysch", "girlsch"))
jags_params = StrVector(("a", "beta","tau.y", "tau.a"))
jags_inits = R.r('''function (){
list (a=rnorm(J),mu=rnorm(1),sigma.y=runif(1), sigma.a=runif(1))
}''')
jags_schools = r2jags.jags(data = jags_data, inits = jags_inits,
parameters_to_save = jags_params,
n_iter = 10000,
n_chains = 3, n_thin = 100,
model_file = "testing.bug")
print jags_schools
Wednesday, October 13, 2010
N Maximum Elemens using Python
This method will return N topmost (maximum) elements in a list.
def n_max(list, n):
y = range(len(list))
i = sorted(y, key=list.__getitem__)
res = i[len(list)-n:len(list)]
res.reverse()
return res
Friday, September 24, 2010
Regression with Student-t Errors
In order to fit a regression that assumes error is distributed as Student-t instead of Normal, Statsmodels' TLinearModel is the class to use. On R same thing is accomplished using the "hett" library tlm function. We share here R and Python sample code that uses same data to report results using both functions.
Hett library can be installed on R by using install.packages("hett"). TLinearModel is part of a development branch on Statsmodels, you can get it here:
https://code.launchpad.net/~scipystats/statsmodels/devel
test_tlm.py
import numpy as np
from scipy import stats, special, optimize
import scikits.statsmodels as sm
from scikits.statsmodels.miscmodels import TLinearModel
nobs = 60
nvars = 2
df = 3
mm = np.loadtxt("mm.dat", skiprows=1, usecols = (3,4))
data_exog = mm[:,1]
data_endog = mm[:,0]
data_exog = sm.add_constant(data_exog)
res_ols = sm.OLS(data_endog, data_exog).fit()
kurt = stats.kurtosis(res_ols.resid)
df_fromkurt = 6./kurt + 4
modp = TLinearModel(data_endog, data_exog)
start_value = 0.1*np.ones(data_exog.shape[1]+2)
start_value[:nvars] = res_ols.params
start_value[-2] = df_fromkurt
start_value[-1] = np.sqrt(res_ols.scale) #0.5
modp.start_params = start_value
fixdf = np.nan * np.zeros(modp.start_params.shape)
fixdf[-2] = 5
modp.fixed_params = fixdf
modp.fixed_paramsmask = np.isnan(fixdf)
modp.start_params = modp.start_params[modp.fixed_paramsmask]
resp = modp.fit(start_params = modp.start_params, disp=1, method='nm',
maxfun=10000, maxiter=5000)
print '---------------------------'
print resp.bse
print resp.params
test_tlm.R
library("hett")
data(mm, package = "hett")
attach(mm)
# apparently second ~ CRSP is for skasdicity correction
# tfit2 <- crsp="" data="mm," estdof="TRUE)</font" m.marietta="" start="list(dof" tlm="">->
tfit2 <- crsp="" data="mm," estdof="TRUE)</font" m.marietta="" start="list(dof" tlm="">->
summary(tfit2)
#(Intercept) -0.006437 0.008096 -0.795 0.43
#CRSP 1.206853 0.205935 5.860 2.31e-07 ***
Hett library can be installed on R by using install.packages("hett"). TLinearModel is part of a development branch on Statsmodels, you can get it here:
https://code.launchpad.net/~scipystats/statsmodels/devel
test_tlm.py
import numpy as np
from scipy import stats, special, optimize
import scikits.statsmodels as sm
from scikits.statsmodels.miscmodels import TLinearModel
nobs = 60
nvars = 2
df = 3
mm = np.loadtxt("mm.dat", skiprows=1, usecols = (3,4))
data_exog = mm[:,1]
data_endog = mm[:,0]
data_exog = sm.add_constant(data_exog)
res_ols = sm.OLS(data_endog, data_exog).fit()
kurt = stats.kurtosis(res_ols.resid)
df_fromkurt = 6./kurt + 4
modp = TLinearModel(data_endog, data_exog)
start_value = 0.1*np.ones(data_exog.shape[1]+2)
start_value[:nvars] = res_ols.params
start_value[-2] = df_fromkurt
start_value[-1] = np.sqrt(res_ols.scale) #0.5
modp.start_params = start_value
fixdf = np.nan * np.zeros(modp.start_params.shape)
fixdf[-2] = 5
modp.fixed_params = fixdf
modp.fixed_paramsmask = np.isnan(fixdf)
modp.start_params = modp.start_params[modp.fixed_paramsmask]
resp = modp.fit(start_params = modp.start_params, disp=1, method='nm',
maxfun=10000, maxiter=5000)
print '---------------------------'
print resp.bse
print resp.params
test_tlm.R
library("hett")
data(mm, package = "hett")
attach(mm)
# apparently second ~ CRSP is for skasdicity correction
# tfit2 <- crsp="" data="mm," estdof="TRUE)</font" m.marietta="" start="list(dof" tlm="">->
tfit2 <- crsp="" data="mm," estdof="TRUE)</font" m.marietta="" start="list(dof" tlm="">->
summary(tfit2)
#(Intercept) -0.006437 0.008096 -0.795 0.43
#CRSP 1.206853 0.205935 5.860 2.31e-07 ***
mm.dat
"date" "am.can" "m.marietta" "CRSP"
"1" "Jan.1982" -0.0596 -0.1365 -0.03
"2" "Feb.1982" -0.17 -0.0769 -0.0584
"3" "Mar.1982" 0.0276 -0.0575 -0.0181
"4" "Apr.1982" 0.0058 0.0526 0.0306
"5" "May1982" -0.0106 -0.0449 -0.0397
"6" "June1982" 0.045 -0.0859 -0.0295
"7" "Jul.1982" -0.0243 -0.0742 -0.0316
"8" "Aug.1982" 0.1135 0.6879 0.1176
"9" "Sept.1982" -0.0331 -0.077 0.0075
"10" "Oct.1982" 0.0468 0.085 0.1098
"11" "Nov.1982" -0.0223 0.003 0.0408
"12" "Dec.1982" -0.0026 0.0754 0.0095
"13" "Jan.1983" 0.0166 -0.0412 0.0301
"14" "Feb.1983" 0.0343 -0.089 0.0221
"15" "Mar.1983" 0.0443 0.2319 0.0269
"16" "Apr.1983" 0.1477 0.1087 0.0655
"17" "May1983" 0.1728 0.0375 -0.003
"18" "June1983" -0.0372 0.0958 0.0325
"19" "July1983" -0.0451 0.0174 -0.0374
"20" "Aug.1983" -0.0257 -0.0724 0.0049
"21" "Sept.1983" 0.0509 0.075 0.0105
"22" "Oct.1983" 0.0035 -0.0588 -0.0257
"23" "Nov.1983" 0.1334 -0.062 0.0186
"24" "Dec.1983" -0.0458 -0.0378 -0.0155
"25" "Jan.1984" 0.1199 0.0169 -0.0165
"26" "Feb.1284" -0.0766 -0.0799 -0.044
"27" "Mar.1984" -0.0511 -0.0147 0.0094
"28" "Apr.1984" -0.0194 0.0106 -0.0028
"29" "May1984" -0.0687 -0.0421 -0.0591
"30" "June1984" 0.0928 -0.0036 0.0158
"31" "July1984" -0.0704 0.0876 -0.0238
"32" "Aug.1984" 0.0905 0.1025 0.1031
"33" "Sept.1984" 0.0232 -0.0499 -0.0065
"34" "Oct.1984" -0.0054 0.1953 -0.0067
"35" "Nov.1984" 0.0082 -0.0714 -0.0167
"36" "Dec.1984" 0.0242 0.0469 0.0188
"37" "Jan.1985" 0.0153 0.1311 0.0733
"38" "Feb.1985" 0.0016 0.0461 0.0105
"39" "Mar.1985" 0.028 -0.0328 -0.007
"40" "Apr.1985" 0.0088 -0.0096 -0.0099
"41" "May1985" 0.0734 0.1272 0.0521
"42" "June1985" 0.0315 -0.0077 0.0117
"43" "July1985" -0.0276 0.0165 -0.0099
"44" "Aug.1985" 0.0162 -0.015 -0.0102
"45" "Sept.1985" -0.0975 -0.1479 -0.0428
"46" "Oct.1985" 0.0563 -0.0065 0.0376
"47" "Nov.1985" 0.1368 0.039 0.0628
"48" "Dec.1985" -0.069 0.0223 0.0391
"49" "Jan.1986" 0.1044 -0.069 2e-04
"50" "Feb.1986" 0.1636 0.1338 0.0688
"51" "Mar.1986" -0.019 0.1458 0.0486
"52" "Apr.1986" -0.0746 0.0063 -0.0174
"53" "May1986" 0.0433 0.0692 0.046
"54" "June1986" 0.0306 -0.0239 0.01
"55" "July1986" 0.0636 -0.0568 -0.0594
"56" "Aug.1986" 0.0917 0.0814 0.068
"57" "Sept.1986" -0.0796 -0.0889 -0.0839
"58" "Oct.1986" 0.0778 -0.0887 0.0481
"59" "Nov.1986" -0.0353 0.1037 0.0136
"60" "Dec.1986" -0.0137 -0.1163 -0.0322
Subscribe to:
Posts (Atom)