Friday, December 24, 2010

SVM Derivation

Saturday, December 18, 2010

Fourier Transform and Sunspots Data

I recoded parts of the Fourier analysis example that uses the sunspots data found here, since it needed updating for the most recent Python. All graphs match the original output.
import numpy as np
from numpy import array
from numpy.fft import fft
from matplotlib import pyplot as gplt

tempdata = np.loadtxt("sunspots.dat")

year=tempdata[:,0]
wolfer=tempdata[:,1]

gplt.plot(year,wolfer)
gplt.show()

gplt.plot(year[0:50],wolfer[0:50])
gplt.show()

Y=fft(wolfer)
gplt.plot(Y.real,Y.imag, 'r+')
gplt.xlim(-4000,2000)
gplt.ylim(-4000,4000)
gplt.show()

n=len(Y)
power = np.abs(Y[1:(n//2)])**2
nyquist=1./2
freq=array(range(n//2))/(n/2.0)*nyquist
freq += 0.00001
period=1./freq
gplt.plot(1/period[1:len(period)], power)
gplt.xlabel('cycles/year')
gplt.title('Periodogram')
gplt.show()

gplt.plot(1/period[1:40], power[1:40])
gplt.show()

gplt.plot(period[1:len(period)], power)
gplt.xlim(0,40)
gplt.show()
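As a cross-check on the frequency bookkeeping above, here is a minimal sketch that builds the same kind of periodogram with np.fft.rfft and np.fft.rfftfreq and reads off the strongest cycle. It runs on a synthetic series with an 11-year cycle rather than on sunspots.dat; the function name dominant_period and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def dominant_period(signal, sample_spacing=1.0):
    """Return (period, freqs, power) for the strongest non-zero frequency."""
    signal = np.asarray(signal, dtype=float)
    # Remove the mean so the zero-frequency term does not dominate.
    spectrum = np.fft.rfft(signal - signal.mean())
    freqs = np.fft.rfftfreq(len(signal), d=sample_spacing)
    power = np.abs(spectrum) ** 2
    # Skip index 0 (frequency zero, infinite period).
    peak = 1 + np.argmax(power[1:])
    return 1.0 / freqs[peak], freqs, power

# Synthetic stand-in for a yearly series: an 11-year cycle plus noise.
rng = np.random.default_rng(0)
years = np.arange(1700, 1700 + 330)
series = (50 + 40 * np.sin(2 * np.pi * (years - 1700) / 11.0)
          + rng.normal(0, 5, years.size))

period, freqs, power = dominant_period(series)
print("dominant period (years):", period)
```

rfftfreq returns the frequencies in cycles per sample directly, so there is no need to build the frequency axis by hand or to nudge it away from zero before inverting it.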

Wednesday, November 3, 2010

Math is like music, statistics is like literature

An excellent link from Dr. Gelman's blog. Statistician Dick De Veaux says:

We haven’t evolved to be statisticians. Our students who think statistics is an unnatural subject are right. This isn’t how humans think naturally. But it is how humans think rationally. And it is how scientists think. This is the way we must think if we are to make progress in understanding how the world works and, for that matter, how we ourselves work.

Monday, November 1, 2010

Multilevel Modeling on UK School Data

I came across a dataset on UK schools that has scores along with student and school gender information. After some digging on the Net, I found a sample Bugs model for this data, so I decided to give it a try. The data file is mentioned at http://dss.princeton.edu/training/Multilevel101.pdf and is available at http://dss.princeton.edu/training/schools.dta.

I converted this data to a Python-readable format and saved it as testing.dat. The sample Bugs model is at http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-software/reviewwinbugs.pdf

Other useful info: http://www.cmm.bristol.ac.uk/learning-training/multilevel-m-software/reviewaml.pdf

The example code is written in Python / JAGS. The model can certainly be improved. The keyword dflat() no longer seems to be supported, so dnorm(0.0,1.0E-6) was used instead where necessary.

Data and recent code

import rpy2.robjects as R
import rpy2.rinterface as rinterface
from rpy2.robjects.packages import importr
import rpy2.robjects.numpy2ri
from rpy2.robjects.vectors import FloatVector, StrVector
import numpy as np
r2jags = importr('R2jags')

schools = np.loadtxt("testing.dat",  skiprows=1)

n=len(schools)
J = int(np.max(schools[:,0])) # largest school id, used as the number of schools
y = schools[:,2]
school = schools[:,0]
standlrt = schools[:,3] # reading score
gender = schools[:,4] # student gender
school_gender = schools[:,5] # school gender: 1 mixed, 2 boys, 3 girls

# np.loadtxt returns floats, so compare against numbers, not strings
boysch = np.zeros(n)
boysch[school_gender == 2] = 1

girlsch = np.zeros(n)
girlsch[school_gender == 3] = 1

R.r.assign('n',n)
R.r.assign('J',J)
R.r.assign('y',y)
R.r.assign('school',school)
R.r.assign('standlrt',standlrt)
R.r.assign('gender',gender)
R.r.assign('boysch',boysch)
R.r.assign('girlsch',girlsch)

jags_data = StrVector(("n", "J", "y", "school", "standlrt", "gender", "boysch", "girlsch"))
jags_params = StrVector(("a", "beta","tau.y", "tau.a"))

jags_inits = R.r('''function (){
  list (a=rnorm(J),mu=rnorm(1),sigma.y=runif(1), sigma.a=runif(1))
  }''')

jags_schools = r2jags.jags(data = jags_data, inits = jags_inits,
                           parameters_to_save = jags_params,
                           n_iter = 10000,
                           n_chains = 3, n_thin = 100,
                           model_file = "testing.bug")

print(jags_schools)



Wednesday, October 13, 2010

N Maximum Elements using Python

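This function returns the indices of the N largest elements in a list, largest first.
def n_max(lst, n):
    # indices of lst, sorted by the value they point to (ascending)
    idx = sorted(range(len(lst)), key=lst.__getitem__)
    # keep the last n indices and put the largest first
    res = idx[len(lst)-n:]
    res.reverse()
    return res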
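The same result can be sketched with the standard library's heapq.nlargest, which avoids sorting the whole list when n is small. The name n_max_indices and the sample scores are made up for illustration; when values tie, the order of the returned indices may differ from the function above.

```python
import heapq

def n_max_indices(values, n):
    """Indices of the n largest entries of values, largest first."""
    return heapq.nlargest(n, range(len(values)), key=values.__getitem__)

scores = [3, 9, 1, 7, 5]
top = n_max_indices(scores, 2)
print(top)                       # [1, 3]
print([scores[i] for i in top])  # [9, 7]
```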

Friday, September 24, 2010

Regression with Student-t Errors

To fit a regression that assumes the error is distributed as Student-t instead of Normal, Statsmodels' TLinearModel is the class to use. In R the same thing is accomplished with the tlm function from the "hett" library. We share here R and Python sample code that runs both functions on the same data and reports the results.

The hett library can be installed in R with install.packages("hett"). TLinearModel is part of a development branch of Statsmodels, which you can get here:

https://code.launchpad.net/~scipystats/statsmodels/devel

test_tlm.py

import numpy as np
from scipy import stats, special, optimize
import scikits.statsmodels as sm
from scikits.statsmodels.miscmodels import TLinearModel

nobs = 60
nvars = 2
df = 3

mm = np.loadtxt("mm.dat",  skiprows=1, usecols = (3,4))

data_exog = mm[:,1]
data_endog = mm[:,0]

data_exog = sm.add_constant(data_exog)

res_ols = sm.OLS(data_endog, data_exog).fit()

kurt = stats.kurtosis(res_ols.resid)
df_fromkurt = 6./kurt + 4

modp = TLinearModel(data_endog, data_exog)

start_value = 0.1*np.ones(data_exog.shape[1]+2)

start_value[:nvars] = res_ols.params
start_value[-2] = df_fromkurt 
start_value[-1] = np.sqrt(res_ols.scale) #0.5
modp.start_params = start_value

fixdf = np.nan * np.zeros(modp.start_params.shape)
fixdf[-2] = 5

modp.fixed_params = fixdf
modp.fixed_paramsmask = np.isnan(fixdf)
modp.start_params = modp.start_params[modp.fixed_paramsmask]

resp = modp.fit(start_params = modp.start_params, disp=1, method='nm',
                maxfun=10000, maxiter=5000)

print('---------------------------')
print(resp.bse)
print(resp.params)


test_tlm.R

library("hett")
data(mm, package = "hett")
attach(mm)
# apparently the second ~ CRSP is for heteroscedasticity correction
# tfit2 <- tlm(m.marietta ~ CRSP, ~ CRSP, data = mm, start = list(dof = 3), estDof = TRUE)
tfit2 <- tlm(m.marietta ~ CRSP, data = mm, start = list(dof = 3), estDof = TRUE)
summary(tfit2)

#(Intercept) -0.006437   0.008096  -0.795     0.43    
#CRSP         1.206853   0.205935   5.860 2.31e-07 ***

mm.dat

"date" "am.can" "m.marietta" "CRSP"
"1" "Jan.1982" -0.0596 -0.1365 -0.03
"2" "Feb.1982" -0.17 -0.0769 -0.0584
"3" "Mar.1982" 0.0276 -0.0575 -0.0181
"4" "Apr.1982" 0.0058 0.0526 0.0306
"5" "May1982" -0.0106 -0.0449 -0.0397
"6" "June1982" 0.045 -0.0859 -0.0295
"7" "Jul.1982" -0.0243 -0.0742 -0.0316
"8" "Aug.1982" 0.1135 0.6879 0.1176
"9" "Sept.1982" -0.0331 -0.077 0.0075
"10" "Oct.1982" 0.0468 0.085 0.1098
"11" "Nov.1982" -0.0223 0.003 0.0408
"12" "Dec.1982" -0.0026 0.0754 0.0095
"13" "Jan.1983" 0.0166 -0.0412 0.0301
"14" "Feb.1983" 0.0343 -0.089 0.0221
"15" "Mar.1983" 0.0443 0.2319 0.0269
"16" "Apr.1983" 0.1477 0.1087 0.0655
"17" "May1983" 0.1728 0.0375 -0.003
"18" "June1983" -0.0372 0.0958 0.0325
"19" "July1983" -0.0451 0.0174 -0.0374
"20" "Aug.1983" -0.0257 -0.0724 0.0049
"21" "Sept.1983" 0.0509 0.075 0.0105
"22" "Oct.1983" 0.0035 -0.0588 -0.0257
"23" "Nov.1983" 0.1334 -0.062 0.0186
"24" "Dec.1983" -0.0458 -0.0378 -0.0155
"25" "Jan.1984" 0.1199 0.0169 -0.0165
"26" "Feb.1284" -0.0766 -0.0799 -0.044
"27" "Mar.1984" -0.0511 -0.0147 0.0094
"28" "Apr.1984" -0.0194 0.0106 -0.0028
"29" "May1984" -0.0687 -0.0421 -0.0591
"30" "June1984" 0.0928 -0.0036 0.0158
"31" "July1984" -0.0704 0.0876 -0.0238
"32" "Aug.1984" 0.0905 0.1025 0.1031
"33" "Sept.1984" 0.0232 -0.0499 -0.0065
"34" "Oct.1984" -0.0054 0.1953 -0.0067
"35" "Nov.1984" 0.0082 -0.0714 -0.0167
"36" "Dec.1984" 0.0242 0.0469 0.0188
"37" "Jan.1985" 0.0153 0.1311 0.0733
"38" "Feb.1985" 0.0016 0.0461 0.0105
"39" "Mar.1985" 0.028 -0.0328 -0.007
"40" "Apr.1985" 0.0088 -0.0096 -0.0099
"41" "May1985" 0.0734 0.1272 0.0521
"42" "June1985" 0.0315 -0.0077 0.0117
"43" "July1985" -0.0276 0.0165 -0.0099
"44" "Aug.1985" 0.0162 -0.015 -0.0102
"45" "Sept.1985" -0.0975 -0.1479 -0.0428
"46" "Oct.1985" 0.0563 -0.0065 0.0376
"47" "Nov.1985" 0.1368 0.039 0.0628
"48" "Dec.1985" -0.069 0.0223 0.0391
"49" "Jan.1986" 0.1044 -0.069 2e-04
"50" "Feb.1986" 0.1636 0.1338 0.0688
"51" "Mar.1986" -0.019 0.1458 0.0486
"52" "Apr.1986" -0.0746 0.0063 -0.0174
"53" "May1986" 0.0433 0.0692 0.046
"54" "June1986" 0.0306 -0.0239 0.01
"55" "July1986" 0.0636 -0.0568 -0.0594
"56" "Aug.1986" 0.0917 0.0814 0.068
"57" "Sept.1986" -0.0796 -0.0889 -0.0839
"58" "Oct.1986" 0.0778 -0.0887 0.0481
"59" "Nov.1986" -0.0353 0.1037 0.0136
"60" "Dec.1986" -0.0137 -0.1163 -0.0322

Tuesday, September 14, 2010

Statsmodels - Ordinary Least Squares

Here is a simple example of performing ordinary least squares using Scikits Statsmodels. An extra column of '1's was added to the X matrix to determine the intercept of the regression line; the sm.add_constant() call could also be used to add this column.
import numpy as np
import scikits.statsmodels as sm

y = [11,14,19,26]

X = [[1,1],[2,1],[3,1],[4,1]]

olsmod = sm.OLS(y, X)
olsres = olsmod.fit()

print olsres.params
print olsres.bse

# predict using new data points
ypred = olsmod.predict([[5,1],[6,1]])
print ypred
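To check the numbers without Statsmodels, the same fit can be sketched with plain NumPy. This is only a cross-check under the usual OLS formulas, not a replacement for the Statsmodels call; the variable names sigma2 and X_new are chosen here for illustration.

```python
import numpy as np

y = np.array([11, 14, 19, 26], dtype=float)
X = np.array([[1, 1], [2, 1], [3, 1], [4, 1]], dtype=float)

# Solve min ||X b - y||^2; b[0] is the slope, b[1] the intercept.
params, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# Standard errors: sqrt of the diagonal of sigma^2 (X'X)^-1.
dof = X.shape[0] - X.shape[1]
sigma2 = np.sum((y - X @ params) ** 2) / dof
bse = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# Predict for new data points (the intercept column is still needed).
X_new = np.array([[5, 1], [6, 1]], dtype=float)
ypred = X_new @ params

print(params)  # slope and intercept, both 5 for this data
print(bse)
print(ypred)   # 30 and 35
```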

Friday, September 10, 2010

GLM and Python, scikits.statsmodels

The Python version of the code for ARM 6.2 uses the GLM() call in the scikits.statsmodels package. AFAIK there is no equivalent of R's factor() in this package. Also, to get the intercept as in R, we need to add an extra column of '1's. Other than that, the results of the modified R calls (without factors) and the Python GLM calls match exactly.
import numpy
import scikits.statsmodels as sm
from scipy import stats

data = numpy.loadtxt("frisk_with_noise.dat", skiprows=7)

X = numpy.zeros((3,len(data[:,0])))
print(X.shape)

arrests = data[:,2]
arrests[arrests == 0] = 1
arrests = numpy.log(arrests)

stops = data[:,0]
stops[stops==0.0] = .0001

X[0,:] = arrests # arrests
X[1,:] = data[:,4] # eth
X[2,:] = numpy.ones(len(data[:,0])) # column of ones for the intercept

glm = sm.GLM(stops, X.T, family=sm.family.Poisson())
res = glm.fit()
Google Groups

Thursday, August 26, 2010

GCSR - Example 3.7

Attached is the Python source code for GCSR Example 3.7, the bioassay problem. The code is based on Helle Sorensen's code at KU. It samples from the alpha, beta posteriors using grid approximation and displays a contour plot, then samples from the joint distribution and displays an X-Y plot of the sampled values. The plots duplicate Figure 3.4 (a) and (b) from GCSR.

Code
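For readers who want the idea without opening the attachment, here is a minimal sketch of grid approximation for a two-parameter logistic dose-response model with a flat prior. It is not the attached code. The dose and death counts are the bioassay values as I recall them from the book's table, so check them against the text; the grid ranges, grid size and number of draws are also assumptions.

```python
import numpy as np

# Bioassay data (assumed from memory of the book's table; verify before use).
x = np.array([-0.86, -0.30, -0.05, 0.73])   # log dose
n = np.array([5, 5, 5, 5])                  # animals per group
y = np.array([0, 1, 3, 5])                  # deaths per group

alpha = np.linspace(-5, 10, 200)
beta = np.linspace(-10, 40, 200)
A, B = np.meshgrid(alpha, beta, indexing="ij")

# Binomial log-likelihood with logit link, evaluated at every grid point.
eta = A[..., None] + B[..., None] * x
log_post = np.sum(y * eta - n * np.logaddexp(0, eta), axis=-1)

# Normalize on the grid (subtract the max first to avoid underflow).
post = np.exp(log_post - log_post.max())
post /= post.sum()

# Sample grid cells in proportion to their posterior mass.
rng = np.random.default_rng(0)
idx = rng.choice(post.size, size=1000, p=post.ravel())
ia, ib = np.unravel_index(idx, post.shape)
alpha_s, beta_s = alpha[ia], beta[ib]

print(alpha_s.mean(), beta_s.mean())
```

The post array can be passed to a contour plot and the (alpha_s, beta_s) pairs to a scatter plot to get pictures in the spirit of the two figures mentioned above.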

Wednesday, August 18, 2010

GCSR

I will publish sample code (Python) and solutions for selected questions from Andrew Gelman's excellent book Bayesian Data Analysis, 2nd Edition (GCSR). Each posting will be the solution to one question and, where required, will include mathematical formulas (using mathurl.com) and Python and R code. The R code will be copied from Gelman's own solutions PDF. We might also include code from Guilherme Rocha (another Bayesian Statistics lecturer), or from Jeff Hart (Texas A&M), or from Brian Junker (Carnegie Mellon University).

To install R on Ubuntu, run apt-get install r-base r-base-dev.

You can run the R code from command line using "R -f [file]". If there are plots displayed in the R code, they will be dumped into a file called Rplots.pdf in your current directory.

Github

Tuesday, August 17, 2010

Sampling With Replacement Using Weights in Python

Here is the Python function corresponding to the sample() call in R. We based it on the code here; we only changed it so that the inputs are separate weight and value vectors instead of one vector of (weight, value) tuples.
import random

items = [(10, "low"),
(100, "mid"),
(890, "large")]

w = [10, 100, 890]
v = ["low", "mid", "large"]

def weighted_sample(ws, vs, n):
    total = float(sum(ws))
    i = 0
    w = ws[0]
    v = vs[0]
    while n:
        x = total * (1 - random.random() ** (1.0 / n))
        total -= x
        while x > w:
            x -= w
            i += 1
            w = ws[i]
            v = vs[i]
        w -= x
        yield v
        n -= 1

for i in weighted_sample(w, v, 500):
    print(i)
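On Python 3.6 and later, the standard library's random.choices covers the same use case directly. The sketch below uses the same weights and values as above; the fixed seed and the Counter summary are additions for illustration.

```python
import random
from collections import Counter

w = [10, 100, 890]
v = ["low", "mid", "large"]

random.seed(42)  # fixed seed only so the run is repeatable
draws = random.choices(v, weights=w, k=500)

# Tally how often each value was drawn.
counts = Counter(draws)
print(counts)
```

One difference to keep in mind: the generator above walks through the weights once, so its output comes out grouped in list order, while random.choices returns the draws in random order.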

Saturday, July 17, 2010

Bayes Election Prediction R Code, Gelman, Tokdar

Attached is the R code that calculates a prediction for the 1992 presidential election based on methods found in Andrew Gelman's Bayesian Data Analysis book. The R code itself was written by Dr. Surya Tokdar; however, the links at CMU that housed this and his other lecture-related materials are all gone. I salvaged what I could from Google cache and removed all plotting and graphics-related portions of the code, leaving only the MCMC Gibbs calculations intact. The data was also missing at the CMU link, so I recreated both data files based on Andrew Gelman's original presidential.asc file. Even then, the R chol() call (for Cholesky decomposition) gave errors until I replaced all NA values with 0's. The code finally worked with this fix, and its results seem correct: a Democrat win is predicted for most states in 1992, which actually was the case. The code and data files can be found below. Let me know of your questions and comments.

Code

Monday, July 5, 2010

Change Point Analysis using MCMC Gibbs Sampling on Coal Mining Data (in Python)

The code is here. This analysis is performed on British coal mining accident data, which is included in the zip file as well. The function coal() performs change point analysis using MCMC Gibbs sampling, modeling the data with two Poisson distributions. Change point analysis finds the point k at which the second Poisson distribution, instead of the first, comes into effect, which indicates a change in regulations, etc. at that point in time. I wrote the Python script based on Peter Neil's R code from Manchester University.

Also included is the same analysis performed using Bugs (JAGS). This code is adapted from the book Bayesian Computation with R.
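The Gibbs scheme described above can be sketched in a few lines. This is not the coal() function from the zip file: it is a minimal version under an assumed, standard formulation (Gamma(a, b) priors on both Poisson rates, a uniform prior on k), and it runs on synthetic counts rather than the coal mining data. The function name and all default values are hypothetical.

```python
import numpy as np

def poisson_changepoint_gibbs(y, n_iter=2000, burn_in=500, a=1.0, b=1.0, seed=0):
    """Gibbs sampler for one change point between two Poisson rates.

    Assumed model: y[i] ~ Poisson(lam1) for i < k, Poisson(lam2) for i >= k,
    lam1, lam2 ~ Gamma(a, rate=b), k uniform on 1..n-1.
    Returns an array with columns (k, lam1, lam2).
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = len(y)
    csum = np.concatenate(([0], np.cumsum(y)))  # csum[k] == sum(y[:k])
    total = csum[-1]
    ks = np.arange(1, n)                        # candidate change points
    k = n // 2
    samples = []
    for it in range(n_iter):
        # Conjugate Gamma updates; NumPy's gamma takes (shape, scale=1/rate).
        lam1 = rng.gamma(a + csum[k], 1.0 / (b + k))
        lam2 = rng.gamma(a + total - csum[k], 1.0 / (b + n - k))
        # Discrete conditional for k, computed on the log scale.
        logp = (csum[ks] * np.log(lam1) - ks * lam1
                + (total - csum[ks]) * np.log(lam2) - (n - ks) * lam2)
        p = np.exp(logp - logp.max())
        k = rng.choice(ks, p=p / p.sum())
        if it >= burn_in:
            samples.append((k, lam1, lam2))
    return np.array(samples)

# Synthetic yearly counts: rate 3 for 40 years, then rate 1 for 70 years.
data_rng = np.random.default_rng(1)
y = np.concatenate((data_rng.poisson(3.0, 40), data_rng.poisson(1.0, 70)))

draws = poisson_changepoint_gibbs(y)
k_hat = int(np.median(draws[:, 0]))
print("change point:", k_hat)
print("rates:", draws[:, 1].mean(), draws[:, 2].mean())
```

Each sweep draws the two rates from their Gamma conditionals given k, then draws k from its discrete conditional given the rates; a histogram of the first column of draws shows how sharply the data locate the change.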

Sampling Indices From Weight Vector

Here is the code:
import random

def w_choice(lst):
    n = random.uniform(0, 1)
    for item, weight in enumerate(lst):
        if n < weight:
            break
        n = n - weight
    return item

# testing
prob = [0.1, 0.2, 0.5, 0.2]
print(w_choice(prob))
This function returns an index into a sequence, drawn according to the weights in that sequence. We modified the code found here to get the function above.

The code above assumes a normalized weight vector, that is, all probability values in the vector should add up to one. If the parameter passed into w_choice is not normalized, the normalization can be performed with a single line of Python code at the beginning of w_choice:
lst = [w / float(sum(lst)) for w in lst]
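A variant that needs no normalization step can be sketched with cumulative sums and bisect from the standard library: drawing the uniform number on [0, total] instead of [0, 1] does the scaling implicitly. The name w_choice_index and the test weights are made up for illustration.

```python
import bisect
import itertools
import random

def w_choice_index(weights, rng=random):
    """Draw one index with probability proportional to weights[i]."""
    cumulative = list(itertools.accumulate(weights))
    # Scaling the uniform draw by the total replaces explicit normalization.
    u = rng.uniform(0, cumulative[-1])
    # Clamp in case u lands exactly on the total.
    return min(bisect.bisect_right(cumulative, u), len(weights) - 1)

random.seed(0)
prob = [1, 2, 5, 2]   # unnormalized on purpose
counts = [0] * len(prob)
for _ in range(10000):
    counts[w_choice_index(prob)] += 1
print(counts)
```

If many draws are needed from the same weights, the cumulative list can be built once outside the function, which turns each draw into a single binary search.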

Wednesday, June 9, 2010

Siftpy - Python, SIFT, siftpp

Here is a first version of a Python interface called siftpy for the excellent siftpp C++ code written by Andrea Vedaldi. The SIFT algorithm was devised (and patented) by David Lowe, who has his own C implementation of the algorithm; Vedaldi's code is written from scratch without any dependencies.

Monday, June 7, 2010

My Correction for Particle Filter Cookbook

The correction I sent to SciPy.org (sponsored by Enthought) for the particle filter Python code has been made available by Alexander Borghgraef. The main page for the code is here.

Sunday, May 9, 2010

nosql

My comments in response to the nosql episode of the Command Line podcast:

There were a few mistakes in how nosql databases were perceived. First of all, the advantage of nosql is not that it does something SQL databases "cannot do". It does distribution of data out of the box; that is, distribution is so simplified, so ingrained in the product, that you don't even think twice about it. With SQL databases, sharding and distribution are an afterthought. Not that you cannot DO these with SQL databases, it's just that with nosql these tasks are SIMPLER. They are included in the product from day one.

There are pedagogical issues at play here, which are almost as important as the technological ones.

The same is true for basic CRUD operations. They are SIMPLER with nosql than they are with SQL dbs. With Google Bigtable, I define Model classes in Python, send them over to the cloud, and I _have_ a database. Following pointers, as in order.owner.address.street, is very simple to do and built in, in contrast to SQL databases where you have to use something like Hibernate to achieve the same result.

Plus, nosql makes you conscious of sharding of data from day one; since joins are discouraged, you think distribution, and you have to think big. Sure, for small Web sites with a small # of users you can use one database and keep using joins, but you can also use one nosql shard with LESS complicated queries (meaning no joins) and achieve the same result.

Wednesday, March 24, 2010

Kalman Filter in Python

The attached Kalman filter code is based on a Python example found in the book Machine Learning: An Algorithmic Perspective by Stephen Marsland. We simply turned the code into a class that keeps its state across invocations and can therefore work in an online fashion. The example code assumed all data was available before it started, and ran everything in one mainline.
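To illustrate what keeping state across invocations looks like, here is a minimal sketch of a linear Kalman filter as a class. It is not the attached code and not Marsland's example; the class name, the argument names (A, H, Q, R for the transition, observation, process noise and measurement noise matrices) and the toy data are all assumptions.

```python
import numpy as np

class OnlineKalman:
    """Linear Kalman filter that keeps its state between update() calls."""

    def __init__(self, A, H, Q, R, x0, P0):
        self.A, self.H, self.Q, self.R = (
            np.atleast_2d(m).astype(float) for m in (A, H, Q, R))
        self.x = np.atleast_1d(x0).astype(float)   # current state estimate
        self.P = np.atleast_2d(P0).astype(float)   # current state covariance

    def update(self, z):
        """Fold in one new measurement z and return the updated estimate."""
        # Predict step.
        x_pred = self.A @ self.x
        P_pred = self.A @ self.P @ self.A.T + self.Q
        # Correct step.
        S = self.H @ P_pred @ self.H.T + self.R
        K = P_pred @ self.H.T @ np.linalg.inv(S)
        self.x = x_pred + K @ (np.atleast_1d(z) - self.H @ x_pred)
        self.P = (np.eye(len(self.x)) - K @ self.H) @ P_pred
        return self.x

# Toy run: noisy readings of a constant value, fed in one at a time.
rng = np.random.default_rng(0)
kf = OnlineKalman(A=1.0, H=1.0, Q=1e-5, R=0.5 ** 2, x0=0.0, P0=1.0)
for z in 5.0 + rng.normal(0, 0.5, 200):
    estimate = kf.update(z)
print(estimate)
```

Because the estimate and its covariance live on the object, the caller can pass measurements as they arrive instead of handing over the whole series up front.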

Thursday, January 21, 2010

Porting code from Matlab / Octave to Python Numpy

Things to watch out for while porting code from Octave / Matlab to Python Numpy:
  • The V returned from Numpy's U, D, V = svd(...) is not the same V as in Matlab / Octave. To get the equivalent, use V.T in Python.
  • Instead of a = [2 3 4] you write a = [2, 3, 4].
  • Use * in place of .* (elementwise multiplication).
  • A -1 in reshape tells Numpy to infer that dimension from the others: reshape(-1, 3) means "I don't care about the number of rows, just work it out from three columns".
  • D returned from the svd call is not in diagonal matrix form. It is simply a vector of the values that form the diagonal of Matlab's D. This is most likely done for efficiency reasons, and it makes sense. If you need this data in diagonal form, simply call Numpy's diag(D) and it will form the square matrix for you.
  • Use dot() instead of * (matrix multiplication).
  • Don't forget that Matlab / Octave use 1-based indexing of arrays where Python uses 0. So x(3, :) in Matlab / Octave becomes x[2, :].
  • Oh, [] instead of () for indexing. Of course.
  • ** instead of .^
  • Instead of find() you write the filter condition directly on the matrix or vector variable itself, but then you need to call nonzero() to get index values; otherwise you get True, False values. The call ind = find(abs(x(3,:)) > bla) becomes ind = (abs(x[2, :]) > bla).nonzero()
  • The constant eps is not defined as a bare name. I simply hardcode it globally as eps=1e-15; np.finfo(float).eps gives the actual machine epsilon.
  • Instead of the special index value 'end', you use the negative index -1; likewise -2 for end-1 and -3 for end-2. Things get a bit confusing, however, when a "range" is involved, such as all columns up to and including the last one. In that case you don't use -1 at all, just leave the end of the range blank.
    Example: a = array([[1,3,8],[2,4,0],[9,9,9]])
    print a[:,-1] gives [8 0 9]
    print a[:,1:] gives [[3 8] [4 0] [9 9]]
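Most of the points above can be tried out in one short script. This is only an illustrative sketch on the small matrix from the example; the threshold 8 stands in for "bla", and the Matlab lines in the comments are the intended counterparts, not output from Matlab.

```python
import numpy as np

a = np.array([[1, 3, 8], [2, 4, 0], [9, 9, 9]], dtype=float)

# svd returns the singular values as a vector and V already transposed.
U, d, Vt = np.linalg.svd(a)
V = Vt.T                  # counterpart of Matlab's V (signs may differ)
D = np.diag(d)            # square diagonal matrix, if needed
reconstructed = U @ D @ Vt

elementwise = a * a       # Matlab: a .* a
matrix_prod = a.dot(a)    # Matlab: a * a
squared = a ** 2          # Matlab: a .^ 2

eps = np.finfo(float).eps                 # machine epsilon
third_row = a[2, :]                       # Matlab: a(3, :)
ind = (np.abs(a[2, :]) > 8).nonzero()[0]  # Matlab: find(abs(a(3,:)) > 8), 0-based here
last_col = a[:, -1]                       # Matlab: a(:, end)
from_second = a[:, 1:]                    # Matlab: a(:, 2:end)
one_column = a.reshape(-1, 1)             # -1 lets Numpy work out the row count

print(last_col)
print(from_second)
print(ind)
```

Note that nonzero() returns a tuple with one index array per dimension, which is why the sketch takes element [0] for the one-dimensional case.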

Reference