tag:blogger.com,1999:blog-27030912216725441132024-02-06T21:03:19.337-08:00A ScratchpadBurak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.comBlogger55125tag:blogger.com,1999:blog-2703091221672544113.post-2674333508354824982016-02-03T04:00:00.004-08:002016-02-03T04:01:37.732-08:00Dr. Shalizi's Book<div dir="ltr" style="text-align: left;" trbidi="on">
I was mentioned in the Acknowledgements section of Dr. Cosma Shalizi's excellent book <i>Advanced Data Analysis from an Elementary Point of View</i> (<a href="http://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf">link</a>). It is an honor! I had found a small mistake in a formula and informed Dr. Shalizi about it.<br />
<br /></div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-8131078788758032712015-12-25T02:52:00.012-08:002020-08-30T12:04:39.434-07:00Some Tutorials in Turkish<div dir="ltr" style="text-align: left;" trbidi="on"><div><div><a href="https://burakbayramli.github.io/dersblog/linear">Lineer Cebir (Linear Algebra)</a></div><div><br /></div><div><a href="https://burakbayramli.github.io/dersblog/ode">Diferansiyel Denklemler (Ordinary Differential Equations)</a></div><div><br /></div><div><a href="https://burakbayramli.github.io/dersblog/calc_multi">Çok Değişkenli Calculus (Multivariable Calculus)</a></div><div><br /></div><div><a href="https://burakbayramli.github.io/dersblog/compscieng">Hesapsal Bilim (Computational Science)</a></div><div><br /></div><div><a href="https://burakbayramli.github.io/dersblog/stat">İstatistik, Yapay Öğrenim, Veri Analizi (Statistics, Machine Learning, Data Analysis)</a></div><div><br /></div><div><a href="https://burakbayramli.github.io/dersblog/tser">Zaman Serileri ve Finans (Time Series and Finance)</a></div><div><br /></div><div><a href="https://burakbayramli.github.io/dersblog/pde">Kısmi Diferansiyel Denklemler (Partial Differential Equations)</a></div><div><br /></div><div><a href="https://burakbayramli.github.io/dersblog/func_analysis">Fonksiyonel Analiz (Functional Analysis)</a></div><div><br /></div><div><a href="https://burakbayramli.github.io/dersblog/algs/index.html">Bilgisayar Bilim, Yapay Zeka (Computer Science, AI)</a></div><div><br /></div><div><a href="https://burakbayramli.github.io/dersblog/chaos">Gayri Lineer Dinamik ve Kaos (Non-Linear Dynamics and Chaos)</a></div><div><br /></div><div><a href="https://burakbayramli.github.io/dersblog/vision">Yapay Görüş (Computer Vision)</a></div><div><br /></div><div><a href="https://burakbayramli.github.io/dersblog/phy/index.html">Fizik</a></div><div><br /></div><div><a 
href="https://burakbayramli.github.io/dersblog/sk/index.html">IT, Bilişim (Informatics)</a></div><div><br /></div></div>
</div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-43389211299121271532015-12-18T00:32:00.003-08:002016-05-13T04:48:23.611-07:00Backtesting<div dir="ltr" style="text-align: left;" trbidi="on">
For stock trading one usually needs a backtesting framework. I prefer Python, and here is a comprehensive list:<br />
<br />
<a href="http://quant.stackexchange.com/questions/8896/except-zipline-are-there-any-other-pythonic-algorithmic-trading-library-i-can-c">Link</a><br />
<br />
I heard about this list from <a href="http://qoppac.blogspot.co.uk/2015/12/pysystemtrade.html">here</a>, where the author was announcing his own backtester.<br />
<br />
I just played with <a href="http://gbeced.github.io/pyalgotrade/">pyalgotrade</a>, and it looks good.<br />
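Whatever framework you choose, the core of a backtest is the same loop: compute a signal, lag it by one bar, and accumulate returns. Here is a minimal, hedged sketch of a moving-average crossover backtest in plain numpy (illustrative only; this is not pyalgotrade's API):

```python
import numpy as np

def sma_crossover_backtest(prices, fast=5, slow=20):
    """Toy backtest: long when the fast SMA is above the slow SMA.
    Returns the strategy's cumulative return over the price series."""
    prices = np.asarray(prices, dtype=float)
    def sma(x, n):
        # rolling mean via convolution; entry j covers prices[j:j+n]
        return np.convolve(x, np.ones(n) / n, mode='valid')
    f = sma(prices, fast)[slow - fast:]   # align fast SMA with slow SMA endpoints
    s = sma(prices, slow)
    signal = (f > s).astype(float)        # 1 = long, 0 = flat
    rets = np.diff(prices[slow - 1:]) / prices[slow - 1:-1]
    strat = signal[:-1] * rets            # trade on the *previous* bar's signal
    return float(np.prod(1 + strat) - 1)
```

The one-bar lag (`signal[:-1]`) matters: using the same bar's signal on the same bar's return is look-ahead bias, a classic backtesting mistake.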
<br /></div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-50456179527853440612015-09-21T00:48:00.002-07:002017-05-04T05:36:04.411-07:00Python Code for the Algorithmic Trading Book<div dir="ltr" style="text-align: left;" trbidi="on">
I converted some of the code for Dr. Ernie Chan's <i>Algorithmic Trading</i> book into Python. It is open-sourced <a href="https://github.com/burakbayramli/quant_at">here</a>. Dr. Chan mentions our project <a href="http://epchan.blogspot.de/2015/09/interview-with-euan-sinclair.html">here</a> (at the end of the post).<br />
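To give a flavor of the kind of code involved: a building block that recurs in mean-reversion strategies is the rolling z-score of a price series. The sketch below is my own illustration, not code taken from the repository or the book:

```python
import numpy as np

def rolling_zscore(series, lookback=20):
    """Z-score of each value against the trailing window's mean and
    standard deviation; a common mean-reversion entry signal."""
    x = np.asarray(series, dtype=float)
    out = np.full(len(x), np.nan)      # not enough history -> NaN
    for i in range(lookback, len(x)):
        w = x[i - lookback:i]
        sd = w.std()
        if sd > 0:
            out[i] = (x[i] - w.mean()) / sd
    return out
```

A typical rule would then short when the z-score is above some threshold and buy when it is below the negative threshold, betting on reversion to the window mean.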
<div>
<div>
<br /></div>
</div>
</div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-66029938288818414442015-03-18T03:29:00.001-07:002015-03-18T03:29:53.074-07:00Data Science Done Well Looks Easy<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://www.simplystatistics.org/2015/03/17/data-science-done-well-looks-easy-and-that-is-a-big-problem-for-data-scientists/">Link</a><br />
<br />
Data science has a ton of different definitions. For the purposes of this post I'm going to use the definition of data science we used when creating our Data Science program online. Data science is:<br /><br />Data science is the process of formulating a quantitative question that can be answered with data, collecting and cleaning the data, analyzing the data, and communicating the answer to the question to a relevant audience [..].<br /><br />A good data science project answers a real scientific or business analytics question. In almost all of these experiments the vast majority of the analyst's time is spent on getting and cleaning the data (steps 2-3) and communication and reproducibility (6-7). In most cases, if the data scientist has done her job right the statistical models don't need to be incredibly complicated to identify the important relationships the project is trying to find. In fact, if a complicated statistical model seems necessary, it often means that you don't have the right data to answer the question you really want to answer. One option is to spend a huge amount of time trying to tune a statistical model to try to answer the question but serious data scientists usually instead try to go back and get the right data.<br /><br />The result of this process is that most well executed and successful data science projects don't (a) use super complicated tools or (b) fit super complicated statistical models. The characteristics of the most successful data science projects I've evaluated or been a part of are: (a) a laser focus on solving the scientific problem, (b) careful and thoughtful consideration of whether the data is the right data and whether there are any lurking confounders or biases and (c) relatively simple statistical models applied and interpreted skeptically.<br /><br />It turns out doing those three things is actually surprisingly hard and very, very time consuming. 
It is my experience that data science projects take a solid 2-3 times as long to complete as a project in theoretical statistics. The reason is that inevitably the data are a mess and you have to clean them up, then you find out the data aren't quite what you wanted to answer the question, so you go find a new data set and clean it up, etc. After a ton of work like that, you have a nice set of data to which you fit simple statistical models and then it looks super easy to someone who either doesn't know about the data collection and cleaning process or doesn't care.<br /><br />This poses a major public relations problem for serious data scientists. When you show someone a good data science project they almost invariably think "oh that is easy" or "that is just a trivial statistical/machine learning model" and don't see all of the work that goes into solving the real problems in data science. <br />
<br /></div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-62282576007836639532014-11-12T01:29:00.000-08:002014-11-12T01:29:00.902-08:00Movielens, Funk SVD, Numba<div dir="ltr" style="text-align: left;" trbidi="on">
The Python version of Funk SVD, coded with Numba (to execute at C speeds), can be found <a href="http://sayilarvekuramlar.blogspot.de/2014/11/movielens-funk-svd-numba.html">here</a>.<br />
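For reference, the heart of Funk SVD is a plain SGD loop over the observed ratings; below is a minimal numpy sketch of the idea (my own illustration; the linked Numba version differs in details, and in practice you would wrap the inner loop with Numba's @njit to get the C-like speed):

```python
import numpy as np

def funk_svd(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=1000):
    """Plain SGD matrix factorization (Funk SVD).
    ratings: iterable of (user, item, value) triples."""
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, k))   # user factors
    V = rng.normal(scale=0.1, size=(n_items, k))   # item factors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]
            u_old = U[u].copy()                    # update both with old values
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V
```

The prediction for user u and item i is simply the dot product `U[u] @ V[i]`; regularization (`reg`) keeps the factors from overfitting the sparse rating matrix.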
<br /></div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-32916676344669178152014-10-16T00:44:00.000-07:002014-10-16T00:44:08.828-07:00emacs-ipython<div dir="ltr" style="text-align: left;" trbidi="on">
Here is an Emacs extension, <a href="https://github.com/burakbayramli/emacs-ipython">emacs-ipython</a>, that allows one to execute ipython code snippets from inside an Emacs LaTeX buffer and display the results (as graphics or text) directly in the same buffer. The mode was developed to avoid the ipython notebook's Web interface and ipynb files, which cannot be edited with a plain text editor. This way we get the best of both worlds: Emacs for editing TeX, and ipython for running code.<br />
<br /></div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-55964582336195964872013-09-05T08:06:00.003-07:002013-09-05T08:06:28.920-07:00Google Paper and Data Science Tips<div dir="ltr" style="text-align: left;" trbidi="on">
An excellent <a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/ja//pubs/archive/41159.pdf">paper</a> by Google, full of practical tips.<br />
<br /></div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-29619797154198464112013-08-22T07:17:00.001-07:002013-08-22T07:17:40.959-07:00What Hackers Should Know About Machine Learning <div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://www.fastcolabs.com/3016160/what-hackers-should-know-about-machine-learning">Link </a><br />
<br />
Data analysis as an exploratory endeavor should be the first part of
anything. You should never go into a project and say “The thing that I
want to do is classification so I'm always going to run my favorite
classification algorithm.” For the first half of the book we talk about
“Here's a dataset, here's how to clean it up.” The chapters that John
Myles White wrote on means, medians, modes, and distributions are always
the things that you should do in the beginning. We want to hammer home
that it's not just input-output. Input, look around, see what's going
on, find structure in the data, then make the choice for methods. And
then maybe iterate a couple of them. It's very cyclic. It's not linear [..]<br />
<br />My thinking has evolved on presenting results. The way I think
about presenting results now is always in the browser as an interactive
thing. There's a tremendous amount of value in providing the audience
with the ability to ask second-order questions about what they are
observing rather than first-order ones. Imagine the thing you are
looking at is just a simple scatterplot and you see one outlier. So a
first-order question would be who is that outlier? If you have an
interactive thing where you can go over the dot and it tells you who
that is, and the second order question is why is that an outlier?<br />
<br /></div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-29190886276598920342013-08-05T00:42:00.003-07:002013-08-05T00:42:47.707-07:0010 Best Practices in Operational Analytics<div dir="ltr" style="text-align: left;" trbidi="on">
Great set of <a href="http://www.slideshare.net/jamet123/10-best-practices-in-operational-analytics-6871966">slides</a> on ensembles, feature engineering, and data preparation.<br />
<br /></div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-10434227665562986812013-07-10T04:51:00.001-07:002013-07-10T06:47:29.242-07:00Data Agnosticism: Feature Engineering Without Domain Expertise<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<br />
<iframe allowfullscreen="" frameborder="0" height="360" src="http://www.youtube.com/embed/bL4b1sGnILU?feature=player_embedded" width="400"></iframe><br />
<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmPiPgPe7E1WWio0HU_FHLY2VWKTLxnAazUoGYsMO-LkKdtE8IZdY47qGmrhwQm1RUZn7HL-osN5XrL39c80WLnuUODWZllIsoQfx4kveb5kaFxxKX4ba6TYYda15JjdN8TPvb6257wDU/s1600/Screenshot+from+2013-07-10+14:12:17.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="254" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhmPiPgPe7E1WWio0HU_FHLY2VWKTLxnAazUoGYsMO-LkKdtE8IZdY47qGmrhwQm1RUZn7HL-osN5XrL39c80WLnuUODWZllIsoQfx4kveb5kaFxxKX4ba6TYYda15JjdN8TPvb6257wDU/s320/Screenshot+from+2013-07-10+14:12:17.png" width="320" /></a> </div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-57178138170671191822013-05-02T13:46:00.002-07:002013-05-02T13:51:26.514-07:00Getting to know your data<div dir="ltr" style="text-align: left;" trbidi="on">
<i>Witten, Data Mining, Practical Machine Learning Tools and Techniques, pg 60</i><br />
<br />
There is no substitute for getting to know your data. Simple tools that show histograms of the distribution of values of nominal attributes, and graphs of the values of numeric attributes (perhaps sorted or simply graphed against instance number), are very helpful. These graphical visualizations of the data make it easy to identify outliers, which may well represent errors in the data file—or arcane conventions for coding unusual situations, such as a missing year as 9999 or a missing weight as -1 kg, that no one has thought to tell you about. Domain experts need to be consulted to explain anomalies, missing values, the significance of integers that represent categories rather than numeric quantities, and so on. Pairwise plots of one attribute against another, or each attribute against the class value, can be extremely revealing.<br />
<br />
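A few lines of numpy implement exactly this kind of first look. The sketch below (my own illustration, not from the book) counts sentinel codes like the 9999 "missing year" or -1 kg "missing weight" Witten mentions, and flags crude outliers:

```python
import numpy as np

def quick_look(values, sentinels=(9999, -1)):
    """First-pass summary of a numeric column: size, counts of
    suspicious sentinel codes, and crude outliers (more than
    3 median absolute deviations from the median)."""
    x = np.asarray(values, dtype=float)
    report = {'n': len(x)}
    report['sentinel_counts'] = {s: int((x == s).sum()) for s in sentinels}
    med = np.median(x)
    mad = np.median(np.abs(x - med)) or 1.0   # guard against zero MAD
    report['outliers'] = x[np.abs(x - med) > 3 * mad].tolist()
    return report
```

Running this per column before any modeling is a cheap way to surface the "arcane coding conventions" that no one thought to tell you about.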
Data cleaning is a time-consuming and labor-intensive procedure but one that is absolutely necessary for successful data mining. With a large dataset, people often give up—how can they possibly check it all? Instead, you should sample a few instances and examine them carefully. You’ll be surprised at what you find. Time looking at your data is always well spent.</div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-65974100675716912572013-03-17T04:33:00.000-07:002013-03-17T04:33:05.721-07:00Data Science<div dir="ltr" style="text-align: left;" trbidi="on">
Typical data science analyses:<br />
<br />
Recommendation engines – increase cross-sell and repeat purchases by identifying other products in which a customer or prospect is likely to be interested<br /><br />Web analytics - advanced click-stream, golden path analysis, viewer engagement, segmentation, and more. <br /><br />Cross-channel marketing attribution – move beyond the skewed input of last click analysis to accurately determine campaign impact effectiveness across all channels<br /><br />Influencer analysis – understand whose actions have impact in the network to encourage the behavior of peers for purchases, attrition, or just engagement. <br /></div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-11771012956311282792013-03-06T03:15:00.005-08:002013-03-06T03:15:55.832-08:00Practical machine learning tricks - KDD 2011<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://blog.david-andrzejewski.com/machine-learning/practical-machine-learning-tricks-from-the-kdd-2011-best-industry-paper/">Link</a><br />
<br />
At first glance, this might appear to be a "Hello-World" machine
learning problem straight out of a textbook or tutorial: we simply
train a Naive Bayes on a set of bad ads versus a set of good ones.
However this is apparently <strong>far</strong> from being the case - while Google
is understandably shy about hard numbers, the paper mentions several
issues which make this especially challenging and notes that this is a
business-critical problem for Google.<br />
<br />
--<br />
<br />
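For reference, the "textbook" baseline the quote refers to really does fit in a few lines; here is a from-scratch Bernoulli Naive Bayes sketch on toy documents (purely illustrative, and obviously nothing like Google's production system):

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Bernoulli Naive Bayes with Laplace smoothing.
    docs: list of token sets; labels: class label per doc."""
    vocab = set().union(*docs)
    model = {}
    for c in set(labels):
        idx = [i for i, l in enumerate(labels) if l == c]
        df = Counter(t for i in idx for t in docs[i])  # document frequency
        p = {t: (df[t] + 1) / (len(idx) + 2) for t in vocab}
        model[c] = {'prior': math.log(len(idx) / len(docs)),
                    'logp': {t: math.log(p[t]) for t in vocab},
                    'logq': {t: math.log(1 - p[t]) for t in vocab}}
    return model

def predict_nb(model, doc):
    def score(c):
        m = model[c]
        return m['prior'] + sum(
            m['logp'][t] if t in doc else m['logq'][t] for t in m['logp'])
    return max(model, key=score)
```

The gap between this toy and a deployed ad-quality system (adversarial drift, skewed classes, feature churn) is exactly what the linked paper is about.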
There are many useful suggestions in this post. </div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-56875844015790444972012-12-18T09:36:00.000-08:002012-12-18T09:37:15.289-08:00Geometry, Machine Learning and Deep Learning<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://geomblog.blogspot.de/2012/12/nips-ii-deep-learning-and-evolution-of.html">Link </a><br />
<br />
Over the years [..] the way we construct mental models of data has changed. And as I've argued before, understanding how we think about data, and what shape we give it, is key to the whole enterprise of finding patterns in data.<br />
<br />
The model that one always starts with is Euclidean space. Data = points, features = dimensions, and so on. And as a first approximation of a data model, it isn't terrible.<br />
<br />
There are many ways to modify this space. You can replace the ℓ2 norm by ℓ1. You can normalize the points (again with ℓ2 or ℓ1, sending you to the sphere or the simplex). You can weight the dimensions, or even do a wholesale scale-rotation.<br />
<br />
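As an aside, the normalizations mentioned above are one-liners; a quick numpy sketch of sending points to the sphere (unit ℓ2 norm) or the simplex (unit ℓ1 norm):

```python
import numpy as np

def to_sphere(X):
    """Normalize each row to unit l2 norm: points land on the sphere."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def to_simplex(X):
    """Normalize each nonnegative row to unit l1 norm: points land on the simplex."""
    return X / np.abs(X).sum(axis=1, keepdims=True)
```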
But that's not all. Kernels take this to another level. You can encode weak nonlinearity in the data by assuming that it's flat once you lift it. In a sense, this is still an ℓ2 space, but a larger class of spaces that you can work with. The entire SVM enterprise was founded on this principle.<br />
<br />
But that's not all either. The curse of dimensionality means that it's difficult to find patterns in such high dimensional data. Arguably, "real data" is in fact NOT high dimensional, or is not generated by a process with many parameters, and so sparsity-focused methods like compressed sensing start playing a role.<br />
<br />
But it gets even more interesting. Maybe the data is low-dimensional, but doesn't actually lie in a subspace. This gets you into manifold learning and variants: the data lies on a low-dimensional curved sheet of some kind, and you need to learn<br />
on that space.<br />
<br />
While the challenge for geometry (and algorithms) is to keep up with the new data models, the challenge for data analysts is to design data models that are realistic and workable.<br />
<br />
So what does this have to do with deep learning?<br />
<br />
Deep learning networks "work" in that they appear to be able to identify interesting semantic structures in data that can be quite noisy. But to me it's not entirely clear why that is [..].<br />
<br />
A central idea of [Deep Learning] work is that deep belief networks can be trained "layer by layer", where each layer uses features identified from the previous layer.<br />
<br />
If you stare at these things long enough, you begin to see a picture not of sparse data, or low-rank data, or even manifold data. What you see is a certain hierarchical collection of subspaces, where low-dimensional spaces interact in a low dimensional way to form higher level spaces, and so on. So you might have a low-level "lip" feature described by a collection of 2-3 dimensional noisy subspaces in an image space. These "lip" features in turn combine with "eye" features and so on.<br />
<br /></div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-793173099270526112012-11-13T23:56:00.000-08:002012-11-14T02:28:57.042-08:00The Data Science Loop<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<a href="http://seanjtaylor.com/2012/09/18/the-data-science-loop/">Link </a><br />
<br />
Ask a good question.<br />
<br />
Answer the question while economizing on resources.<br />
<br />
Communicate your results.<br />
<br />
(Sometimes) Make recommendations to engineers or managers.<br />
<br />
Asking a good question is probably the hardest thing to get right. If
you neglect this step, you'll spend days of your life working on
something that will have little impact. It's a skill that people who
focus on technical training tend to be bad at [..].<br />
<br />
The real art to asking good questions is to <i>consider your audience</i>.
Who is going to be interested in the results and why are they going to
care? I find that the best questions have punchy answers, are usually
interesting to everyone, and usually affect a potential decision. On
the last point, the key is to think about how someone within your
organization might change their strategy due to your answer.<br />
<br />
Effectively answering questions is where technical skills become
important. It's easy to get caught up in fancy algorithms and methods,
but those approaches are usually <a class="reference external" href="http://c2.com/cgi/wiki?PrematureOptimization">premature optimizations</a>.
The best answers are 1) cheap and 2) easy to explain. Give me a table
of counts or event rates over regression coefficients or the first
eigenvector of your matrix decomposition. Perhaps it's a bit modest,
but I often describe data science as "advanced applied counting." [..] <br />
<br />
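"Advanced applied counting" can be taken quite literally; here is a sketch of the kind of counts-and-rates table meant above, on made-up event data (the segments and events are hypothetical):

```python
from collections import Counter

def conversion_table(events):
    """Event counts and rates per segment: the 'table of counts'
    preferred over regression coefficients or eigenvectors."""
    n = Counter(seg for seg, _ in events)
    hits = Counter(seg for seg, ok in events if ok)
    return {seg: (hits[seg], n[seg], hits[seg] / n[seg]) for seg in n}

# hypothetical (user_segment, converted) event log
events = [('new', True), ('new', False), ('new', False),
          ('returning', True), ('returning', True), ('returning', False)]
```

The output answers "who converts, and how often?" in a form anyone in the organization can read, which is the whole point.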
Fancy, new, and complicated are usually bad qualities for a method. Take it from <a class="reference external" href="https://twitter.com/jaykreps/status/219977241839411200">Jay Kreps</a>, "read current 3-5 pubs and note the stupid simple thing they all claim to beat, implement that."<br />
<br />
The other pattern I notice here is the unreasonable effectiveness of <a class="reference external" href="http://en.wikipedia.org/wiki/How_to_Solve_It">Polya's advice</a>
for solving a math problem, particularly this aphorism: "If you can't
solve a problem, then there is an easier problem you can solve: find
it." Paraphrased for data scientists, if there is a question you can't
answer, there is an easier question you can answer (usually counting
something!). <br />
<br />
I firmly believe that data scientists should not be engineers or
managers. Engineers build things, managers make decisions, data
scientists answer questions. This is not to trivialize the role of data
scientists, who plausibly account for 2/3 of the steps in the <a class="reference external" href="http://lean.st/principles/build-measure-learn">build-measure-learn loop</a>. The answers can (and should) inform decisions that managers make and help engineers build better products, but answers <i>always</i> lead to more (and better!) questions.<br />
<br />
Don't let the data science technical jargon drive your impression of
what is actually done in the field. In my experience, it's a research
job where you have autonomy to ask and answer some really interesting
questions. The fundamental challenge is being savvy enough to pick good
questions and find concise answers using minimal resources. Then you
must convince everyone to listen to you about what you found. In many
ways it's similar to academic research, but the differences are that the
cycle is tighter and your answers will often effect changes in the
business. <br />
<br /></div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-9099856873526216852012-08-15T14:10:00.003-07:002012-08-18T06:01:14.560-07:00Using T-Mobile USB Modem on Ubuntu in Germany<div dir="ltr" style="text-align: left;" trbidi="on">
The modem is Mobilcom Debitel.<br />
<br />
First install<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">sudo apt-get install usb-modeswitch usb-modeswitch-data wvdial</span><br />
<br />
Your /etc/wvdial.conf should contain<br />
<div style="font-family: "Courier New",Courier,monospace;">
<br /></div>
<div style="font-family: "Courier New",Courier,monospace;">
[Dialer Defaults]</div>
<div style="font-family: "Courier New",Courier,monospace;">
Phone = *99#</div>
<div style="font-family: "Courier New",Courier,monospace;">
Username = t-mobile</div>
<div style="font-family: "Courier New",Courier,monospace;">
Password = tm</div>
<div style="font-family: "Courier New",Courier,monospace;">
Stupid Mode = 1</div>
<div style="font-family: "Courier New",Courier,monospace;">
Dial Command = ATDT</div>
<div style="font-family: "Courier New",Courier,monospace;">
Modem = /dev/ttyUSB2</div>
<div style="font-family: "Courier New",Courier,monospace;">
<br /></div>
<div style="font-family: "Courier New",Courier,monospace;">
[Dialer tmo]</div>
<div style="font-family: "Courier New",Courier,monospace;">
Modem = /dev/ttyUSB2</div>
<div style="font-family: "Courier New",Courier,monospace;">
Baud = 460800</div>
<div style="font-family: "Courier New",Courier,monospace;">
Init1 = ATZ</div>
<div style="font-family: "Courier New",Courier,monospace;">
Init2 = ATQ0 V1 E1 S0=0 &C1 &D2 +FCLASS=0</div>
<div style="font-family: "Courier New",Courier,monospace;">
ISDN = 0</div>
<div style="font-family: "Courier New",Courier,monospace;">
Modem Type = Analog Modem</div>
<br />
Type<br />
<br />
<div style="font-family: "Courier New",Courier,monospace;">
usb-devices</div>
<br />
Look at the list and find out vendor and product id. They will be used for -v and -p respectively. <br />
<br />
<div style="font-family: "Courier New",Courier,monospace;">
sudo usb_modeswitch -v [VENDOR] -p [PROD ID] -M '55534243123456780000000080000606f50402527000000000000000000000'</div>
<br />
<div style="font-family: "Courier New",Courier,monospace;">
sudo modprobe option</div>
<div style="font-family: "Courier New",Courier,monospace;">
<br /></div>
<div style="font-family: "Courier New",Courier,monospace;">
echo "1c9e [PRODUCT]" | sudo tee /sys/bus/usb-serial/drivers/option1/new_id</div>
<br />
Now<br />
<div style="font-family: "Courier New",Courier,monospace;">
<br /></div>
<div style="font-family: "Courier New",Courier,monospace;">
sudo wvdial tmo</div>
<br />
A couple of times I had to run this twice; on those occasions a dialog box would open and I had to enter my (T-Mobile) PIN, and it said it "unlocked" the PIN. After that, I didn't have to do it again. <br />
<br />
Some postings on the Internet suggest editing the Ubuntu network connections, adding a connection for Mobile Broadband (a separate tab next to Wireless Network), and configuring things there. I did not need this; using the commands above seems to suffice.</div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-14415363917081076792012-03-03T05:17:00.002-08:002012-08-12T09:59:22.830-07:00Skillicorn Data Mining Book Matlab Code, Data<div dir="ltr" style="text-align: left;" trbidi="on">
We are trying to collect all relevant data and code for Skillicorn's <span style="font-style: italic;">Understanding Complex Datasets with Matrix Decomposition</span> book. We follow the links shared in the bibliography and retrieve the relevant code and data when possible. What we have found so far is at the link below; the collection will grow as we find more.<br />
<br />
<a href="https://github.com/burakbayramli/kod/tree/master/books/skillicorn_data_mining">Link</a></div>Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-84054238914819334112011-12-21T14:44:00.001-08:002012-08-15T23:56:26.282-07:00Automatic PDF Form Filler<div dir="ltr" style="text-align: left;" trbidi="on">
Filling out forms is one of my least favorite activities; especially for programmers / IT people who are used to doing everything electronically, looking at that form with pen in hand somehow brings time to a crawl. The boxes are always too small, if there are mistakes, you need to reprint, and repeat the whole thing again. By hand.<br />
<br />
Here is a collection of Python scripts that will allow you fill out a PDF form automatically. First, you need to convert the PDF to a collection of jpgs, using<br />
<br />
python convert.py DOC.pdf [target dir]<br />
<br />
In [target dir] you should now see DOC-0.jpg, DOC-1.jpg, etc.<br />
<br />
Then, you need to identify box locations. For that use locs.py<br />
<br />
python locs.py [target dir]/DOC-0.jpg<br />
<br />
This brings up a UI tool; as you click on boxes, the coordinates of those boxes will be written to a [target dir]/DOC-0.jpg.loc file. Make sure you click on the boxes in a logical order; most forms number each box on the page anyway, so you can use that order. The coordinates are written to the loc file as you click, so once you are done, simply shut down locs.py.<br />
<br />
Now in [target dir], start a new file called DOC-0.jpg.fill<br />
<br />
This file will carry the values used to fill out your PDF form. Each line in this file should correspond to the line specified in DOC-0.jpg.loc. The line orders must match. You can manually tell fill.py to skip pixels in the up or down direction by using e.g.<br />
<br />
[down=40]bla bla bla<br />
<br />
You can also use up, left, right commands. If you need to change the font size, e.g. for size 20 use [font=20].<br />
<br />
Once that is done,<br />
<br />
python fill.py [target dir]/DOC-0.jpg<br />
<br />
This will use the loc file, fill file, and generate a final DOC-0.jpg-out.jpg<br />
<br />
In this file you will see stuff from fill file placed in proper coordinates.<br />
<br />
This tool uses ImageMagick, so make sure you install that first. Also, for the necessary Python libraries on Ubuntu you can use<br />
<br />
sudo apt-get install python python-tk idle python-pmw python-imaging python-imaging-tk<br />
<br />
An improvement to this code could be using a vision algorithm to automatically detect the location of each box. There is a certain visual pattern to a form -- words are in straight lines, there are big empty spaces in between, and the whole thing is usually surrounded by lines.<br />
<br />
<a href="https://github.com/burakbayramli/kod/tree/master/formfill">Download</a></div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com1tag:blogger.com,1999:blog-2703091221672544113.post-52641377243368729292011-10-27T04:17:00.000-07:002011-10-27T04:23:30.604-07:00Optflow C++Here is a slimmed-down C++ main program that uses Seppo Pulkkinen's <a href="http://code.google.com/p/optflow">optflow</a> library. This library uses CImg internally.<pre><br />#include "CImg_config.h"<br />#include <CImg.h><br />#include <sstream><br />#include <string><br />#include <cstdio><br /><br />#include "DenseVectorFieldIO.h"<br />#include "DualDenseMotionExtractor.h"<br />#include "PyramidalLucasKanade.h"<br />#include "SparseVectorFieldIO.h"<br />#include "VectorFieldIllustrator.h"<br /><br />using namespace cimg_library;<br /><br />int main() {<br /><br /> CImg< unsigned char > I1("../examples/test1.png");<br /> CImg< unsigned char > I2("../examples/test2.png");<br /><br /> const int W = I1.dimx();<br /> const int H = I1.dimy();<br /> CImg< unsigned char > I1_smoothed;<br /> CImg< unsigned char > I2_smoothed;<br /> CImg< unsigned char > motionImageF(W, H, 1, 3);<br /> CImg< double > VF, VB;<br /><br /> I1_smoothed = I1.get_channel(0);<br /> I2_smoothed = I2.get_channel(0);<br /><br /> motionImageF.get_shared_channel(0) = I1_smoothed * 0.75;<br /> motionImageF.get_shared_channel(1) = I1_smoothed * 0.75;<br /> motionImageF.get_shared_channel(2) = I1_smoothed * 0.75;<br /><br /> I1_smoothed.blur(3.0, 3.0, 3.0);<br /> I2_smoothed.blur(3.0, 3.0, 3.0);<br /><br /> DenseMotionExtractor* e = new PyramidalLucasKanade(8,3,0.0025,0.0,4,true);<br /> e->compute(I1_smoothed, I2_smoothed, VF, VB);<br /> // CImg pixel access is operator(); VF[100,100] would invoke the comma operator<br /> printf("%f\n", VF(100, 100));<br /><br /> delete e;<br /><br /> return 0;<br />}<br /></pre><br />To build, drop this file under lib/, run make to create the shared library, then compile with<br /><br />export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:.<br />/usr/bin/c++ -L. -Doptflow_EXPORTS -fPIC -I. 
-Wall -O2 -frounding-math \<br />-loptflow -o main main.cppBurak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-73438417803522159492011-10-06T03:11:00.000-07:002011-10-06T03:18:02.406-07:00Mumford on Math"Mathematicians believe in this Platonic universe in that, there is a pre-existing bunch of facts which are true and you never invent anything, you are discovering".<br /><br /><iframe src="http://www.youtube.com/embed/7WrhzkBYiwM" allowfullscreen="" frameborder="0" height="315" width="420"></iframe>Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-47695114720536808702011-10-01T05:06:00.000-07:002017-10-28T05:22:58.941-07:00Optical Flow, Lucas Kanade in Python<div dir="ltr" style="text-align: left;" trbidi="on">
Following is the Lucas-Kanade optical flow algorithm in Python. We used it successfully on two PNG images, as well as through OpenCV to follow a point across successive frames. More details are at <a href="https://github.com/burakbayramli/classnotes/tree/master/pde/pde_lk">Github</a>.<br />
<pre>import numpy as np
import scipy.signal as si
from PIL import Image

def gauss_kern():
    # 15x15 Gaussian smoothing kernel
    h1 = 15
    h2 = 15
    x, y = np.mgrid[0:h2, 0:h1]
    x = x - h2 / 2
    y = y - h1 / 2
    sigma = 1.5
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return g / g.sum()

def deriv(im1, im2):
    g = gauss_kern()
    Img_smooth = si.convolve(im1, g, mode='same')
    fx, fy = np.gradient(Img_smooth)
    # temporal derivative: difference of the two frames, box-filtered
    ft = si.convolve2d(im1, 0.25 * np.ones((2, 2))) + \
         si.convolve2d(im2, -0.25 * np.ones((2, 2)))
    fx = fx[0:fx.shape[0]-1, 0:fx.shape[1]-1]
    fy = fy[0:fy.shape[0]-1, 0:fy.shape[1]-1]
    ft = ft[0:ft.shape[0]-1, 0:ft.shape[1]-1]
    return fx, fy, ft

# --- second file: uses the deriv module above ---
import numpy as np
import numpy.linalg as lin
import deriv

def lk(im1, im2, i, j, window_size):
    fx, fy, ft = deriv.deriv(im1, im2)
    halfWindow = int(np.floor(window_size / 2))
    curFx = fx[i-halfWindow-1:i+halfWindow,
               j-halfWindow-1:j+halfWindow]
    curFy = fy[i-halfWindow-1:i+halfWindow,
               j-halfWindow-1:j+halfWindow]
    curFt = ft[i-halfWindow-1:i+halfWindow,
               j-halfWindow-1:j+halfWindow]
    curFx = curFx.T.flatten(order='F')
    curFy = curFy.T.flatten(order='F')
    curFt = -curFt.T.flatten(order='F')
    # least-squares solution of A [u v]^T = -ft via the normal equations
    A = np.vstack((curFx, curFy)).T
    U = np.dot(np.dot(lin.pinv(np.dot(A.T, A)), A.T), curFt)
    return U[0], U[1]</pre>
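As a sanity check, here is a self-contained NumPy sketch of the same least-squares step (the pinv expression in lk above is the normal-equations form of an ordinary least-squares solve). The lk_point helper and the synthetic blob are illustrative inventions for this check, not part of the original code:

```python
import numpy as np

def lk_point(im1, im2, i, j, w):
    """Least-squares flow estimate [u, v] at pixel (i, j), window half-size w."""
    fx, fy = np.gradient(im1)   # spatial derivatives (rows, cols)
    ft = im2 - im1              # temporal derivative
    sl = (slice(i - w, i + w + 1), slice(j - w, j + w + 1))
    A = np.vstack((fx[sl].ravel(), fy[sl].ravel())).T
    U, _, _, _ = np.linalg.lstsq(A, -ft[sl].ravel(), rcond=None)
    return U

# synthetic check: a Gaussian blob translated by one pixel along the rows
x, y = np.mgrid[0:64, 0:64].astype(float)
blob = lambda cx, cy: np.exp(-((x - cx)**2 + (y - cy)**2) / 50.0)
I1, I2 = blob(32, 32), blob(33, 32)
u, v = lk_point(I1, I2, 32, 32, 7)   # u should come out close to 1, v close to 0
```

The recovered u is close to, but not exactly, 1.0: Lucas-Kanade linearizes the brightness-constancy constraint, so finite displacements carry a small discretization bias.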
</div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com8tag:blogger.com,1999:blog-2703091221672544113.post-45747192045815921442011-08-22T07:00:00.001-07:002011-08-29T12:24:24.829-07:00Plotting a Complex ExponentialWe rewrote one of the MIT OCW 18.03 ODE <a href="http://math.mit.edu/mathlets/mathlets/complex-exponential/">Mathlets</a> in Python. This mathlet was for plotting complex exponentials.
<br /><pre class="prettyprint">from pylab import *
from matplotlib.widgets import Slider

ax = subplot(121)
subplots_adjust(left=0.1, bottom=0.25)
l1, = plot([], [], lw=2, color='red')
axis([-1, 1, -8, 8])
title('$(a + bi)t$', color='blue')
grid()

ax = subplot(122)
subplots_adjust(left=0.1, bottom=0.25)
l2, = plot([], [], lw=2, color='red')
axis([-3, 3, -3, 3])
title('$e^{(a + bi)t}$', color='blue')
grid()

axcolor = 'lightgoldenrodyellow'
# 'facecolor' replaces the deprecated 'axisbg' keyword
axa = axes([0.15, 0.1, 0.65, 0.03], facecolor=axcolor)
axb = axes([0.15, 0.15, 0.65, 0.03], facecolor=axcolor)

slidera = Slider(axa, 'a', -1.0, 1.0, valinit=0)
sliderb = Slider(axb, 'b', -8.0, 8.0, valinit=0)

def update(val):
    a = slidera.val
    b = sliderb.val
    t = arange(-1.0, 1.0, 0.001)
    l1.set_xdata(t)
    # left panel: the line (a+bi)t in the complex plane, slope b/a
    if a != 0:
        l1.set_ydata((b/a)*t)

    t = arange(-3.0, 3.0, 0.001)
    l2.set_xdata(exp(a*t)*cos(b*t))
    l2.set_ydata(exp(a*t)*sin(b*t))
    draw()

slidera.on_changed(update)
sliderb.on_changed(update)

show()
</pre><div style="text-align: center;"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrHKkc6dViuLm-_2pB6aUhlFGzWzcwiLkL-b7zu8WFHVDzrXxoCJJcWLyYMV4doCnOxAP1zzwkdyH4a4tZTsntAUz44MUqxMGP37h540B06Upl1q4gZGL608D7jp75idFp3ycsoPCl29Q/s400/compexp.png" width="300px" />
<br /></div>
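The curve in the right panel is just the real and imaginary parts of the complex exponential. A quick NumPy check of Euler's formula (the sample values a=0.5, b=3 are arbitrary) confirms the parametrization the update callback feeds to the plot:

```python
import numpy as np

a, b = 0.5, 3.0
t = np.arange(-3.0, 3.0, 0.001)
z = np.exp((a + 1j * b) * t)   # the complex exponential itself
# Euler: e^{(a+bi)t} = e^{at} cos(bt) + i e^{at} sin(bt),
# exactly the x and y data drawn in the right subplot
x_part = np.exp(a * t) * np.cos(b * t)
y_part = np.exp(a * t) * np.sin(b * t)
```

So for a < 0 the curve spirals inward, for a > 0 outward, and for a = 0 it stays on the unit circle.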
<br />Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-34515168153212016172011-07-23T09:18:00.001-07:002012-08-16T15:43:10.536-07:00Clustering, Image Segmentation, Eigenvectors and Python<div dir="ltr" style="text-align: left;" trbidi="on">
Here is example code for eigenvector-based segmentation in Python. For more details, see the <a href="https://github.com/burakbayramli/classnotes/tree/master/app-math-tr/eigseg">code</a> here.<br />
<pre>import matplotlib.pyplot as plt
import numpy as np

Img = plt.imread("twoObj.jpg")
n = Img.shape[0]  # image assumed square, n x n
Img2 = Img.flatten(order='C')
nn = Img2.shape[0]
# affinity matrix: pixel pairs with similar intensities get weights near 1
A = np.zeros((nn,nn))
for i in range(nn):
    for j in range(nn):
        A[i,j] = np.exp(-((Img2[i]-Img2[j])**2))
V,D = np.linalg.eig(A)
V = np.real(V)
a = np.real(D[:,0])  # first eigenvector; eig returns eigenvectors as columns
print a
threshold = 0  # filter
a = np.reshape(a, (n,n))
Img[a&lt;threshold] = 255
plt.imshow(Img)
plt.show()</pre>
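To see why the leading eigenvector of the affinity matrix separates the regions, here is a tiny self-contained version of the same construction. The 4x4 toy image is our own; we use eigh since the affinity matrix is symmetric, and take the eigenvector as a column (eig and eigh both return eigenvectors column-wise):

```python
import numpy as np

# toy 4x4 "image": a bright 2x2 block on a dark background
img = np.zeros((4, 4))
img[1:3, 1:3] = 1.0
v = img.flatten()

# same affinity as above, vectorized: similar intensities -> weight near 1
A = np.exp(-(v[:, None] - v[None, :])**2)
w, E = np.linalg.eigh(A)   # symmetric matrix; eigenvalues in ascending order
lead = E[:, -1]            # eigenvector of the largest eigenvalue
labels = lead.reshape(4, 4) > lead.mean()
```

The eigenvector is constant within each intensity group, so thresholding it at its mean puts the bright block on one side and the background on the other (up to an overall sign, which eigensolvers choose arbitrarily).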
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEik1OqIn7kFHmzlS15I0bQQiMc1__9c7hyZ3614YlwWQlEP2NTa7OGxFZQYFbR0StWe-F5IMbfM08JiwBF8FF41UhXFM4YWYvf5LMj2wwEahVKTKrgz-weOHxDesfVjHbfvTjANWDo3XPg/s1600/twoObj.jpg"><img alt="" border="0" id="BLOGGER_PHOTO_ID_5632583524549326018" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEik1OqIn7kFHmzlS15I0bQQiMc1__9c7hyZ3614YlwWQlEP2NTa7OGxFZQYFbR0StWe-F5IMbfM08JiwBF8FF41UhXFM4YWYvf5LMj2wwEahVKTKrgz-weOHxDesfVjHbfvTjANWDo3XPg/s400/twoObj.jpg" style="cursor: pointer; display: block; height: 62px; margin: 0px auto 10px; text-align: center; width: 62px;" /></a><br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjm9xLFPupjfh_lI8YqXSCNGwl0oXy7yvLJrYpnpkSYYZ279gJf2thDP5_lR_VpsLpFa_Rx89Mc72NFwqPerFn65D6JM1qdVBX3cdS0u8Rl-OLsCvZ37aSTxJlBee1pzwInS5pb1kqH7tE/s1600/eigseg.png"><img alt="" border="0" id="BLOGGER_PHOTO_ID_5632583623387579154" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjm9xLFPupjfh_lI8YqXSCNGwl0oXy7yvLJrYpnpkSYYZ279gJf2thDP5_lR_VpsLpFa_Rx89Mc72NFwqPerFn65D6JM1qdVBX3cdS0u8Rl-OLsCvZ37aSTxJlBee1pzwInS5pb1kqH7tE/s400/eigseg.png" style="cursor: pointer; display: block; height: 80px; margin: 0px auto 10px; text-align: center; width: 107px;" /></a></div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0tag:blogger.com,1999:blog-2703091221672544113.post-26211116070204775322011-04-09T04:01:00.001-07:002017-10-28T05:56:22.880-07:00Myers-Briggs Test in Javascript / Python<div dir="ltr" style="text-align: left;" trbidi="on">
Hunch.com apparently uses this method. The Myers-Briggs Test is a psychological profile evaluation system, later expanded upon by David Keirsey. We coded the evaluation scheme in Python, with a few additions. In the original version in David Keirsey's book <span style="font-style: italic;">Please Understand Me II</span>, the answer to each question is either A or B; the Python code represents these choices as -1 and +1 and sums the appropriate array values. Recent versions of this questionnaire carry more (sometimes even five) choices. We found that an additional 'neutral' choice was an improvement, so our version carries three answers. The evaluation code substitutes -1 for A, +1 for B, and 0 for neutral.<br />
<br />
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk113qxrNemyevyVmyset79LiduUBq9FnuLBnQrfN653ojzC5TImVOxZh04a7vjtPkk-GKfPjA7fidcH7kFx5FhRXKw4FrVoBkI_axrtRJMt8EyZ8P8NEnCG8nRYh_EAZMY6_jz9R0Ues/s1600/myersbriggs.png"><img alt="" border="0" id="BLOGGER_PHOTO_ID_5593539965521546706" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhk113qxrNemyevyVmyset79LiduUBq9FnuLBnQrfN653ojzC5TImVOxZh04a7vjtPkk-GKfPjA7fidcH7kFx5FhRXKw4FrVoBkI_axrtRJMt8EyZ8P8NEnCG8nRYh_EAZMY6_jz9R0Ues/s400/myersbriggs.png" style="cursor: pointer; display: block; height: 374px; margin: 0px auto 10px; text-align: center; width: 400px;" /></a>An example of the evaluation algorithm Keirsey uses in his book is above. We simply generate indexes that correspond to the columns seen above (answers arrive in a straight list, numbered from 1 to 70), then do the addition.<br />
<pre>import logging

def calculate_mb(choices):
    # regroup the 70 answers into the 7 scoring columns of Keirsey's sheet
    new_choices = []
    for i in range(1,8):
        new_choices.append([int(choices[j-1]) for j in range(i,71,7)])
    res = list("XXXX")
    ei = sum(new_choices[0])
    if ei &lt; 0: res[0] = 'E'
    else: res[0] = 'I'
    sn = sum(new_choices[1]) + sum(new_choices[2])
    if sn &lt; 0: res[1] = 'S'
    else: res[1] = 'N'
    tf = sum(new_choices[3]) + sum(new_choices[4])
    if tf &lt; 0: res[2] = 'T'
    else: res[2] = 'F'
    jp = sum(new_choices[5]) + sum(new_choices[6])
    if jp &lt; 0: res[3] = 'J'
    else: res[3] = 'P'
    logging.debug(choices)
    return str(''.join(res))</pre>
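The index generation in calculate_mb can be checked on its own: each of the seven scoring columns picks every seventh question, ten questions per column, covering all 70 questions exactly once.

```python
# the scoring columns: question j feeds column ((j - 1) % 7) + 1
cols = [list(range(i, 71, 7)) for i in range(1, 8)]
# e.g. column 1 (the E/I questions) is 1, 8, 15, ..., 64
```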
<br />
Another version, in HTML using JavaScript, can be found <a href="https://github.com/burakbayramli/kod/blob/master/guide/doc/mbti_en.html">here</a>.<br />
<br /></div>
Burak Bayramlihttp://www.blogger.com/profile/02849512629200782790noreply@blogger.com0