Tuesday, November 13, 2012

The Data Science Loop


Link

Ask a good question.

Answer the question while economizing on resources.

Communicate your results.

(Sometimes) Make recommendations to engineers or managers.

Asking a good question is probably the hardest thing to get right. If you neglect this step, you'll spend days of your life working on something that will have little impact. It's a skill that people who focus on technical training tend to be bad at [..].

The real art to asking good questions is to consider your audience. Who is going to be interested in the results and why are they going to care? I find that the best questions have punchy answers, are usually interesting to everyone, and usually affect a potential decision. On the last point, the key is to think about how someone within your organization might change their strategy due to your answer.

Effectively answering questions is where technical skills become important. It's easy to get caught up in fancy algorithms and methods, but those approaches are usually premature optimizations. The best answers are 1) cheap and 2) easy to explain. Give me a table of counts or event rates over regression coefficients or the first eigenvector of your matrix decomposition. Perhaps it's a bit modest, but I often describe data science as "advanced applied counting." [..]

Fancy, new, and complicated are usually bad qualities for a method. Take it from Jay Kreps, "read current 3-5 pubs and note the stupid simple thing they all claim to beat, implement that."

The other pattern I notice here is the unreasonable effectiveness of Polya's advice for solving a math problem, particularly this aphorism: "If you can't solve a problem, then there is an easier problem you can solve: find it." Paraphrased for data scientists, if there is a question you can't answer, there is an easier question you can answer (usually counting something!).
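To make "advanced applied counting" concrete, here is a minimal sketch of a table of counts and event rates. The function name event_rates and the sample data are made up for illustration; they are not from the original post.

```python
from collections import Counter


def event_rates(events):
    """Turn (group, happened) pairs into {group: (count, hits, rate)}."""
    totals = Counter()
    hits = Counter()
    for group, happened in events:
        totals[group] += 1
        if happened:
            hits[group] += 1
    return {g: (totals[g], hits[g], hits[g] / totals[g]) for g in totals}


# Hypothetical data: (signup source, did the user convert?)
sample = [
    ("email", True), ("email", False), ("email", True),
    ("ads", False), ("ads", False), ("ads", True),
]

for group, (n, k, rate) in sorted(event_rates(sample).items()):
    print(f"{group:<6} n={n} converted={k} rate={rate:.2f}")
```

A table like this is cheap to produce and easy to explain, which is the whole point.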

I firmly believe that data scientists should not be engineers or managers. Engineers build things, managers make decisions, data scientists answer questions. This is not to trivialize the role of data scientists, who plausibly account for 2/3 of the steps in the build-measure-learn loop. The answers can (and should) inform decisions that managers make and help engineers build better products, but answers always lead to more (and better!) questions.

Don't let the data science technical jargon drive your impression of what is actually done in the field. In my experience, it's a research job where you have autonomy to ask and answer some really interesting questions. The fundamental challenge is being savvy enough to pick good questions and find concise answers using minimal resources. Then you must convince everyone to listen to you about what you found. In many ways it's similar to academic research, but the differences are that the cycle is tighter and your answers will often effect changes in the business.

Wednesday, August 15, 2012

Using T-Mobile USB Modem on Ubuntu in Germany

The modem is Mobilcom Debitel.

First install

sudo apt-get install usb-modeswitch usb-modeswitch-data wvdial

Your /etc/wvdial.conf should contain

[Dialer Defaults]
Phone = *99#
Username = t-mobile
Password = tm
Stupid Mode = 1
Dial Command = ATDT
Modem = /dev/ttyUSB2

[Dialer tmo]
Modem = /dev/ttyUSB2
Baud = 460800
Init1 = ATZ
Init2 = ATQ0 V1 E1 S0=0 &C1 &D2 +FCLASS=0
ISDN = 0
Modem Type = Analog Modem

Type

usb-devices

Look at the list and find the vendor and product IDs of the modem. They are used for -v and -p respectively; the same product ID also goes into the new_id line below.
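If the list is long, a small script can pull the IDs out. This is only a sketch: it assumes the usual usb-devices layout, where each device has a "P:" line carrying Vendor= and ProdID= and an "S:" line carrying Product=. The function name list_usb_ids is hypothetical, and the modem entry in the sample text (including its product ID abcd) is invented for illustration.

```python
import re

# Invented sample in the usual usb-devices layout; the second device is made up.
SAMPLE = """\
T:  Bus=01 Lev=00 Prnt=00 Port=00 Cnt=00 Dev#=  1 Spd=480 MxCh= 6
P:  Vendor=1d6b ProdID=0002 Rev=03.02
S:  Product=EHCI Host Controller

T:  Bus=01 Lev=01 Prnt=01 Port=02 Cnt=01 Dev#=  4 Spd=480 MxCh= 0
P:  Vendor=1c9e ProdID=abcd Rev=00.00
S:  Manufacturer=Example Vendor
S:  Product=Example Modem
"""


def list_usb_ids(usb_devices_output):
    """Return a list of (vendor, product_id, product_name) tuples."""
    devices = []
    current = None
    for line in usb_devices_output.splitlines():
        m = re.match(r"P:\s+Vendor=([0-9a-fA-F]{4})\s+ProdID=([0-9a-fA-F]{4})", line)
        if m:
            # A "P:" line starts the ID record for a new device.
            current = [m.group(1), m.group(2), ""]
            devices.append(current)
            continue
        m = re.match(r"S:\s+Product=(.*)", line)
        if m and current is not None:
            current[2] = m.group(1).strip()
    return [tuple(d) for d in devices]


for vendor, product, name in list_usb_ids(SAMPLE):
    print(vendor, product, name)
```

To run it on the real listing, pass in the text captured from usb-devices, for example with subprocess.run(["usb-devices"], capture_output=True, text=True).stdout.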

sudo usb_modeswitch -v [VENDOR] -p [PROD ID] -M '55534243123456780000000080000606f50402527000000000000000000000'

sudo modprobe option

echo "1c9e [PRODUCT]" | sudo tee /sys/bus/usb-serial/drivers/option1/new_id

Now

sudo wvdial tmo

A couple of times I had to do this twice. On those occasions a dialog box would open where I had to enter my (T-Mobile) PIN, and it said it "unlocked" the PIN; after that, I didn't have to do it again.

Some postings on the Internet suggest editing the Ubuntu network connections, adding a connection for Mobile Broadband (a separate tab next to Wireless Network), and setting things there. I did not need this; the commands above seem to suffice.

Saturday, March 3, 2012

Skillicorn Data Mining Book Matlab Code, Data

We are trying to collect all relevant data and code for Skillicorn's Understanding Complex Datasets with Matrix Decomposition book. We follow the links shared in the bibliography and fetch the relevant code and data when possible. What we have found so far is in the zip below; it will grow as we find more.

Link

Wednesday, December 21, 2011

Automatic PDF Form Filler

Filling out forms is one of my least favorite activities. Especially for programmers / IT people who are used to doing everything electronically, looking at a form with pen in hand somehow brings time to a crawl. The boxes are always too small, and if there are mistakes you need to reprint and repeat the whole thing again. By hand.

Here is a collection of Python scripts that will allow you to fill out a PDF form automatically. First, you need to convert the PDF to a collection of jpgs, using

python convert.py DOC.pdf [target dir]

In [target dir] you should now see DOC-0.jpg, DOC-1.jpg, etc.
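The internals of convert.py are not shown here. Since the tool depends on ImageMagick, a plausible sketch of this step is to build an ImageMagick convert command, which for a multi-page PDF writes numbered files such as DOC-0.jpg, DOC-1.jpg. The function name build_convert_command and the default density of 150 are my assumptions, not something taken from the actual script.

```python
import os


def build_convert_command(pdf_path, target_dir, density=150):
    """Build an ImageMagick command that renders each PDF page to a jpg.

    Assumption: the classic `convert` binary is on PATH. For a multi-page
    PDF it numbers the outputs itself (DOC-0.jpg, DOC-1.jpg, ...).
    """
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    out_pattern = os.path.join(target_dir, stem + ".jpg")
    return ["convert", "-density", str(density), pdf_path, out_pattern]


print(build_convert_command("DOC.pdf", "out"))
```

The returned list can be handed to subprocess.run(cmd, check=True). Note that for a single-page PDF ImageMagick writes DOC.jpg without a number suffix.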

Then, you need to identify box locations. For that use locs.py

python locs.py [target dir]/DOC-0.jpg

This brings up a UI tool; as you click on boxes, the coordinates of those boxes are written to a [target dir]/DOC-0.jpg.loc file. Make sure you click on the boxes in a logical order; most forms print a number next to each box anyway, so you can use that order, for instance. The coordinates are written to the loc file as you click, so once you are done, simply close locs.py.

Now in [target dir], start a new file called DOC-0.jpg.fill

This file will carry the values used to fill out your PDF form. Each line in this file should correspond to the line at the same position in DOC-0.jpg.loc. The line orders must match. You can manually tell fill.py to shift a value by some number of pixels up or down by using e.g.

[down=40]bla bla bla

You can also use up, left, right commands. If you need to change the font size, e.g. for size 20 use [font=20].
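To illustrate the directive syntax, here is a minimal sketch of how one line of the fill file could be split into its options and its text. This is not the code from fill.py; the name parse_fill_line is hypothetical, and I am assuming that directives sit at the start of the line and may be chained.

```python
import re

# One directive looks like [down=40] or [font=20].
DIRECTIVE = re.compile(r"\[(up|down|left|right|font)=(\d+)\]")


def parse_fill_line(line):
    """Split a fill line into (options, text).

    Leading directives are collected into a dict; everything after the
    last leading directive is the text to place in the box.
    """
    options = {}
    pos = 0
    while True:
        m = DIRECTIVE.match(line, pos)
        if not m:
            break
        options[m.group(1)] = int(m.group(2))
        pos = m.end()
    return options, line[pos:]


print(parse_fill_line("[down=40]bla bla bla"))
print(parse_fill_line("[font=20][left=5]John"))
```

Brackets that appear later in the text are left alone, so a value like "see [1]" still comes through unchanged.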

Once that is done,

python fill.py [target dir]/DOC-0.jpg

This reads the loc file and the fill file and generates a final DOC-0.jpg-out.jpg

In this file you will see the values from the fill file placed at the proper coordinates.

This tool uses ImageMagick, so make sure you install that first. Also, for the necessary Python libraries on Ubuntu you can use

sudo apt-get install python python-tk idle python-pmw python-imaging python-imaging-tk

An improvement to this code could be using a vision algorithm to automatically detect the location of each box. There is a certain visual pattern to a form -- words are in straight lines, there are big empty spaces in between, and the whole thing is usually surrounded by lines.
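As a rough idea of where such a detector could start, the sketch below looks for long horizontal runs of dark pixels, which is what the top and bottom edges of a box look like in a thresholded image. It is a toy on a 0/1 grid, not part of the scripts above; the name horizontal_runs and the min_length parameter are made up for illustration.

```python
def horizontal_runs(rows, min_length):
    """Find runs of dark pixels (1s) that are at least min_length wide.

    rows is a list of rows, each a sequence of 0/1 values.
    Returns (row_index, start_col, end_col) tuples, end_col exclusive.
    """
    found = []
    for y, row in enumerate(rows):
        start = None
        # The trailing 0 is a sentinel that closes a run touching the right edge.
        for x, pixel in enumerate(list(row) + [0]):
            if pixel and start is None:
                start = x
            elif not pixel and start is not None:
                if x - start >= min_length:
                    found.append((y, start, x))
                start = None
    return found


# Tiny hand-made "image": a 4-pixel-wide box outline plus some noise below it.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [1, 1, 0, 0, 0, 0],
]
print(horizontal_runs(image, 3))
```

Two runs with the same column span on different rows are a hint that a box lies between them; a real detector would also need thresholding and some tolerance for skewed scans.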

Download