Tuesday, December 18, 2012

Geometry, Machine Learning and Deep Learning


Over the years [..] the way we construct mental models of data has changed. And as I've argued before, understanding how we think about data, and what shape we give it, is key to the whole enterprise of finding patterns in data.

The model that one always starts with is Euclidean space. Data = points, features = dimensions, and so on. And as a first approximation of a data model, it isn't terrible.

There are many ways to modify this space. You can replace the ℓ2 norm by ℓ1. You can normalize the points (again with ℓ2 or ℓ1, sending you to the sphere or the simplex). You can weight the dimensions, or even do a wholesale scale-rotation.
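As a minimal sketch of the modifications listed above, the snippet below normalizes a point under each norm and applies per-dimension weights. The function names and the sample point are illustrative assumptions, not taken from any particular library.

```python
import math

def l1_norm(x):
    """Sum of absolute coordinate values."""
    return sum(abs(v) for v in x)

def l2_norm(x):
    """Ordinary Euclidean length."""
    return math.sqrt(sum(v * v for v in x))

def normalize(x, norm):
    """Scale x so that norm(x) == 1."""
    n = norm(x)
    if n == 0:
        raise ValueError("cannot normalize the zero vector")
    return [v / n for v in x]

def weight(x, w):
    """Rescale each dimension of x by the matching weight in w."""
    return [wi * xi for wi, xi in zip(w, x)]

point = [3.0, 4.0]                      # hypothetical data point
on_sphere = normalize(point, l2_norm)   # lies on the unit circle
# l1 normalization lands on the simplex only for non-negative coordinates
on_simplex = normalize(point, l1_norm)
rescaled = weight(point, [2.0, 0.5])    # hypothetical feature weights
```

The data is the same in all three cases; what changes is the geometry, and so which points count as "close".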

But that's not all. Kernels take this to another level. You can encode weak nonlinearity in the data by assuming that it's flat once you lift it. In a sense, the lifted space is still an ℓ2 space, but kernels give you a much larger class of spaces to work with. The entire SVM enterprise was founded on this principle.
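To make the "lifting" idea concrete, here is a small sketch using the degree-2 polynomial kernel on 2-D points, chosen only because its lifted space is easy to write down. The names `poly2_kernel` and `lift` are assumptions for illustration.

```python
import math

def dot(x, y):
    """Ordinary inner product."""
    return sum(a * b for a, b in zip(x, y))

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return dot(x, y) ** 2

def lift(x):
    """Explicit feature map whose inner product equals poly2_kernel."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

x, y = [1.0, 2.0], [3.0, -1.0]          # hypothetical points
k_direct = poly2_kernel(x, y)           # no lifting needed
k_lifted = dot(lift(x), lift(y))        # same value, via the 3-D space
```

The two values agree up to floating-point rounding, which is the point: the kernel gives you the inner product of the lifted space without ever constructing it.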

But that's not all either. The curse of dimensionality means that it's difficult to find patterns in such high dimensional data. Arguably, "real data" is in fact NOT high dimensional, or is not generated by a process with many parameters, and so sparsity-focused methods like compressed sensing start playing a role.

But it gets even more interesting. Maybe the data is low-dimensional, but doesn't actually lie in a subspace. This gets you into manifold learning and variants: the data lies on a low-dimensional curved sheet of some kind, and you need to learn on that space.

While the challenge for geometry (and algorithms) is to keep up with the new data models, the challenge for data analysts is to design data models that are realistic and workable.

So what does this have to do with deep learning?

Deep learning networks "work" in that they appear to be able to identify interesting semantic structures in data that can be quite noisy. But to me it's not entirely clear why that is [..].

A central idea of [Deep Learning] work is that deep belief networks can be trained "layer by layer", where each layer uses features identified from the previous layer.

If you stare at these things long enough, you begin to see a picture not of sparse data, or low-rank data, or even manifold data. What you see is a certain hierarchical collection of subspaces, where low-dimensional spaces interact in a low dimensional way to form higher level spaces, and so on. So you might have a low-level "lip" feature described by a collection of 2-3 dimensional noisy subspaces in an image space. These "lip" features in turn combine with "eye" features and so on.

Tuesday, November 13, 2012

The Data Science Loop


Ask a good question.

Answer the question while economizing on resources.

Communicate your results.

(Sometimes) Make recommendations to engineers or managers.

Asking a good question is probably the hardest thing to get right. If you neglect this step, you'll spend days of your life working on something that will have little impact. It's a skill that people who focus on technical training tend to be bad at [..].

The real art to asking good questions is to consider your audience. Who is going to be interested in the results and why are they going to care? I find that the best questions have punchy answers, are usually interesting to everyone, and usually affect a potential decision. On the last point, the key is to think about how someone within your organization might change their strategy due to your answer.

Effectively answering questions is where technical skills become important. It's easy to get caught up in fancy algorithms and methods, but those approaches are usually premature optimizations. The best answers are 1) cheap and 2) easy to explain. Give me a table of counts or event rates over regression coefficients or the first eigenvector of your matrix decomposition. Perhaps it's a bit modest, but I often describe data science as "advanced applied counting." [..]
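A table of counts and event rates really can be this small. The sketch below uses a made-up event log; the segment names and the `event_rates` helper are hypothetical.

```python
from collections import Counter

# Hypothetical event log: (user segment, whether the user converted)
events = [
    ("mobile", True), ("mobile", False), ("mobile", False),
    ("desktop", True), ("desktop", True), ("desktop", False),
]

def event_rates(rows):
    """Return {segment: (hits, total, hits / total)} from (segment, flag) rows."""
    totals = Counter(seg for seg, _ in rows)
    hits = Counter(seg for seg, ok in rows if ok)
    return {seg: (hits[seg], totals[seg], hits[seg] / totals[seg])
            for seg in totals}

rates = event_rates(events)
for seg, (hit, total, rate) in sorted(rates.items()):
    print(f"{seg:<8} {hit:>3}/{total:<3} {rate:.2f}")
```

Nothing here needs a model or a fitting step, and every number in the table can be explained in one sentence.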

Fancy, new, and complicated are usually bad qualities for a method. Take it from Jay Kreps, "read current 3-5 pubs and note the stupid simple thing they all claim to beat, implement that."

The other pattern I notice here is the unreasonable effectiveness of Polya's advice for solving a math problem, particularly this aphorism: "If you can't solve a problem, then there is an easier problem you can solve: find it." Paraphrased for data scientists, if there is a question you can't answer, there is an easier question you can answer (usually counting something!).

I firmly believe that data scientists should not be engineers or managers. Engineers build things, managers make decisions, data scientists answer questions. This is not to trivialize the role of data scientists, who plausibly account for 2/3 of the steps in the build-measure-learn loop. The answers can (and should) inform decisions that managers make and help engineers build better products, but answers always lead to more (and better!) questions.

Don't let the data science technical jargon drive your impression of what is actually done in the field. In my experience, it's a research job where you have autonomy to ask and answer some really interesting questions. The fundamental challenge is being savvy enough to pick good questions and find concise answers using minimal resources. Then you must convince everyone to listen to you about what you found. In many ways it's similar to academic research, but the differences are that the cycle is tighter and your answers will often effect changes in the business.

Wednesday, August 15, 2012

Using T-Mobile USB Modem on Ubuntu in Germany

The modem is Mobilcom Debitel.

First install

sudo apt-get install usb-modeswitch usb-modeswitch-data wvdial

Your /etc/wvdial.conf should contain

[Dialer Defaults]
Phone = *99#
Username = t-mobile
Password = tm
Stupid Mode = 1
Dial Command = ATDT
Modem = /dev/ttyUSB2

[Dialer tmo]
Modem = /dev/ttyUSB2
Baud = 460800
Init1 = ATZ
Init2 = ATQ0 V1 E1 S0=0 &C1 &D2 +FCLASS=0
ISDN = 0
Modem Type = Analog Modem



Look at the list of USB devices (for example, the output of lsusb) and find the vendor and product IDs. They will be used for -v and -p respectively.
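If you would rather not read the IDs off by eye, the sketch below pulls them out of one line of device-list text. It assumes the list is the output of lsusb, which prints an "ID vvvv:pppp" field; the sample line and its product ID are hypothetical.

```python
import re

# Matches the "ID vvvv:pppp" field (four hex digits each) printed by lsusb
ID_PATTERN = re.compile(r"\bID ([0-9a-fA-F]{4}):([0-9a-fA-F]{4})\b")

def parse_usb_id(line):
    """Return (vendor, product) as lowercase hex strings, or None if absent."""
    match = ID_PATTERN.search(line)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).lower()

# Hypothetical sample line; the product ID f000 is made up for illustration
sample = "Bus 001 Device 004: ID 1c9e:f000 Hypothetical Modem"
vendor, product = parse_usb_id(sample)
```

The first value goes to -v and the second to -p in the usb_modeswitch call.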

sudo usb_modeswitch -v [VENDOR] -p [PROD ID] -M '55534243123456780000000080000606f50402527000000000000000000000'

sudo modprobe option

echo "1c9e [PRODUCT]" | sudo tee /sys/bus/usb-serial/drivers/option1/new_id


sudo wvdial tmo

A couple of times I had to run this twice. On those occasions a dialog box opened and asked for my (T-Mobile) PIN, then reported that it had "unlocked" the PIN; after that, I didn't have to do it again.

Some postings on the Internet suggest editing the Ubuntu network connections, adding a connection for Mobile Broadband (a separate tab next to Wireless Network), and setting things there. I did not need this; the commands above seem to suffice.

Saturday, March 3, 2012

Skillicorn Data Mining Book Matlab Code, Data

We are trying to collect all relevant data and code for Skillicorn's Understanding Complex Datasets with Matrix Decomposition book. We follow the links shared in the bibliography and fetch the relevant code and data when possible. What we have found so far is in the zip below; it will grow as we find more.