A Scratchpad: 2013

Thursday, September 5, 2013

Google Paper and Data Science Tips

Great paper by Google with great tips.

Thursday, August 22, 2013

What Hackers Should Know About Machine Learning

Link

Data analysis as an exploratory endeavor should be the first part of anything. You should never go into a project and say “The thing that I want to do is classification so I'm always going to run my favorite classification algorithm.” For the first half of the book we talk about “Here's a dataset, here's how to clean it up.” The chapters that John Miles White wrote on means, medians, modes, and distributions are always the things that you should do in the beginning. We want to hammer home that it's not just input-output. Input, look around, see what's going on, find structure in the data, then make the choice for methods. And then maybe iterate a couple of them. It's very cyclic. It's not linear [..]
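As a note to self, the "look around first" step can be as small as the sketch below. It only uses Python's standard `statistics` module; the `summarize` helper and the sample values are made up for illustration and are not from the book or the interview.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
    }

# Hypothetical sample with one suspicious value at the end.
sample = [12, 15, 15, 14, 13, 15, 400]
summary = summarize(sample)

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```

A mean far above the median, as here, is a cheap hint that something in the data deserves a closer look before any method is chosen.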

My thinking has evolved on presenting results. The way I think about presenting results now is always in the browser as an interactive thing. There's a tremendous amount of value in providing the audience with the ability to ask second-order questions about what they are observing rather than first-order ones. Imagine the thing you are looking at is just a simple scatterplot and you see one outlier. So a first-order question would be who is that outlier? If you have an interactive thing where you can go over the dot and it tells you who that is, then the second-order question is why is that an outlier?

Monday, August 5, 2013

10 Best Practices in Operational Analytics

Great set of slides on ensembles, feature engineering, and data preparation.

Wednesday, July 10, 2013

Data Agnosticism: Feature Engineering Without Domain Expertise

Thursday, May 2, 2013

Getting to know your data

Witten, Data Mining, Practical Machine Learning Tools and Techniques, pg 60

There is no substitute for getting to know your data. Simple tools that show histograms of the distribution of values of nominal attributes, and graphs of the values of numeric attributes (perhaps sorted or simply graphed against instance number), are very helpful. These graphical visualizations of the data make it easy to identify outliers, which may well represent errors in the data file—or arcane conventions for coding unusual situations, such as a missing year as 9999 or a missing weight as -1 kg, that no one has thought to tell you about. Domain experts need to be consulted to explain anomalies, missing values, the significance of integers that represent categories rather than numeric quantities, and so on. Pairwise plots of one attribute against another, or each attribute against the class value, can be extremely revealing.

Data cleaning is a time-consuming and labor-intensive procedure but one that is absolutely necessary for successful data mining. With a large dataset, people often give up—how can they possibly check it all? Instead, you should sample a few instances and examine them carefully. You’ll be surprised at what you find. Time looking at your data is always well spent.
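A rough sketch of what the passage suggests, using only the standard library: count the values of a nominal attribute, flag sentinel codes such as 9999 or -1, and pull a small random sample to read by hand. The records, field names, and helper functions are all hypothetical; in practice the list of sentinel codes would come from a domain expert.

```python
import random
from collections import Counter

# Hypothetical records; 9999 and -1 stand in for undocumented "missing" codes.
records = [
    {"year": 1998, "weight_kg": 71, "color": "red"},
    {"year": 2003, "weight_kg": -1, "color": "blue"},
    {"year": 9999, "weight_kg": 64, "color": "red"},
    {"year": 2001, "weight_kg": 80, "color": "green"},
    {"year": 1995, "weight_kg": 58, "color": "red"},
]

def value_counts(rows, field):
    """Count how often each value of a nominal attribute occurs."""
    return Counter(row[field] for row in rows)

def text_histogram(counts):
    """Render counts as one text bar per value, most common first."""
    return [f"{str(value):<8} {'#' * count}" for value, count in counts.most_common()]

def suspicious_rows(rows, field, sentinels=(9999, -1)):
    """Return rows whose value in `field` matches an assumed sentinel code."""
    return [row for row in rows if row[field] in sentinels]

def sample_rows(rows, k, seed=0):
    """Pick up to k rows at random (seeded so the sample is repeatable)."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))

print("\n".join(text_histogram(value_counts(records, "color"))))
print("suspicious years:", suspicious_rows(records, "year"))
print("suspicious weights:", suspicious_rows(records, "weight_kg"))
print("rows to inspect by hand:", sample_rows(records, 2))
```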

Sunday, March 17, 2013

Data Science

Typical data science analyses:

Recommendation engines – increase cross-sell and repeat purchases by identifying other products in which a customer or prospect is likely to be interested

Web analytics – advanced click-stream, golden path analysis, viewer engagement, segmentation, and more.

Cross-channel marketing attribution – move beyond the skewed input of last click analysis to accurately determine campaign impact effectiveness across all channels

Influencer analysis – understand whose actions have impact in the network to encourage the behavior of peers for purchases, attrition, or just engagement.

Wednesday, March 6, 2013

Practical machine learning tricks - KDD 2011

Link

At first glance, this might appear to be a "Hello-World" machine learning problem straight out of a textbook or tutorial: we simply train a Naive Bayes on a set of bad ads versus a set of good ones. However this is apparently far from being the case - while Google is understandably shy about hard numbers, the paper mentions several issues which make this especially challenging and notes that this is a business-critical problem for Google.

--

There are many useful suggestions in this post.
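For reference, the "Hello-World" version of the problem that the quote describes looks roughly like the sketch below: a multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. This is the textbook baseline, not the approach in Google's paper, and the ad texts and labels are invented for illustration.

```python
import math
from collections import Counter

def train(docs):
    """Build per-label document and word counts from (text, label) pairs."""
    label_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-posterior for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        counts = word_counts[label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # log prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: two "bad" ads and two "good" ads.
training_ads = [
    ("free miracle pills click now", "bad"),
    ("win free money now", "bad"),
    ("running shoes on sale", "good"),
    ("book flights to paris", "good"),
]
model = train(training_ads)

print(predict(model, "free money pills"))
print(predict(model, "shoes on sale in paris"))
```

The gap between this toy and a production system is the point of the quoted post: the basic classifier is only the starting point.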