Thursday, September 5, 2013
Thursday, August 22, 2013
What Hackers Should Know About Machine Learning
Link
Data analysis as an exploratory endeavor should be the first part of anything. You should never go into a project and say “The thing that I want to do is classification so I'm always going to run my favorite classification algorithm.” For the first half of the book we talk about “Here's a dataset, here's how to clean it up.” The chapters that John Myles White wrote on means, medians, modes, and distributions are always the things that you should do in the beginning. We want to hammer home that it's not just input-output. Input, look around, see what's going on, find structure in the data, then make the choice for methods. And then maybe iterate a couple of them. It's very cyclic. It's not linear [..]
My thinking has evolved on presenting results. The way I think about presenting results now is always in the browser as an interactive thing. There's a tremendous amount of value in providing the audience with the ability to ask second-order questions about what they are observing rather than first-order ones. Imagine the thing you are looking at is just a simple scatterplot and you see one outlier. So a first-order question would be: who is that outlier? If you have an interactive thing where you can go over the dot and it tells you who that is, then the second-order question is: why is that an outlier?
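The "look around first" step from the quote above could start with something as small as the following sketch. It uses only Python's standard library; the `sample` values and the helper names `summarize` and `text_histogram` are made up for illustration and are not from the book.

```python
import statistics
from collections import Counter


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),  # every most-common value
        "min": min(values),
        "max": max(values),
    }


def text_histogram(values, bin_width):
    """Count values per fixed-width bin, keyed by each bin's lower edge."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(bins.items()))


# Hypothetical sample: one value sits far away from the rest.
sample = [12, 15, 15, 18, 21, 22, 15, 19, 95]

stats = summarize(sample)
hist = text_histogram(sample, 10)

print(stats)
for lower, count in hist.items():
    print(f"{lower:>3}-{lower + 9:<3} {'#' * count}")
```

A mean sitting well above the median, or a lone bar far from the others, is the kind of structure worth noticing before choosing a method.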
Monday, August 5, 2013
Wednesday, July 10, 2013
Thursday, May 2, 2013
Getting to know your data
Witten, Data Mining: Practical Machine Learning Tools and Techniques, p. 60
There is no substitute for getting to know your data. Simple tools that show histograms of the distribution of values of nominal attributes, and graphs of the values of numeric attributes (perhaps sorted or simply graphed against instance number), are very helpful. These graphical visualizations of the data make it easy to identify outliers, which may well represent errors in the data file—or arcane conventions for coding unusual situations, such as a missing year as 9999 or a missing weight as -1 kg, that no one has thought to tell you about. Domain experts need to be consulted to explain anomalies, missing values, the significance of integers that represent categories rather than numeric quantities, and so on. Pairwise plots of one attribute against another, or each attribute against the class value, can be extremely revealing.
Data cleaning is a time-consuming and labor-intensive procedure but one that is absolutely necessary for successful data mining. With a large dataset, people often give up—how can they possibly check it all? Instead, you should sample a few instances and examine them carefully. You’ll be surprised at what you find. Time looking at your data is always well spent.
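As a rough illustration of the advice above, the sketch below flags values that fall outside a plausible range (such as a year coded as 9999 or a weight of -1 kg) and draws a few instances for manual inspection. The records, the field names, and the ranges in `VALID_RANGES` are all assumptions invented for this example; in practice a domain expert should supply the ranges.

```python
import random
from collections import Counter

# Hypothetical records; 9999 and -1 mimic the sentinel codes in the quote.
records = [
    {"year": 1987, "weight_kg": 71.5},
    {"year": 9999, "weight_kg": 64.0},
    {"year": 1992, "weight_kg": -1},
    {"year": 2001, "weight_kg": 80.2},
    {"year": 9999, "weight_kg": -1},
]

# Assumed plausible ranges; a domain expert should confirm these.
VALID_RANGES = {"year": (1900, 2013), "weight_kg": (0.0, 500.0)}


def out_of_range(rows, ranges):
    """Count, per field, each value that falls outside its plausible range."""
    suspects = {field: Counter() for field in ranges}
    for row in rows:
        for field, (low, high) in ranges.items():
            value = row[field]
            if not low <= value <= high:
                suspects[field][value] += 1
    return suspects


def sample_rows(rows, k, seed=0):
    """Draw k rows at random for manual inspection (seeded for repeatability)."""
    return random.Random(seed).sample(rows, min(k, len(rows)))


suspects = out_of_range(records, VALID_RANGES)
for field, counts in suspects.items():
    print(field, dict(counts))

for row in sample_rows(records, 3):
    print(row)
```

A suspicious value that repeats many times is more likely a coding convention than a typing error, which is exactly the kind of thing to ask a domain expert about.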
Sunday, March 17, 2013
Data Science
Typical data science analyses:
Recommendation engines – increase cross-sell and repeat purchases by identifying other products in which a customer or prospect is likely to be interested
Web analytics – advanced click-stream, golden path analysis, viewer engagement, segmentation, and more.
Cross-channel marketing attribution – move beyond the skewed input of last click analysis to accurately determine campaign impact effectiveness across all channels
Influencer analysis – understand whose actions have impact in the network to encourage the behavior of peers for purchases, attrition, or just engagement.
Wednesday, March 6, 2013
Practical machine learning tricks - KDD 2011
Link
At first glance, this might appear to be a "Hello-World" machine learning problem straight out of a textbook or tutorial: we simply train a Naive Bayes classifier on a set of bad ads versus a set of good ones. However, this is apparently far from the case: while Google is understandably shy about hard numbers, the paper mentions several issues which make this especially challenging and notes that this is a business-critical problem for Google.
--
There are many useful suggestions in this post.
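For reference, the textbook "Hello-World" baseline mentioned above might look roughly like this: a multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. The toy ads and labels are invented for illustration and have nothing to do with the paper's data or with Google's actual system, which the post says is considerably harder than this.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy data, not taken from the paper.
ads = [
    ("free money click now", "bad"),
    ("win free prize now", "bad"),
    ("running shoes on sale", "good"),
    ("book flights to paris", "good"),
]

model = train(ads)
print(predict(model, "free prize money"))
print(predict(model, "shoes on sale"))
```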