
Summary:

Carter S. won his first-ever Kaggle competition — our own GigaOM WordPress Challenge — using a brute force method of data science he calls overkill analytics. Rather than spend untold hours perfecting complex models, Carter used simple algorithms and let powerful microprocessors do the rest.

Meet Carter S. He used to be a lawyer, but now he writes predictive models for an insurance company. Admittedly green in some of the newer and more advanced modeling methods, he prefers to use simple algorithms and throw as much computing power as possible at a problem. He calls the technique “overkill analytics,” and it just won him his first contest on Kaggle, defeating more than 80 other competitors in the GigaOM WordPress Challenge: Splunk Innovation Prospect (see disclosure).

Not only was this Carter’s first win, it was also his first contest. You can read the detailed explanation of his victory on his blog, but the gist is that he didn’t get too involved with complex social graphing to determine relationships or natural language processing to determine which topics readers liked. He figured out that most of what people liked came from blogs they’d already read, and that the vast majority of posts people liked fell within a three-node radius on a simple social graph.
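
To make that concrete, here is a minimal sketch (not Carter’s actual code) of how a three-hop candidate set might be pulled from a social graph, assuming a toy undirected graph built with the networkx library:

```python
import networkx as nx

# Toy, made-up graph: nodes are blogs, edges are reader-overlap links.
G = nx.Graph()
G.add_edges_from([
    ("blog_a", "blog_b"),
    ("blog_b", "blog_c"),
    ("blog_c", "blog_d"),
    ("blog_d", "blog_e"),
])

# Everything within three hops of a reader's home blog becomes a candidate source of posts.
hops = nx.single_source_shortest_path_length(G, "blog_a", cutoff=3)
candidates = set(hops) - {"blog_a"}
print(candidates)  # {'blog_b', 'blog_c', 'blog_d'}
```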

Statistically speaking, he fit a generalized linear regression model and a random forest model, then averaged their results. “I’m not sure it’s a very unique technique,” he told me, “but it’s certainly a very powerful one.”
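
As an illustration only, with scikit-learn’s make_classification standing in for the real contest data, a blend like the one he describes could look something like this: fit a GLM (here, logistic regression) and a random forest separately, then average their predicted probabilities:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stand-in data; in the contest this would be like/no-like labels per (reader, post) pair.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A generalized linear model (logistic regression) and a random forest, fit separately.
glm = LogisticRegression(max_iter=1000).fit(X_train, y_train)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# The "overkill" ensemble: simply average the two sets of predicted probabilities.
blend = (glm.predict_proba(X_test)[:, 1] + rf.predict_proba(X_test)[:, 1]) / 2
print("blended AUC:", roc_auc_score(y_test, blend))
```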

Source: Overkill Analytics

And therein lies the beauty of overkill analytics, a term that Carter might have coined, but that appears to be catching on — especially in the world of web companies and big data. Carter says he doesn’t want to spend a lot of time fine-tuning models, writing complex algorithms or pre-analyzing data to make it work for his purposes. Rather, he wants to use some simple models, reduce things to numbers and process the heck out of the data set on as much hardware as possible.

It’s not about big data so much as it is about big computing power, he said. There’s still work to be done on the smaller data sets most of the world deals with, but Hadoop clusters and other architectural advances let you do more with that data, faster, than was previously possible. Now, Carter said, as long as you account for the effects of overprocessing data, you can create a black-box-like system and run every combination of simple techniques on the data until you get the most accurate answer.
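
A hedged sketch of that black-box idea, again using scikit-learn and synthetic stand-in data: loop over a handful of simple models, cross-validate each, and keep whichever scores best:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# A grab bag of simple models; the "black box" just tries them all and keeps the winner.
candidates = {
    "glm": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
}

# Cross-validation is the guard (a partial one) against the overprocessing Carter mentions.
scores = {name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
          for name, model in candidates.items()}
print(max(scores, key=scores.get), scores)
```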

I wrote about the same general theory recently in explaining why Sparked.com’s Daniel Wiesenthal believes that big data (i.e., lots and lots of data combined with new storage and processing technologies) improves the practice of data science (i.e., the application of statistical techniques to data). The gist of his theory is that although complex models are great for small data sets, simple models can close the accuracy gap when applied to large data sets. Combine that with infrastructure that can process a lot of data relatively fast and support a wide variety of jobs, and you have a simpler, faster, equally effective method.
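
One way to see that claim, as a rough sketch rather than a reproduction of Wiesenthal’s analysis: print a simple model’s cross-validated score as the training set grows, using scikit-learn’s learning_curve on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic data; the point is only to watch a simple model improve as it sees more rows.
X, y = make_classification(n_samples=20000, n_features=30, n_informative=10, random_state=0)

sizes, _, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.05, 1.0, 8), cv=5, scoring="roc_auc")

for n, score in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:>6} training rows -> AUC {score:.3f}")
```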

Still, Carter said he didn’t get involved in Kaggle just to prove the effectiveness of overkill analytics. He does hope to get exposed to new data science techniques that haven’t yet caught on in the insurance industry, and he also wants to make a name for himself. When you work for a company with little turnover, he said, your professional network doesn’t grow too much, but doing Kaggle competitions is a great way to meet other data scientists — and winning is a great way to earn respect.

Ali Ahmad (username Xali) won the separate Splunk Innovation portion of the contest. According to a statement from Splunk, he “used Splunk’s built in statistical and visualization features to map out the relationship between blogs containing YouTube videos with those that are most likely to be viral, as measured by likes and shares. As a bonus, he fed the data into an app to view the YouTube videos most commonly liked and shared via WordPress blogs!”

Disclosure: Automattic, maker of WordPress.com, is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, GigaOm. Om Malik, founder of GigaOm, is also a venture partner at True.

Feature image courtesy of Shutterstock user nasirkhan.

Comments

  1. Great article – one of my challenges as a data analyst is producing reports that the business ‘gets’. They often think they want to see more information, but I find the more information they have, the more useless it becomes. This streamlined approach captures everything, but draws the focus where it needs to be. I’ll be integrating this technique into my reporting!

  2. Peter Norvig, Director of Research at Google, has a similar philosophy (aimed at natural language processing) that’s captured in a recent paper (and talk) called “The Unreasonable Effectiveness of Data”. In a nutshell, what others have tried to achieve with complicated models can be done using simple algorithms and reams and reams of data. The talk can be found on YouTube and is worth watching.

  3. Theodore Van Rooy Friday, September 28, 2012

    The downside: overkill-analytics often leads to overfit-analytics. As Carter himself points out: “as long as you account for the effects of overprocessing data”.

    The more computational power you throw at the problem, the more likely it is that your model happens (by chance) to work well … even measures like cross-fold validation will not prevent this overfitting.

    Not saying that it doesn’t work as an approach (it’s my favorite approach too), but that you should at least put some thought into the problem before you throw massive computational power at it.
