Data Science Toolkit Brings Big Data Analysis to the People

Pete Warden, OpenHeatMap, at Structure Big Data 2011

Pete Warden has been analysing big data on the cheap for years, and he wants you to be able to do the same. Warden, who became famous for scraping 220 million Facebook profiles, unveiled his Data Science Toolkit at GigaOM’s Structure Big Data conference in New York today. The toolkit lets anyone automate the conversions and analysis needed to make sense of massive amounts of data.

For example, Data Science Toolkit offers OCR functionality to convert PDFs or scanned image files to text files. It can also extract geographic locations from news articles and other types of unstructured data, or find political district and neighborhood information for any given location. Data Science Toolkit is available as a web service online, but it can also be downloaded and run on Amazon EC2 or in a virtual machine.

Explaining the motivation for this release, Warden said during his talk that he has been living off of ramen noodles for years, but that didn’t stop him from getting creative with data analysis. “I’ve had my service living off this same kind of budget,” he said, adding: “You can hire a hundred servers from Amazon for $10 an hour.”

That’s exactly what Warden did last year after crawling Facebook and scraping 500 million pages that represented about 220 million users. He let Amazon’s servers loose on the scraped data, and 10 hours later had it boiled down to a database-ready format. “That was about a hundred bucks,” recollected Warden.
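Warden's figures are easy to check. The short Python sketch below is illustrative only: the function name is hypothetical, and the per-server rate is not something Warden stated directly but is derived from his "hundred servers for $10 an hour" quote. It also assumes all 100 servers ran for the full 10 hours.

```python
def cluster_cost(servers: int, hours: float, rate_per_server_hour: float) -> float:
    """Total rental cost in dollars for a fleet of identical servers."""
    return servers * hours * rate_per_server_hour


# Rate implied by the quote: $10 per hour buys 100 servers.
# This is an assumption derived from the talk, not a published price.
implied_rate = 10 / 100  # dollars per server-hour

# 100 servers running for 10 hours, as in the Facebook crawl described above.
total = cluster_cost(servers=100, hours=10, rate_per_server_hour=implied_rate)
print(f"${total:.2f}")  # prints "$100.00"
```

Under those assumptions the job works out to $100, which matches the "about a hundred bucks" Warden recalled.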

The result of these efforts was a massive analysis of friendship relationships on Facebook, which got him interviews with NPR, 500,000 visitors to his blog — and an angry call from Facebook’s chief legal counsel, who didn’t like what Warden was doing with the company’s data.

The conflict with Facebook lasted several months and ended up costing Warden $3,000 in legal fees. “Big data? Cheap. Lawyers? Not so cheap,” he quipped. That episode may be one reason Warden released the Data Science Toolkit under the GPL: no lawyers are necessary to use it.

