First: scoring models at scale

Netflix is open sourcing tools for analyzing data in Hadoop

The data team at Netflix is open sourcing some of the tools it uses to analyze data stored in Hadoop. The overall open source project is called Surus, and it focuses on user-defined functions (or UDFs) that Netflix has built for Apache Hive and Pig, two higher-level frameworks that make it easier to query Hadoop data and write data-processing jobs.

The first tool Netflix has released as part of Surus is a Pig function, called ScorePMML, for scoring predictive models at scale. Within Netflix, the goal was to standardize the process of taking a model that someone has built (using R, for example) and tested on a small dataset, then running it against a much larger dataset in Hadoop and possibly rolling it out as a production model.
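For readers unfamiliar with the idea, here is a rough Python sketch of what "scoring" an exported model means. PMML (Predictive Model Markup Language) is an XML format for describing trained models; the fragment below is a heavily simplified, hypothetical linear regression with made-up field names (`hours_watched`, `titles_rated`), and the helper functions are illustrative only. This is not the Surus or ScorePMML API, and it runs on a single machine rather than as a Pig job.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified PMML fragment: real PMML files also carry a
# namespace, a DataDictionary and a MiningSchema, all omitted here.
PMML_DOC = """
<PMML version="4.2">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="hours_watched" coefficient="0.5"/>
      <NumericPredictor name="titles_rated" coefficient="0.25"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def load_linear_model(pmml_text):
    """Read the intercept and per-field coefficients from the fragment."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefs = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("NumericPredictor")
    }
    return intercept, coefs


def score(record, intercept, coefs):
    """Apply the linear model to one record (a dict of field -> value)."""
    return intercept + sum(c * record[name] for name, c in coefs.items())


def score_all(records, pmml_text=PMML_DOC):
    """Parse the model once, then score every record with it."""
    intercept, coefs = load_linear_model(pmml_text)
    return [score(r, intercept, coefs) for r in records]


# Made-up input records standing in for rows of a much larger dataset.
RECORDS = [
    {"hours_watched": 2.0, "titles_rated": 4.0},
    {"hours_watched": 0.0, "titles_rated": 0.0},
]
SCORES = score_all(RECORDS)
print(SCORES)
```

The point of the pattern is that the model is parsed once and then applied record by record, which is the kind of work a UDF can spread across a Hadoop cluster.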

According to the blog post introducing Surus and ScorePMML, future releases will include tools for tasks such as pattern recognition and outlier detection. The post goes into more detail about how ScorePMML works, where it shines and what its limitations are.

Netflix, of course, has become a poster child for the benefits of data analysis, using it to inform everything from content recommendations to streaming performance, and a leader in open source technology. For more on its efforts in both of these areas, check out my November interview with Netflix Chief Product Officer Neil Hunt, and our Structure Show podcast interview with Netflix engineers Ruslan Meshenberg and Andrew Spyker.

To learn even more about how the best and brightest companies around are using data to glean insights and build entirely new products, check out our Structure Data conference March 18 and 19 in New York. Speakers include data experts from companies such as BuzzFeed, Facebook, ESPN, Spotify and Goldman Sachs.

2 Responses to “Netflix is open sourcing tools for analyzing data in Hadoop”

  1. Peter Fretty

    This can only make the industry stronger. The more people are comfortable operating within a data-centric environment, the closer we get to having a widespread culture that appreciates the potential opportunities when taking a deep dive into data.

    Peter Fretty, IDG blogger working on behalf of SAS