Netflix shows off how it does Hadoop in the cloud

Netflix (s nflx) is the undeniable king of computing in the cloud — running almost entirely on the Amazon Web Services (s amzn) platform — and its reign extends into big data workloads, too. In a Thursday evening blog post, the company shared the details of its AWS-based Hadoop architecture and a homemade Hadoop Platform as a Service that it calls Genie.

That Netflix is a heavy Hadoop user is hardly news, though. In June, I explained just how much data Netflix collects about users and some of the methods it uses to analyze that data. Hadoop is the storage and processing engine for much of this work.

As blog post author Sriram Krishnan points out, however, Hadoop is more than a platform on which data scientists and business analysts can do their work. Aside from its 500-plus-node cluster of Elastic MapReduce instances, there's another equally sized cluster for extract-transform-load (ETL) workloads — essentially, taking data from other sources and making it easy to analyze within Hadoop. Netflix also deploys various "development" clusters as needed, presumably for ad hoc experimental jobs.

And while Netflix's data-analysis efforts are interesting in their own right, the cloud makes its Hadoop architecture noteworthy, too. For starters, Krishnan explains how using S3 as the storage layer instead of the Hadoop Distributed File System means, among other things, that Netflix can run all of its clusters separately while sharing the same data set. It does, however, use HDFS at some points in the computation process to make up for the inherently slower method of accessing data via S3.
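The split Krishnan describes — S3 as the shared, durable store and HDFS for fast mid-job data — can be sketched roughly as follows. This is an illustrative sketch, not Netflix's actual code; the bucket name and path layout are hypothetical assumptions.

```python
def plan_job_paths(dataset: str, job_id: str) -> dict:
    """Build the input/intermediate/output locations for one Hadoop job.

    The bucket and directory layout here are hypothetical examples.
    """
    # Shared S3 bucket: every cluster (ETL, ad hoc, dev) sees the same data.
    shared_bucket = "s3://example-dw-bucket"
    return {
        # Input and final output live in S3, so separate clusters can
        # operate on the same data set without copying it around.
        "input": f"{shared_bucket}/warehouse/{dataset}/",
        "output": f"{shared_bucket}/results/{job_id}/",
        # Intermediate data stays on cluster-local HDFS, compensating
        # for S3's higher access latency mid-computation.
        "intermediate": f"hdfs:///tmp/{job_id}/",
    }

paths = plan_job_paths("viewing-history", "job-0001")
print(paths["input"])  # s3://example-dw-bucket/warehouse/viewing-history/
```

The design trade-off is durability and shareability (S3) versus locality and speed (HDFS), paid for with a copy step at the boundaries of each job.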

Netflix also built its own PaaS-like layer for Amazon Elastic MapReduce, which it calls Genie. This lets engineers submit jobs via a REST API without having to know the specifics of the underlying infrastructure. This is important because it means Hadoop users can submit jobs to whatever clusters happen to be available at any given time (Krishnan goes into some detail about the resource-management aspects of Genie) and without worrying about the sometimes-transient nature of cloud resources.
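In spirit, a Genie-style submission looks something like the sketch below: the engineer names a logical cluster by tag, and the service — not the user — resolves which physical cluster is currently backing that tag. The endpoint, field names, and tags are illustrative assumptions, not Genie's documented schema.

```python
import json

def build_job_request(name: str, command: str, cluster_tag: str) -> dict:
    """Assemble a hypothetical job-submission payload.

    The caller specifies a logical cluster tag (e.g. "etl" or "adhoc");
    the service maps it to whatever physical cluster is available,
    shielding users from transient cloud infrastructure.
    """
    return {
        "name": name,
        "command": command,
        "clusterTag": cluster_tag,
    }

payload = build_job_request(
    name="daily-report",
    command="hive -f daily_report.q",
    cluster_tag="adhoc",
)
# An engineer would then POST this JSON to the service's REST endpoint,
# e.g.: requests.post("https://genie.example.com/jobs", json=payload)
print(json.dumps(payload))
```

The payoff of the indirection is that clusters can be resized, replaced, or retired without any change to the jobs users submit.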

We’ve long been pushing the intersection of big data and cloud computing, although the reality is that there aren’t really a lot of commercial options that mix user-friendliness and heavy-duty Hadoop workload management. There’ll no doubt be more offerings in the future — Infochimps and Continuuity are certainly working in this direction, and Amazon is also pushing its big data offerings forward — but, for now, leave it to Netflix to build its own. (And if you’re interested in custom-built Hadoop tools, check out our recent coverage of Facebook’s (s fb) latest effort.)

4 Responses to “Netflix shows off how it does Hadoop in the cloud”

  1. As companies gather greater volumes of disparate kinds of data (i.e., both structured and unstructured), they are also looking for solutions that can scale. This kind of data includes data generated by social media activity streams, existing ERP systems, and network data. Continuous, real-time analysis of large amounts of data is becoming more prevalent. For example, telecommunications service providers need real-time visibility into activation system service orders, showing activation times and lists of potential exceptions. Internet service providers need real-time visibility across field work operations to improve the management and prioritization of work orders, installer schedules, and maintenance requests. The recent buzz around "Big Data" solutions is growing louder, but only Operational Intelligence solutions are purpose-built for analyzing Big Data.

  2. Bruno Aziza

    Love these stories, Derrick. This one reminds me a lot of the one you wrote about Facebook earlier in the year. The key question each of these profiles raises for me is: how can companies of all sizes learn from such examples?

    Some kind of checklist would be interesting to share – there are millions of companies that need help with Big Data but the task still seems daunting to them because they assume that they need the big teams, big hardware and big budgets that these leaders have.

    Analytically Yours,
    Bruno Aziza