
Summary:

Netflix is at it again, this time showing off its homemade architecture for running Hadoop workloads in the Amazon Web Services cloud. It’s all about the flexibility of being able to run, manage and access multiple clusters while eliminating as many barriers as possible.


Netflix is the undeniable king of computing in the cloud — running almost entirely on the Amazon Web Services platform — and its reign extends into big data workloads, too. In a Thursday evening blog post, the company shared the details of its AWS-based Hadoop architecture and a homemade Hadoop Platform as a Service that it calls Genie.

That Netflix is a heavy Hadoop user is hardly news, though. In June, I explained just how much data Netflix collects about users and some of the methods it uses to analyze that data. Hadoop is the storage and processing engine for much of this work.

As blog post author Sriram Krishnan points out, however, Hadoop at Netflix is more than a platform on which data scientists and business analysts can do their work. Aside from its 500-plus-node cluster of Elastic MapReduce instances, there’s another equally sized cluster for extract-transform-load (ETL) workloads — essentially, taking data from other sources and making it easy to analyze within Hadoop. Netflix also deploys various “development” clusters as needed, presumably for ad hoc experimental jobs.

And while Netflix’s data-analysis efforts are pretty interesting, running in the cloud makes its Hadoop architecture interesting in its own right. For starters, Krishnan explains how using S3 as the storage layer instead of the Hadoop Distributed File System means, among other things, that Netflix can run all of its clusters separately while sharing the same data set. It does, however, use HDFS at some points in the computation process to offset the inherently slower access to data stored in S3.
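To make that pattern concrete, here is a minimal sketch (using the third-party mrjob library, not anything Netflix describes) of a Hadoop job whose input and final output live in S3, so any cluster can run against the same data set, while intermediate shuffle data stays on the cluster’s local storage. The bucket, path, and job names are hypothetical.

```python
# Minimal sketch: a MapReduce job run against S3-resident data via the
# third-party mrjob library. All names here (bucket, paths, job) are
# hypothetical; this illustrates the pattern, not Netflix's actual code.
from mrjob.job import MRJob


class PlayEventCount(MRJob):
    # Counts events per title in tab-delimited log lines whose first
    # field is a title id.

    def mapper(self, _, line):
        title_id = line.split("\t", 1)[0]
        yield title_id, 1

    def reducer(self, title_id, counts):
        yield title_id, sum(counts)


if __name__ == "__main__":
    PlayEventCount.run()

# Launched against Elastic MapReduce with S3 paths on both ends, e.g.:
#   python play_event_count.py -r emr \
#       s3://example-warehouse/raw-events/2013-01-10/ \
#       --output-dir=s3://example-warehouse/title-counts/2013-01-10/
# Input and final output stay in S3, visible to every cluster; only the
# map output and shuffle data spill to the cluster's local disks, which
# is where HDFS's speed advantage still matters.
```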

Netflix also built its own PaaS-like layer atop Amazon Elastic MapReduce, which it calls Genie. This lets engineers submit jobs via a REST API without having to know the specifics of the underlying infrastructure. That matters because Hadoop users can submit jobs to whatever clusters happen to be available at any given time (Krishnan goes into some detail about Genie’s resource management) without worrying about the sometimes-transient nature of cloud resources.
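For flavor, here is a hedged sketch of what submitting a job through a Genie-style REST layer might look like. The endpoint URL, payload fields, and response keys below are illustrative assumptions about the pattern Krishnan describes, not Genie’s actual API, which Netflix hasn’t published in detail.

```python
# Hedged sketch of a Genie-style job submission over REST. The URL, JSON
# fields, and response keys are hypothetical illustrations of the pattern,
# not Genie's actual schema.
import requests

GENIE_URL = "http://genie.example.com/jobs"  # hypothetical endpoint

job_request = {
    "name": "daily-title-counts",
    "type": "hadoop",  # a Genie-like service might also accept "hive" or "pig"
    "commandArgs": (
        "jar title-counts.jar "
        "s3://example-warehouse/raw-events/ "
        "s3://example-warehouse/title-counts/"
    ),
}

# Submit the job; the service, not the caller, decides which cluster runs it.
resp = requests.post(GENIE_URL, json=job_request, timeout=30)
resp.raise_for_status()
job_id = resp.json()["id"]

# Poll by id. The caller never needs to know whether the job landed on the
# production, ETL, or a transient development cluster.
status = requests.get(f"{GENIE_URL}/{job_id}", timeout=30).json()["status"]
print(job_id, status)
```

The point of the design is the indirection: because callers only ever hold a job id and a service URL, Netflix can add, drain, or retire clusters behind the API without breaking anyone’s workflow.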

We’ve long championed the intersection of big data and cloud computing, although the reality is that there aren’t many commercial options that combine user-friendliness with heavy-duty Hadoop workload management. There’ll no doubt be more offerings in the future — Infochimps and Continuuity are certainly working in this direction, and Amazon is also pushing its big data offerings forward — but, for now, leave it to Netflix to build its own. (And if you’re interested in custom-built Hadoop tools, check out our recent coverage of Facebook’s latest effort.)

  1. Love these stories, Derrick. This one reminds me a lot of the one you wrote about Facebook earlier in the year. The key question each of these profiles raises for me is: how can companies of all sizes learn from such examples?

    Some kind of checklist would be interesting to share – there are millions of companies that need help with Big Data, but the task still seems daunting to them because they assume they need the big teams, big hardware and big budgets that these leaders have.

    Analytically Yours,
    Bruno Aziza
    http://www.sisense.com

    1. The good news is that Netflix is open sourcing lots of stuff that will help us small guys. :)

  2. As companies gather greater volumes of disparate kinds of data (i.e., both structured and unstructured), they are also looking for solutions that can scale. This kind of data includes social media activity streams, existing ERP systems, and network data. Continuous, real-time analysis of large amounts of data is becoming more prevalent. For example, telecommunications service providers need real-time visibility into activation-system service orders, showing activation times and lists of potential exceptions. Internet service providers need real-time visibility across field operations to improve the management and prioritization of work orders, installer schedules, and maintenance requests. The recent buzz around “Big Data” solutions is growing louder, but only Operational Intelligence solutions are purpose-built for analyzing Big Data.

  3. 35+ speakers, 20+ sessions, and 18+ exhibitors are attending the Global Big Data Conference on Jan 28, 2013, at the Santa Clara Convention Center. Register at http://globalbigdataconference.com/registration.php to attend the event. Get a 20% discount using the code BLOG.

