Summary:

Netflix has open sourced its software to make running Hadoop jobs on the Amazon Web Services cloud as easy as possible.

Netflix runs a lot of Hadoop jobs on the Amazon Web Services cloud computing platform, and on Friday the video-streaming leader open sourced its software to make running those jobs as easy as possible. Called Genie, it’s a RESTful API that makes it easy for developers to launch new MapReduce, Hive and Pig jobs and to monitor longer-running jobs on transient cloud resources.
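
To make the idea of a RESTful job-submission API concrete, here is a minimal Python sketch of what a client call could look like. It is not taken from Genie's documentation: the URL, the resource path and the JSON field names (`jobType`, `script`, `user`) are all hypothetical placeholders, and the real API may look quite different.

```python
import json
from urllib import request

# Hypothetical endpoint; the host, port and path are assumptions for illustration.
JOBS_URL = "http://genie.example.com/jobs"

# The three job types named in the article.
JOB_TYPES = {"mapreduce", "hive", "pig"}


def build_job_request(job_type, script_uri, user, url=JOBS_URL):
    """Build (but do not send) an HTTP POST describing a job to launch."""
    kind = job_type.lower()
    if kind not in JOB_TYPES:
        raise ValueError("unsupported job type: " + job_type)
    # Hypothetical payload shape; field names are not from the Genie API.
    payload = {"jobType": kind, "script": script_uri, "user": user}
    return request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit(req):
    """Send the request and decode a JSON response (requires a live service)."""
    with request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

A caller would build a request with something like `build_job_request("hive", "s3://example-bucket/report.q", "alice")` and pass it to `submit`. Monitoring a longer-running job would presumably be a follow-up GET against whatever job identifier the service returns, which is also an assumption here.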

In the blog post detailing Genie, Netflix’s Sriram Krishnan explains in more detail what Genie is and is not. Essentially, Genie is a platform as a service running on top of Amazon’s Elastic MapReduce Hadoop service. It’s part of a larger suite of tools that handles everything from diagnostics to service registration.

It is not a cluster manager or workflow scheduler for building ETL processes (e.g., processing unstructured data from a web source, adding structure and loading into a relational database system). Netflix uses a product called UC4 for the latter, but it built the other components of the Genie system.

[Figure: Genie architecture diagram]

Netflix first discussed Genie in January, when it showed off the company’s overall Hadoop architecture within the AWS cloud. While Genie is near the top of the overall stack, the foundation is interesting as well. Rather than maintaining a massive set of instances (or multiple separate ones) running the Hadoop Distributed File System, Netflix uses Amazon’s S3 object storage service as its big data bit bucket, so all of its Hadoop jobs access the same common, reliable data store.

[Figure: Netflix’s Hadoop architecture on AWS]

As with the rest of Netflix’s numerous open source projects on top of AWS (it runs the entire streaming business on the platform), it’s hard to gauge how much traction Genie will pick up or what kinds of products it might inspire. Netflix Cloud Architect Adrian Cockcroft has told me he’s fielding inquiries from quite a few large companies and organizations that essentially want to build their own internal Netflix-style cloud platform as a service. Smaller companies are adopting these tools, too, although it can be difficult to track who exactly is accessing the code from GitHub and what they’re doing with it.

AWS might get inspired to build on the Netflix code, or at least take a lesson from it. In the Hadoop space alone, Elastic MapReduce is a pretty low-level service, but Netflix’s Genie makes it more akin to higher-level offerings such as Altiscale, Qubole, Infochimps, Continuuity and Mortar Data. AWS might be fine selling standard Lego blocks, as Cockcroft described most AWS services (in fact, some of the aforementioned services run on AWS), but there’s a lot of money to be made selling those Star Wars kits that add polish to the original.

  1. Nitin Karandikar Saturday, June 22, 2013

    “Rather than maintaining a massive set of instances (or multiple separate ones) running the Hadoop Distributed File System, Netflix uses Amazon’s S3 object storage service as its big data bit bucket”

    Why is it a surprise that they use S3 to store the data used by Elastic MapReduce? It seems the logical thing to do. Hadoop jobs running on AWS are easily set up to access data from S3, and the cost of transferring large-scale data sets to and from S3 is likely to be much larger than that of storing it in S3 in the first place; there is also the time lag involved in getting the data to and from S3 if the data is external.

    1. You are concluding that by using S3, the overall cost is “high,” right?
      1) cost of data transfer to/from is high
      2) cost of storing in buckets is high
      3) latency/lag time involved in fetching/storing data from/to is high

      If so, then why does Netflix use S3 instead of other options? Can you explain?

      1. @Jare

        Correct on the latency point, but otherwise
        1) Transfer into S3 is free, and transfer between S3 and EC2 instances in the same region is also free.
        2) S3 storage is the least expensive option that scales infinitely (and it is cheaper still with Reduced Redundancy Storage (RRS)); other options do not scale that way.

        Rule of thumb: store it once (only) on the correct medium. S3 fits that bill exactly.

  2. The Genie is out.
    Kudos to Netflix for the initiative.
    A rich addition to the open source, big data space.
