
Summary:

With the release of Hadoop 2.0, MapReduce takes a back seat. But who in the front seat will take the wheel?

When Hadoop first started gaining attention and early adoption, it was inseparable – both technologically and rhetorically – from MapReduce, its then-dominant big data processing framework. But that’s changing, and rapidly. With the release of Hadoop 2.0, MapReduce is taking a backseat to newer technology. But of all the front-seat occupants, who will take the wheel?

MapReduce in big data history

Originally, MapReduce was essentially hardwired into the guts of Hadoop’s cluster management infrastructure. It was a package deal: big data pioneers had to take the bitter with the sweet. At first this seemed reasonable, since MapReduce is genuinely powerful: it divides the query work – and the data – among the numerous servers in its cluster, coordinates the work between them, and assembles the answer.
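That division of labor can be sketched in a few lines of plain Python. This is a single-process toy – not Hadoop’s actual API, and the function names are invented for illustration – but it shows the pattern: mappers emit key/value pairs from their slice of the data, the framework shuffles pairs by key, and reducers aggregate.

```python
from collections import defaultdict

def map_phase(chunk):
    # Each mapper emits (word, 1) pairs for its slice of the input.
    return [(word, 1) for line in chunk for word in line.split()]

def shuffle(mapped):
    # The framework groups intermediate pairs by key across all mappers.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each reducer aggregates the values for the keys assigned to it.
    return {key: sum(values) for key, values in groups.items()}

# Two hypothetical "servers," each mapping its own chunk of the input.
chunks = [["the quick brown fox"], ["the lazy dog the end"]]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
```

In a real cluster, the map and reduce calls run on different machines and the shuffle moves data over the network, but the shape of the computation is the same.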

The problem underlying all of this is pretty simple: MapReduce’s batch-processing approach (where jobs are queued up and then dutifully run) doesn’t cut it when multiple short-lived queries need to run in quick succession. Hadoop 2.0 introduces YARN (an acronym for “Yet Another Resource Negotiator”) as a processing-engine-independent cluster management layer. It can run MapReduce jobs, but it can host an array of other engines as well.
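Conceptually, YARN separates “who gets cluster resources” from “what runs in them.” A deliberately simplified sketch of that split – the class and method names here are invented for illustration, not YARN’s real API:

```python
# A cluster-wide resource manager hands out generic "containers";
# each application brings its own engine logic to run inside them.
class ResourceManager:
    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        # Grant as many containers as are available, up to the request.
        granted = min(requested, self.free)
        self.free -= granted
        return granted

rm = ResourceManager(total_containers=10)

# A MapReduce job and a Spark job are just two clients of the same layer.
mr_containers = rm.allocate(6)
spark_containers = rm.allocate(6)  # only 4 remain, so the grant is partial
```

The point of the sketch is that the resource layer neither knows nor cares which engine occupies a container – which is exactly what lets engines other than MapReduce run on the same cluster.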

Along comes Spark

Meanwhile, separate from the development of YARN, AMPLab, a research lab within the University of California, Berkeley, developed an in-memory distributed processing engine called Spark. Spark can run on Hadoop clusters and, because it uses memory instead of disk, avoids MapReduce’s batch-mode malaise. Better still, Hortonworks has certified Spark as “YARN ready,” with Databricks (the commercial entity founded by Spark’s AMPLab creators) providing a quote in the press release announcing the designation.

So far, so good. YARN provides a general framework for batch and interactive engines to process data in a Hadoop cluster, and Spark is one such engine – one that keeps its working data in random access memory (RAM) for very fast results on certain workloads.
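The practical difference shows up on iterative jobs. A pure-Python sketch (not Spark code; the dataset and helper are made up for illustration) of why caching the working set in memory helps:

```python
import os
import tempfile

def load_from_disk(path):
    # A batch engine re-reads its input from disk for every job it runs.
    with open(path) as f:
        return [int(line) for line in f]

# A small on-disk dataset standing in for files in HDFS.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write("\n".join(str(n) for n in range(10)))

# MapReduce-style: every pass of an iterative algorithm pays the
# disk-read cost again.
batch_results = []
for _ in range(3):
    data = load_from_disk(path)  # disk I/O on each pass
    batch_results.append(sum(data))

# Spark-style: read once, keep the working set in memory, and run
# every subsequent pass against the cached copy.
cached = load_from_disk(path)  # one read, then reused
inmem_results = [sum(cached) for _ in range(3)]
```

Both approaches compute the same answers; the in-memory one simply skips the repeated disk reads, which is why iterative workloads like machine learning benefit so much.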

A question remains, though: what about other Hadoop distribution components – like the SQL query layer Hive or the data transformation scripting environment Pig – that rely on MapReduce? How can those components be retrofitted to take advantage of the shifts in Hadoop’s architecture?

Up the stack

Hortonworks, whose engineering team effectively spearheaded the work on YARN, also created a component for it called Tez that sits between Hive or Pig on one side and YARN on the other. Hortonworks contributed Tez to Hadoop’s Apache Software Foundation code base, along with an updated version of Hive.

Get the most recent versions of Hive and Hadoop itself and, bam!, you can use them interactively for iterative query work. Meanwhile, an industry consortium, which includes Cloudera and MapR, has announced it will be retrofitting Hive and Pig – as well as most other Hadoop distro components – to run directly against Spark.

Symbiotic adversaries

Spark and Tez, which in most contexts likely wouldn’t be compared, suddenly find themselves competitors. Both of them pave the way for MapReduce’s diminished influence, and for interactive Hadoop to move to the mainstream. But with the competitive approaches they offer, there is a risk of fragmentation here and customers should take notice.

In-memory engines work extremely well for certain workloads, machine learning chief among them. But making an in-memory engine the default for almost every job, especially those that get into petabyte-scale (or larger) data volumes, seems unorthodox.

Batch-oriented MapReduce’s exclusive hold on data processing wasn’t enterprise-ready. YARN, Tez and Spark have all emerged to address that shortcoming. The irony here is that giving customers multiple competing ways to use the very same Hadoop distribution components isn’t especially well-suited to the enterprise, either.

The engines, united?

If YARN’s open architecture is to enable multiple, nuanced, overlapping solutions, then an optimizer that picks the right one for a given query may be needed, so that the customer needn’t make that decision, query after query. Choice is good, but fragmentation and complexity are not.

In the 1980s, the UNIX operating system splintered badly, and this impeded market momentum for that operating system. In this decade, Hadoop has become a data operating system. Hopefully it will avoid UNIX-like entropy.

Update: This article previously stated that Hortonworks worked with personnel at Databricks to make Spark run on YARN.  It was updated on July 21st to reflect that Hortonworks certified Spark as “YARN ready.”

Comments

  1. A lot of software doesn’t need MapReduce (Impala replaces Hive, etc.) and this is probably the healthiest way forward if Hadoop is to survive. Furthermore, having two ecosystem projects compete will reduce the possibility of what data scientists dread – Hadoop being replaced by something from a more traditional tech player (IBM/Intel) and the end of the Big Data hype/boom.

  2. “Hortonworks worked with personnel at Databricks (the commercial entity founded by Spark’s AMPLab creators) to make Spark run on YARN.” Hortonworks has actually contributed nothing to Spark on YARN, and nothing to Spark. AMPLab folks and contributors built this integration a year ago, with recent help from Cloudera. Credit where due.

    1. >AMPLab folks and contributors built this integration a year ago, with recent help from Cloudera. Credit where due.

      Talking about credit, I wonder whether the contributions from others are really less significant than Cloudera’s recent interest suggests. Unless giving credit is all about advertising.

      1. I don’t understand, but if you mean “AMPLab folks and contributors” misrepresents contributors to YARN integration, I mean it has been tgravescs, mridul, et al. with AMPLab committers. Right? I take issue with articles echoing Hortonworks’ claims, starting with “we certified SPARK on YARN”, when the company has actually contributed zero to the project. It too freely implies credit for work done by others, including my colleagues. Right?

          1. The people who made the Storm-on-YARN release are not part of the Hadoop companies mentioned here. Spark, Tez and YARN are projects under the Apache Software Foundation. Credit goes to its members. Many of them are employed by entities that may or may not have a commercial interest in Hadoop. They collaborate across organizations to make the projects successful.

          Tez vs. Spark is the more entertaining debate. It’s a shame to make it Cloudera vs. Hortonworks.

            1. (The article isn’t about Storm.) I regret mentioning Cloudera, because it’s mostly others’ work that was misattributed. The mention makes people read it as vendors yapping. It wasn’t intended that way, and I don’t think it is: I called out a misleading idea marketed repeatedly by a vendor, and it has been corrected above. I don’t think the whole ASF is due credit for this particular functionality, since I know who didn’t work on it, and that’s central to the point. Yes, let’s look past marketing and actually talk about tech.

            1. sorry, I meant to say spark-on-yarn.

              2. Well, actually, referring to the work of the one organization that contributed the majority of it only as “contributors” or by individual names, while mentioning others by organization name, is just as misleading: it eclipses the actual amount of contribution and doesn’t give due credit. I was confused by the comment and actually ended up mailing our Spark team to ask, didn’t you tell us you did the major work on Spark on YARN with AMPLab? I hope in future credit is given simply to the Apache Spark community or, failing that, proper credit is given to all stakeholders.

  3. Reynold Xin Sunday, July 20, 2014

    Actually, Spark is capable of handling workloads much larger than the amount of memory available. It has a computation primitive that lets applications leverage in-memory computation, but you can run it entirely the way MR is run (streaming through the data) without using much memory. This is the most common misconception among people new to Spark.

  4. What we need are runtime comparisons of the same workloads run through Tez and Spark, perhaps with Hive as the frontend and Tez vs. Spark as the candidate backends. The workloads should be devised such that the data size is at least an order of magnitude greater than the sum of the RAM in the cluster nodes.

  5. Nice write-up Andrew. Fully agree with your perception of Hortonworks / Cloudera diverging strategies on the role of Spark and Tez in their respective stacks. Sean’s comment further reinforces this assessment. I actually wrote a much less elaborate blog post on the topic recently at https://www.linkedin.com/today/post/article/20140711075931-1325021-live-from-the-spark-summit-part-2-cloudera-s-bid-to-reclaim-open-source-leadership. Slightly different perspective but same conclusion…

  6. Also worth mentioning:

    Apache Spark is among the most active Apache projects, with many, and widely diverse, contributors. Spark is also shipped by multiple platform vendors.

    In contrast, Tez contributions are dominated by a single company (Hortonworks), and Tez is also shipped by that one company. One can’t obtain support for it anywhere else.

    So, adoption considerations have to go beyond what a technology can “do”. One also has to consider in which direction the entire ecosystem is heading.

  7. Check out the Pig on Tez effort to see the diversity of the companies involved in getting things to run on Tez (including LinkedIn, Netflix, Yahoo and others). While many of the Tez committers are from Hortonworks, there are many committers/contributors from other companies.
    Cascading already has support for Tez. Hive 0.13 already has support for Tez. Former MapReduce users seem to be headed toward Tez.

    It is interesting to see “domination by a single company” being used as an argument by a Cloudera employee! Tez is a community-driven open source project (governed by Apache Software Foundation bylaws), unlike Cloudera’s flagship projects like Cloudera Manager and Cloudera Impala.
