
Summary:

Apache Spark, an in-memory data-processing framework, is now a top-level Apache project. That’s an important step for Spark’s stability as it increasingly replaces MapReduce in next-generation big data applications.

MapReduce was fun and pretty useful while it lasted, but it looks like Spark is set to take the reins as the primary processing framework for new Hadoop workloads. The technology took a meaningful, if not huge, step toward that end on Thursday when the Apache Software Foundation announced that Spark is now a top-level project.

Spark has already garnered a large and vocal community of users and contributors because it’s faster than MapReduce (in memory and on disk) and easier to program. This means it’s well suited for next-generation big data applications that might require lower-latency queries, real-time processing or iterative computations on the same data (e.g., machine learning). Spark’s creators from the University of California, Berkeley, have created a company called Databricks to commercialize the technology.

Spark is technically a standalone project, but it was always designed to work with the Hadoop Distributed File System. It can run directly on HDFS, inside MapReduce and, thanks to YARN, it can now run alongside MapReduce jobs on the same cluster. In fact, Hadoop pioneer Cloudera is now providing enterprise support for customers that want to use Spark.

The ecosystem of Spark projects. Source: Databricks


However, MapReduce isn’t yesterday’s news quite yet. Although many new workloads and projects (such as Hortonworks’ Stinger) use alternative processing frameworks, there’s still a lot of tooling for MapReduce that Spark doesn’t have yet (e.g., Pig and Cascading), and MapReduce is still quite good for certain batch jobs. Plus, as Cloudera co-founder and Chief Strategy Officer Mike Olson explained in a recent Structure Show podcast (embedded below), there are a lot of legacy MapReduce workloads that aren’t going anywhere anytime soon even as Spark takes off.

If you want to hear more about Spark and its role in the future of Hadoop, come to our Structure Data conference March 19-20 in New York. Databricks co-founder and CEO Ion Stoica will be speaking as part of our Structure Data Awards presentation, and we’ll have the CEOs of Cloudera, Hortonworks, and Pivotal talking about the future of big data platforms and how they plan to capitalize on them.

Featured image from Thinkstock/Loops7

  1. I love how MapReduce is now considered ‘legacy’ in the Hadoop menagerie of engines.

  2. Sudhir Nallagangu Monday, March 3, 2014

    I get the impression MapReduce and Spark serve different use cases, with MapReduce targeting batch and analytic processing while Spark meets real-time needs. Both use HDFS, with Spark able to take in several “streams” of real-time input data while MapReduce runs against static big data in both SQL and NoSQL stores.

    1. Derrick Harris Monday, March 3, 2014

      They certainly can, which is why Spark has so much excitement right now for interactive SQL, machine learning, stream processing, etc.
