Spark is now part of MapR’s Hadoop distro, too

Hadoop vendor MapR is getting in early on the Apache Spark action, too, announcing on Thursday that it’s adding the Spark stack to its Hadoop distribution as part of a partnership with Spark startup Databricks (whose co-founder and CEO, Ion Stoica, is pictured above). Spark allows for faster processing and easier programming of big data workloads.

An in-memory processing framework originally developed at the University of California, Berkeley, Spark has been rising in popularity over the past year or so, but it really hit the mainstream with the launch of Databricks in September 2013. Since then, Cloudera has added Spark to its Hadoop distribution (as part of a partnership with Databricks), the Apache Spark project has reached top-level status, and numerous projects and companies originally designed with Hadoop in mind are planning to support Spark or move to it whole hog.

These include Cloudera’s Oryx project, analytics startup Platfora and even the Apache Mahout project, as well as companies participating in Databricks’ certification program for Spark.

Spark is arguably so popular right now as much because of what it is as what it isn’t: MapReduce. The traditional data-processing framework for Hadoop, MapReduce is slow (it’s a batch processor) and notoriously difficult to program. Spark is fast and flexible (making it better for tasks such as machine learning, graph processing and interactive queries) and easy to program. It’s written in Scala, but also supports programming in Java, Python and, in time, R.
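To make the programming-model contrast a little more concrete, here is a minimal plain-Python sketch of the three phases a MapReduce-style word count passes through. It is only an illustration: it does not use Hadoop's or Spark's actual APIs, and the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, chosen for this example.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three phases; a real framework would distribute each one across a cluster."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


# Hypothetical sample input, invented for this sketch.
lines = ["spark is fast", "mapreduce is batch", "spark is flexible"]
counts = word_count(lines)
print(counts)
```

Even a simple job has to be forced into this rigid map-then-reduce shape, and anything more involved generally means chaining several such jobs together. Frameworks such as Spark instead let a program express the same logic as a short chain of operations on a dataset held in memory.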

Much of this support for Spark is possible because of YARN, the resource-management system that’s part of Hadoop 2.0 and lets numerous processing frameworks run simultaneously on the same cluster, all accessing the Hadoop Distributed File System for storage.

Source: Databricks

An interesting note about the MapR news is that the company is supporting the entire Spark stack, which includes the Shark SQL query engine (it’s essentially a faster Apache Hive) and the MLlib machine learning library, whereas Cloudera does not support Shark. Presumably, that’s because Cloudera is still incentivized to push its Impala SQL query engine, which is not built on MapReduce, either. MapR has been leading the development of the Apache Drill project for interactive SQL queries, but has also added native support for HP Vertica as Drill comes along. (Results from benchmark tests of several new big data query engines are available here, but the SQL-on-Hadoop space is much bigger than what’s included there.)

From a MapR perspective, the addition of Spark advances that company’s approach of standing out in the Hadoop space (where it has received much less attention and has raised much less capital than competitors Cloudera and Hortonworks) by incorporating the technologies its customers want. Already, for example, MapR has developed its own version of the HBase NoSQL data store that is more fully featured than the open source version included in other Hadoop distributions.

It’s technologies such as Spark — and anything running atop YARN, really — that make Hadoop such a potentially disruptive force for existing data industry vendors. Apache Hadoop has always offered cheap, open source storage, but now the ecosystem is turning Hadoop into a platform that can do many things on top of all that data. We’ll see a lot more analytic applications, possibly even databases, launch in the next couple years using Spark and similar technologies as their engines.