
Summary:

A team of professors behind the open source Spark and Shark in-memory big data projects has raised $13.9 million to commercialize the products via a company called Databricks. Spark and Shark are designed to be much faster and more flexible than Hadoop MapReduce and Hive.

A team of professors that has created the in-memory Spark and Shark platforms for analyzing big data has raised nearly $13.9 million to commercialize those products. The company is still in stealth mode, but it’s called Databricks and Andreessen Horowitz led the round. The only information on the company’s website is, “We are using cutting-edge technology based on years of research to build next-generation software for analyzing and extracting value from data. We created Apache Spark and Shark, and are deeply committed to open-source.”

It also lists Databricks’ very impressive board of directors: Co-founder and CEO Ion Stoica (University of California, Berkeley professor and co-founder and CTO of Conviva); Co-founder and CTO Matei Zaharia (MIT professor); Ben Horowitz (general partner at Andreessen Horowitz and former Opsware co-founder and CEO); and Scott Shenker (University of California, Berkeley professor and former Nicira co-founder and CEO). Stoica, Zaharia and Shenker have all been heavily involved in the creation of Spark and Shark, which are projects of UC Berkeley’s AMPLab. Spark is also an Apache incubator project.

For those not familiar with Spark, it is a big data platform written in Scala and designed to run very fast. Stoica wasn’t much more forthcoming on details during a recent phone call, but he did explain the promise of Spark as compared with Hadoop MapReduce. Essentially, he said, it’s up to 100 times faster if your dataset can fit in memory, but it’s built to be significantly faster even on disk. It’s also architected differently than MapReduce in ways that make it ideal for machine learning algorithms and data mining workloads, where users might want to iterate on existing results or repeatedly query a dataset with low latency.
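To make the iteration point concrete, here is a minimal sketch in plain Python. It is not Spark code and does not use Spark's API. It only contrasts an iterative job that rereads its input from disk on every pass (roughly how a chain of MapReduce jobs behaves) with one that loads the data once and keeps it in memory. All function names and the toy dataset are hypothetical and chosen for illustration.

```python
import json
import os
import tempfile


def write_dataset(path, points):
    """Write (x, y) pairs to disk, one JSON array per line."""
    with open(path, "w") as f:
        for x, y in points:
            f.write(json.dumps([x, y]) + "\n")


def read_dataset(path):
    """Read every (x, y) pair back from disk."""
    with open(path) as f:
        return [tuple(json.loads(line)) for line in f]


def gradient_step(points, w, lr=0.01):
    """One gradient-descent step for fitting y ~ w * x by least squares."""
    grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
    return w - lr * grad


def fit_rereading_disk(path, iterations):
    """Reload the dataset from disk on every iteration."""
    w, reads = 0.0, 0
    for _ in range(iterations):
        points = read_dataset(path)
        reads += 1
        w = gradient_step(points, w)
    return w, reads


def fit_cached(path, iterations):
    """Load the dataset once, then iterate over the in-memory copy."""
    points = read_dataset(path)
    w = 0.0
    for _ in range(iterations):
        w = gradient_step(points, w)
    return w, 1


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "points.jsonl")
        write_dataset(path, [(x, 3 * x) for x in range(1, 6)])
        print(fit_rereading_disk(path, 200))
        print(fit_cached(path, 200))
```

Both functions arrive at the same answer. The difference is how often the input is read, which is the cost an in-memory design aims to avoid for iterative workloads.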

Spark is also quite popular among web companies. It’s used by Yahoo, Airbnb, ClearStory Data and others, and more than 20 companies have contributed code to the project, according to its Apache page.

Shark is shorthand for “Hive on Spark,” which really means it’s a data warehousing framework compatible with Apache Hive but designed to run atop Spark rather than Hadoop MapReduce. Hive has become very popular as the de facto method of running SQL-like queries over data stored in Hadoop, but recently Hadoop vendors Cloudera and Hortonworks have undertaken their own efforts to either speed up Hive (which is slow because it relies on MapReduce) or eliminate it altogether for interactive queries. The Shark team claims it’s up to 100 times faster than Hive when running in memory.

It’s important to note, though, that Spark isn’t really an alternative to Hadoop as much as it is an alternative to MapReduce (and to Hive, with Shark). Many, many companies are already storing their data in the Hadoop Distributed File System, and Spark is designed to be compatible with HDFS. Especially with the advent of Apache YARN and Apache Mesos (another AMPLab creation), it’s very possible Spark could run alongside Hadoop MapReduce or Hive in the same cluster.

The interesting thing to watch, though, will be how competitive Databricks ends up being with Hadoop vendors such as Cloudera, Hortonworks and MapR. I seriously doubt it wants to get into the business of managing and supporting big data clusters from the servers up, but Databricks certainly could ding licensing and support revenues on Cloudera Impala and other non-MapReduce processing frameworks for Hadoop. If companies have yet to make the big leap into the Hadoop pool, it’s conceivable they could opt to go with a Spark-based stack from the get-go.

But we’ll see for sure what’s up when Databricks takes the wraps off its software in the next few months.

Feature image courtesy of Shutterstock user Dabarti CGI.

Update: This post was updated at 6:16 a.m. to correct Ion Stoica’s title. He is co-founder and CTO of Conviva.

  1. There is another open source project that aims to beat Hadoop: https://stratosphere.eu/ started out as a research project at TU Berlin.
    It has a novel model that allows for more operators than just map and reduce (it also natively supports join, cross and more). It additionally allows for arbitrarily complex job graphs, so you can combine these operators in any way you like. For example, you could have three inputs that are joined, reduced, mapped and reduced again (by another key). You can even write to as many outputs as you want.
    Like Spark, Stratosphere also supports iterative algorithms (often needed for data mining and machine learning). Since this is built “natively” into the system, Stratosphere does way better on those jobs than traditional Hadoop systems.
    I’d say that Spark is a bit better for small and medium-sized workloads, while Stratosphere is designed more for very large big data tasks.

    There is an actively developed open source version of it on GitHub: https://github.com/dimalabs/ozone

  2. Also check out Apache Drill, an alternative that provides interactive query speeds on large datasets: http://incubator.apache.org/drill
