
Summary:

There was a lot of news about Spark’s ascension in the big data ranks this week, as well as some speculation. According to Cloudera’s Mike Olson, his company is widely embracing Spark — including to run Hive — but not in place of Impala.

Despite some speculation over the past few days about what it means that Cloudera wants to port the Hive SQL-on-Hadoop engine onto the Spark processing framework, Cloudera Co-founder and Chief Strategy Officer Mike Olson says nothing much has changed. Well, nothing has changed with regard to Cloudera’s Impala product, that is. There’s actually quite a bit happening elsewhere in the Hadoop and Spark ecosystems.

Simply put, Olson said Impala is the future of interactive SQL queries on top of Hadoop as far as Cloudera is concerned. “Impala is flat-out faster than the fastest thing Hortonworks or anyone else has ever done with Hive,” he said.

Cloudera, along with IBM, MapR and Spark startup Databricks, is working to port Hive onto Spark as an acknowledgement that Hive workloads are still very important to the company’s customer base and that “running on MapReduce, Hive really, really sucks.” But, Olson added, Hive was built to be a batch-processing engine atop MapReduce, and even though it will run faster on Spark or the Hortonworks-driven Apache Tez framework, it will still be a batch job.

(Actually, he added, Cloudera et al. are committed to moving pretty much every existing MapReduce workload onto Spark, including stuff such as Sqoop and Pig. Spark is “light years better,” he noted, and “we think it will succeed MapReduce in most instances.”)

The Spark stack. Source: Databricks

Some might be asking where Shark — the Spark subproject whose name is a mashup of Spark and Hive — fits into this. Olson confirmed (actually, he pointed to a Spark Summit keynote by Databricks’ Patrick Wendell) that Databricks will sunset Shark after the next Spark release, opting instead to focus its efforts on a project called Spark SQL that the company announced in April.

Around that time, Databricks CEO Ion Stoica told database industry analyst Curt Monash the same, although he also mentioned plans to continue developing an interactive engine called BlinkDB. “[I]f I were to redraw [the Spark stack diagram], SparkSQL will replace Shark, and Shark will eventually become a thin layer above SparkSQL and below BlinkDB,” Stoica told Monash.

Olson didn’t mention BlinkDB (although, admittedly, I didn’t ask) but he said he’s not thrilled with the idea of Spark SQL. He acknowledged that Databricks is a smart company and will likely do a competent job with Spark SQL, but added that moving Hive onto Spark is a fast process while Spark SQL is still a work in progress.

“I would rather see those guys put all their efforts into other things,” he said. “… I think Hive on Spark is going to be pretty good.”

Tuesday, September 2, 2014

2 Comments

  1. Wow, Mike is being controversial.

  2. I think if Impala doesn’t go into the Apache Foundation soon and get adopted by Hortonworks, it’ll become a sidelined product. Even if it is a bit faster, nobody wants to run something that only works with one of the vendors. MapR is shipping Impala along with just about everything else from the ecosystem, but I don’t see that being enough without Apache Foundation rights to modify and improve it in unity to become the standard SQL-on-Hadoop solution.