5 Comments

Hortonworks is making progress on its mission (via a project called Stinger) to speed up SQL-like queries in Hadoop using Apache Hive. New features in the latest version of Hortonworks’ Hadoop distribution have improved Hive performance tens of times in some instances, and the company is aiming for 100x improvements soon. Hortonworks has also added support for new types of SQL data. Competitor Cloudera opted to forgo Hive in favor of its own Impala technology for interactive queries.

  1. Who is going to win out in this space… Cloudera or Hortonworks… and why?

  2. Derrick, I would like to clarify something: it is incorrect to say that Cloudera forwent Hive. Cloudera continues to actively commit code to and support Hive (not to mention that Hive was the brainchild of Jeff Hammerbacher, who is one of Cloudera’s co-founders). What we decided (based on reading the Tenzing/Dremel tea leaves from Google) is that there are two different design points in this space: (1) a system focused on high-throughput, long-running batch jobs (that is Hive/Tenzing), and (2) a system focused on low-latency interactive queries (that is Impala/Dremel). We continue to support both systems; they are both important, but for different kinds of workload goals. The design points are very different, though. It is like designing a race car, which is really about finishing the race as quickly as possible at the highest speed regardless of fuel consumption (or wear and tear on the car), versus designing a commuter car, which is really about traveling very long distances at a consistent speed with the most efficient fuel consumption (and a low probability of breakage). It is very hard to design a car that meets both of these design goals at the same time, and similarly it is very hard to design a query system that can handle both low-latency interactive workloads and long-running, high-throughput jobs. The Stinger initiative will be able to significantly improve the bandwidth of Hive jobs (which will benefit Cloudera’s distro), but I doubt they will be able to get the latency of Hive to be competitive with Impala.

    Thus, it is my [biased] opinion that with Cloudera you are guaranteed to get the best of both worlds.

    Cheers,

    — amr

    1. Agreed with Amr, this is an inaccurate statement.

      Impala is for interactive BI on Hadoop via SQL; Hive is for batch processing with SQL. Customers need both. Cloudera is committed to both. See our recent work on HiveServer2 and Sentry for Hive as a demonstration of that commitment:
      http://www.cloudera.com/content/cloudera/en/about/press-center/press-releases/release.html?ReleaseID=1840594
      http://blog.cloudera.com/blog/2013/07/how-hiveserver2-brings-security-and-concurrency-to-apache-hive/

      A few key points from the above:
      * Cloudera contributed HiveServer2 to unblock security and concurrency in Hive for the first time.
      * We started development on HiveServer2 *after* starting the Impala project because we remain committed to Hive as a critical batch SQL processing engine for CDH.
      * Cloudera developed Sentry to provide authorization (fine-grained role-based access controls) for Hive in addition to the authentication built into HiveServer2.
      * The Cloudera Hive team continues to grow in size.

      Matt

    2. Derrick Harris Monday, September 9, 2013

      Amr, thanks for the comment. I certainly didn’t mean to suggest Cloudera doesn’t support Hive, which is why I wrote “decided to forgo … for interactive queries.” Hive isn’t going anywhere anytime soon, and I know Cloudera knows that.

    3. Easy there, Amr. I don’t think your commitment to Hive needs to be supplemented with blatant inaccuracies about its birth. Jeff found out the name of the project from a commit message, and he was surprised it was a SQL runtime and not just a metadata management system. Enough said.
