Summary:

Cloudera is touting the speed of its Impala query engine compared to Hive and a leading relational database system, but those aren’t really apples-to-apples comparisons. The real question is how all the SQL-on-Hadoop options stack up against one another.

Source: Cloudera

Hadoop vendor Cloudera is singing the praises of its own SQL query engine, releasing on Monday the results of a benchmark that shows how Cloudera Impala compares to Apache Hive and a mystery proprietary database. As one might expect, Impala easily bested its competitors in the benchmarks (no vendor has ever, to my knowledge, released results highlighting its product’s inferiority), but Hive and SQL databases probably aren’t Impala’s real rivals.

Its more direct competition comes from other Hadoop vendors doing their own things to make Hadoop queries faster and more interactive. Because the choice right now isn’t to Hadoop or not to Hadoop; it’s which flavor of Hadoop to do. Companies that are using Hive are already using Hadoop, so that decision has been made. And even Cloudera, unless its stance has shifted drastically, acknowledges that Impala isn’t yet a replacement for a purpose-built data warehouse or relational database system.

(Although a future where Hadoop vendors do actively try to upset the database market would be interesting. Maybe we’ll get a sense of how realistic that is during sessions with the CEOs of Cloudera, Hortonworks and Pivotal at Structure Data in March.)

Impala vs. “DBMS-Y.” Source: Cloudera

If some degree of interactive SQL querying is important to users, they’ll likely be comparing one Hadoop distribution to another on this front. So while Cloudera is smart to position the choice as being between Impala, Hive and DBMS-Y (“one of the top 5 commercial MPP query engines on the market,” a Cloudera spokesperson confirmed), the more relevant comparison is probably between Impala and the Hortonworks-backed Apache Stinger/Tez, Pivotal HD Hawq, Presto (on Qubole), the MapR-backed Apache Drill, Hadapt, IBM BigSQL, Shark … you get the point.

Everyone is faster than Hive; that’s the whole point of all of these SQL-on-Hadoop technologies. How they compare with each other is harder to gauge, and that determination is probably best left to individual companies to test on their own workloads as they make their own buying decisions. But for what it’s worth, here is a collection of benchmark tests showing the performance of various Hadoop query engines against Hive, relational databases and, sometimes, one another.

Impala vs. Hive

Source: Cloudera

Stinger/Tez vs. Hive

Source: Hortonworks. <a href="http://hortonworks.com/blog/3-reasons-try-stinger-phase-3-technical-preview/">Check out this blog post</a> for more details.

Pivotal HD Hawq vs. Impala and Hive

Source: Pivotal. <a href="http://www.gopivotal.com/sites/default/files/Hawq_WP_042313_FINAL.pdf">Check out this whitepaper</a> for more details.

Shark vs. Impala, Hive and Amazon Redshift

Source: AMPlab (UC-Berkeley). <a href="https://amplab.cs.berkeley.edu/benchmark/">Check out this blog post</a> for more details.

  1. You mention “those aren’t really apples-to-apples comparisons”. Could you elaborate a bit more on why you feel this is the case?

    1. Derrick Harris Monday, January 13, 2014

      Because Hive wasn’t built for this type of query, and SQL-on-Hadoop products at this point aren’t really a viable replacement for relational systems for many jobs, hence all the Hadoop-RDBMS partnerships/integrations.

      So, for now (although it could certainly change re: RDBMSes over time) the decision of which SQL-Hadoop tech to use really seems like a choice between those technologies/distributions and not between SQL-Hadoop *or* Hive or RDBMS.

      1. Does this mean that you aren’t optimistic about the performance competitiveness of ‘Stinger’ with the other SQL-Hadoop players?

        Remember, Hive 0.12 is Stinger Phase 2 (http://hortonworks.com/blog/announcing-apache-hive-0-12/), and the most recent published benchmarks for Phase 3 (http://hortonworks.com/blog/3-reasons-try-stinger-phase-3-technical-preview/) only show ~2.7x speedup over Phase 2 with warm Tez caches, which still wouldn’t bring it into the ballpark of DBMS-Y / Impala / etc.

        Hive 0.12 / Stinger is a SQL-on-Hadoop technology. So it seems a fair comparator for benchmarks.

  2. Cloudera’s post is an answer to Hortonworks’ (http://hortonworks.com/blog/3-reasons-try-stinger-phase-3-technical-preview/) in order to say “we’re still the best”.
    Hive 0.12 includes the Stinger Phase 1 and 2 enhancements; most other “vs. Hive” comparisons use previous, slower versions. Hortonworks points out that further speed improvements are to be expected (using Tez, warm containers, …).
    Scalability is an interesting point of Cloudera’s post, as Impala seems to scale (up to 36 nodes) through all scenarios (more hardware, data, user activity).

  3. Matt Brandwein Monday, January 13, 2014

    The significant milestone – and the reason we published these benchmarks – is that customers now have a viable open source alternative to proprietary MPP analytic databases, one that also delivers the core scalability, flexibility, and economic benefits of Hadoop.

    Impala’s comfortable and widening lead over even the latest versions of Hive only demonstrates the positive impact of a special-purpose query engine versus trying to tune/evolve a general-purpose framework.

    1. Derrick Harris Tuesday, January 14, 2014

      That’s a fair point, but are you suggesting that Impala is already an alternative for RDBMS workloads on a wide scale — like replace Vertica with Impala good? That would indeed be something, but I don’t recall a single Hadoop exec ever copping to that.

  4. A great question re: are open source SQL-on-Hadoop options viable alternatives to Vertica? We track performance characteristics of both the SQL-on-Hadoop and the RDBMS alternatives (Vertica, Netezza, Greenplum, etc.) based on both internal metrics and results from bake-offs with InfiniDB. From what we see, both Impala and InfiniDB performance characteristics overlap with the best-of-breed RDBMS alternatives, and when price is included the price/performance is heavily weighted in favor of the open source options. Note that this would be for analytic/reporting workloads, and not for transactional or semi-transactional workloads. The challenge for Impala remains in handling more complex SQL and more complex analytic queries.

    InfiniDB has sponsored an independent benchmark comparing Hive, Impala, Presto, and InfiniDB, done by Radiant Advisors. The report can be downloaded (with registration) at http://www.infinidb.co/blog/open-source-sql-hadoop-query-engines-benchmark

    A quick summary:
    InfiniDB supported syntax for all 10 queries, running between 1.28 and 17.62 seconds.
    Impala supported syntax for 7 of 10 queries, running between 3.1 and 69.38 seconds.
    Presto supported syntax for 9 of 10 queries, running between 18.89 and 506.84 seconds.
    Hive 0.11 supported syntax for 7 of 10 queries, running between 102.59 and 277.18 seconds.
    Hive 0.12 supported syntax for 7 of 10 queries, running between 91.39 and 325.68 seconds.
    Note that 3 of the 7 queries supported with Hive did not complete due to resource issues.

    A full presentation of the testing process, benchmark results, and interactive discussion is scheduled for Wednesday, April 23; register at https://www3.gotomeeting.com/register/961338942.

    1. Summary chart available here:
      http://bit.ly/1n93muT
