Summary:

While the rest of the Hadoop world is trying to distance itself from Hive with new interactive engines, Hortonworks is trying to make it faster. It might actually be a sound strategy.

Hortonworks isn’t about to get off the Apache Hadoop elephant just because everyone around it is now trying to ride impalas. The company released version 1.3 of its Hortonworks Data Platform on Wednesday, a major aspect of which is an improved iteration of Apache Hive that the company claims runs 50 times faster than the previous version. Over the next year or so, Hortonworks expects to improve Hive's speed to 100 times its previous limits, even as its competitors are all but leaving Hive in the dust in favor of newer, faster analytic systems.

If you’re unfamiliar with Hive, it’s a project that Facebook developed in 2008 to make Hadoop function more like a traditional enterprise data warehouse. Hive stores data inside the Hadoop Distributed File System in a structured format, and then allows users to query it using a language very similar to SQL. Until very recently, Hive has been the de facto method for querying (in a traditional sense) data stored in Hadoop, and it has proven immensely popular as more companies have begun tackling their big data woes with Hadoop.
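To make the idea concrete, here is a minimal HiveQL sketch of the pattern described above: a table defined over files already sitting in HDFS, queried with SQL-like syntax that Hive turns into distributed jobs. The table and column names are hypothetical, not taken from the article.

```sql
-- Hypothetical example: define a table over delimited files in HDFS,
-- then query it with SQL-like syntax. Hive compiles the query into
-- MapReduce jobs that run across the cluster.
CREATE EXTERNAL TABLE page_views (
  user_id   BIGINT,
  url       STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- Count views per URL; behind the scenes this becomes a batch job,
-- which is why Hive queries have traditionally been slow to return.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```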

Hive wasn’t built for speed

However, as more companies got used to Hadoop, they also began to notice its shortcomings. One of them is MapReduce, a powerful but not-exactly-speedy method of processing data that requires running a job across every node in the cluster in order to find the right data. Although the Hive interface is that of a SQL query, it relies on MapReduce as the processing engine.

(For more on how Hadoop and its flavor of MapReduce came to be, read this post on the history of Hadoop. To see me speak with Google Fellow and MapReduce creator Jeff Dean about how far Google has moved from a MapReduce-centric computing model, come to Structure next month.)

Users wanted faster, more-interactive query processing on top of Hadoop, similar to what they had grown accustomed to with data warehouse systems such as Teradata, Greenplum and Netezza. Hadoop vendors such as Cloudera (with Impala), MapR (with Drill), IBM (with Big SQL) — as well as a spate of startups — have obliged with their own new technologies that in various ways blend the familiarity of SQL with the scalability of Hadoop. EMC Greenplum, now Pivotal, has transplanted its existing database system inside of Hadoop.

Even Qubole, a cloud-based startup from Hive creators Ashish Thusoo and Joydeep Sen Sarma, is keeping an eye on how projects such as Impala and Shark (from the University of California, Berkeley’s AMPLab) might factor into its plans.

Giving Hive a better “Stinger”

Hortonworks, the Yahoo spinoff dedicated to driving the Apache Hadoop bus, is sticking with Hive. But it has a plan, and a point.

Essentially, VP of Products Bob Page told me during a recent briefing, “It just makes more sense from our view to have everything done in one place.” He means that Hive is already the method by which most people are comfortable using SQL to access Hadoop data, so there’s no use rocking the boat by adding yet another technology into the mix. Hortonworks will just make Hive faster to the point (100x) where it’s at least in the ballpark of what these entirely new systems are capable of doing, but where users still use the same tools for interactive and batch queries.

It has in place a three-phase plan, under the “Stinger” codename, to make this happen. The first phase, now available as part of the Hive 0.11 release, is a new set of analytic functions and a columnar file format that Page says has resulted in a 50x performance increase over the previous version. The next phase is to move Hive off of MapReduce and onto a still-under-development, YARN-based processing framework called Tez.
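The two phase-one features mentioned above can be sketched in HiveQL. The table below is hypothetical, but `STORED AS ORC` and windowing functions such as `RANK() OVER (...)` are real additions in Hive 0.11:

```sql
-- Hypothetical sketch of the two Stinger phase-one features:
-- the ORC columnar file format and new SQL analytic (windowing) functions.
CREATE TABLE sales_orc (
  region  STRING,
  amount  DOUBLE,
  sold_at TIMESTAMP
)
STORED AS ORC;  -- columnar layout, part of the cited 50x speedup

-- Rank sales within each region using a windowing function,
-- one of the analytic capabilities added in Hive 0.11.
SELECT region, amount,
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
FROM sales_orc;
```

Columnar formats like ORC help analytic queries because a query that touches only a few columns can skip reading the rest of each row from disk.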

“You’ll see phase two come to bear later this year,” Page said, once YARN (a new resource manager that lets Hadoop clusters run multiple processing engines simultaneously) is ready for production.

The third phase is a whole new vectorized query engine for Hive and new tools for intelligent query planning. Page didn’t have a target date in mind for that phase, except to note that “we’re not talking about a five-year cycle.”

SQL isn’t the end game for Hadoop

It would be easy to dismiss Page’s and Hortonworks’ optimism about Stinger as a sweet lemons type of rationalization — the company was founded around Apache Hadoop and can’t really go about developing entirely new products outside that foundation — but they also appear to have their eyes focused on a future where SQL isn’t too big a differentiator.

SQL is the way folks who have worked with data for the last 30 years can see how Hadoop fits into their environment, Page said, but the compelling thing about Hadoop “is it really unlocks a new way about how one thinks about storing and processing data.” Once YARN is ready to go, he added, there will be new avenues of innovation in areas like graph analysis and stream processing.

Page comes from a place of credibility when he talks about this evolution in thinking. Before coming to Hortonworks in March, he was vice president of analytics platform and delivery at eBay, a company that knows its way around big data. When people get all their data in one place, they want to do more things with it, he explained. The thinking becomes less about using Hadoop to lower cost and more about “How do I use Hadoop to increase my top line?”

Besides, Page noted (echoing the sentiment of just about everybody else in the Hadoop space, including Cloudera CEO Mike Olson), even as companies turn Hadoop into their primary data store, it’s difficult to see Hadoop ever entirely replacing high-value relational data warehouse systems like Teradata. One could argue, then, that there’s no real purpose in trying too hard to match those systems in terms of capabilities.

At eBay, he said, the company ran an in-depth analysis to see whether it was economically or technologically feasible to collapse its big data workloads onto a single system. eBay has dozens of petabytes stored in Hadoop and possibly more within various Teradata appliances. The result: “We just couldn’t find a way in which we could justify collapsing everything we do into one system.”

Feature image courtesy of Shutterstock user vblinov.
