Table of Contents
- Motivation and historical context
- Big data and the move to Hadoop
- Problems with Hadoop
- Why SQL on Hadoop?
- The emergence of SQL-on-Hadoop solutions
- Disruption vectors and methodology
- (Read) performance for interactive analytics
- Mind share
- Flexibility and compatibility
- Operational Hadoop
- Teradata (SQL-H)
- EMC Greenplum (HAWQ)
- Citus Data
- Splice Machine
- Concurrent (Lingual)
- Apache projects
- Apache Drill
- Hortonworks Stinger
- Conclusion and outlook
- Addendum: RainStor
- About Joseph Turian
Today’s most successful companies are the ones with the ability to capture and analyze all data available to them. Successful and efficient use of big data, after all, can improve decision making across the spectrum, from huge strategic questions like “Should we enter this new market?” to fine-grained, execution-oriented questions like “How should this SKU be priced?”
However, as is widely known, the sheer volume of data any given company can produce has become unmanageable and is therefore of little value to the business. Hadoop promised to alleviate this problem. To some degree it has, but the framework has also spawned many subtle issues that were not initially foreseen. Many companies found it relatively easy to get their data into Hadoop but hard to get that data out and use it.
This enormous gap in access to big data stored in Hadoop has prompted an avalanche of vendors to offer SQL-on-Hadoop solutions, which make Hadoop more accessible and allow organizations to reuse their investment in learning SQL. Most business analysts know SQL, and many nontechnical staff without a programming background can write SQL and use traditional business intelligence (BI) tools like Tableau, MicroStrategy, and Business Objects to query data.
In this report, we explore the competitive areas — which we call “disruption vectors” — where vendors will strive to gain an advantage in the SQL-on-Hadoop marketplace: read performance for interactive analytics, mind share, flexibility and compatibility, and operational Hadoop, which is sometimes referred to as “one database to rule them all.”
Key highlights from this report include:
- Cloudera will ultimately lead the SQL-on-Hadoop market, even though it is currently playing catch-up and has a lot of code to write and test. Its open-source solution, Impala, has strong mind share and an emphasis on flexibility.
- Close followers are closed-source vendors that are building off mature technology underpinnings: Hadapt, enterprise database vendors EMC (Greenplum) and Teradata, and Citus Data.
- The underdogs are little-known closed-source vendors that are writing everything from scratch and are betting on unconventional technological decisions: JethroData and Splice Machine. Their technology bets are risky but, if they prove clearly superior, could allow them to stage a market coup.
- Waiting in the wings are the Apache Software Foundation project Drill and the Hortonworks initiative Stinger, which aims to make Apache Hive faster. These SQL-on-Hadoop projects are native to the Hadoop ecosystem. They are the least mature but have solid momentum, and if they prove good enough, they could attract customers away from commercial vendors.
- As a part of our market analysis, we discuss many of the key issues facing decision makers and CIOs who wish to implement SQL on Hadoop successfully within their organization.