If, like many industry watchers, you’ve been confused about EMC Greenplum’s Hadoop strategy over the past couple years, Scott Yara has a message for you: “We’re all in on Hadoop, period.”
Yara, Greenplum’s co-founder and senior vice president of products, has a not-so-coded message for his big data market competitors, too. Put simply, he doesn’t think they stand a chance against his company, and he served notice on Monday morning with the unveiling of the company’s new Pivotal HD Hadoop distribution and Project Hawq in a staged event at San Francisco’s Dogpatch Studios.
Pivotal HD is a completely re-architected Hadoop distribution that has been natively fused with Greenplum’s analytic database (that’s the Project Hawq part), but Yara thinks it’s a bigger deal than just another SQL-on-Hadoop play. In an interview last week, Yara told me that Project Hawq is the manifestation of Greenplum’s decision to sell itself to EMC in 2010, a move he thought would kickstart his company’s founding vision of becoming the leading big data platform.
Building a data platform costs money, and lots of it
But before the details, a little history. Greenplum’s flagship product is an analytic database powered by a massively parallel processing (MPP) query engine. The company had raised nearly $100 million in venture capital around this technology since launching in 2003, but doing business in the enterprise software world is hard and expensive, and Greenplum needed more money.
“I thought it was going to take another couple hundred million dollars in investment for us to complete the technical vision we had and go to market,” Yara explained. But finding that kind of money wasn’t so easy in an investment environment where everyone was gaga over social apps like Facebook and Zynga. When EMC approached with a deal like the one it gave VMware in 2003 (essentially near-complete independence bolstered by a huge R&D and marketing budget), Greenplum couldn’t refuse.
Yara said Greenplum had known for a while that Hadoop was the key to any big data strategy going forward, but that it would take some time to build up its own technology. So, in 2011, it entered into a reseller agreement with Hadoop startup MapR to offer a premium product to appease enterprise customers while Greenplum’s engineers got to work on what would become Pivotal HD. That deal with MapR is still in place, but it’s no longer the focal point of Greenplum’s Hadoop strategy.
Big investment, big aspirations
The technology inside Pivotal HD is what companies should come to expect from a Hadoop distribution, Yara explained. It’s essentially the Greenplum Database with its POSIX file system ripped out and replaced by the Hadoop Distributed File System. Whatever users can do on Greenplum’s flagship database, they can do on Pivotal HD, only they can run Hadoop MapReduce jobs and house an HBase database, too.
And when SQL-like features become an important part of Hadoop because it’s so broadly installed that users are now seeking out broader utility, “that’s when the bar gets raised in terms of the amount of capability that’s required,” Yara said. He said Pivotal HD includes years’ worth of investment in Hadoop cluster-management technology and professional support, too, and that the offering will cost half as much as what Cloudera and Hortonworks charge. It’s designed to run smoothly wherever customers want it to: physical servers, virtual servers or even cloud servers.
Because they’re so new, he said, competitive SQL-on-Hadoop offerings such as Cloudera’s Impala can only handle about 20 percent of real-world workloads. Looking back at the capital investment in analytics and big data technologies past, things like Netezza, Teradata and Aster Data, Yara proffered, “I don’t think you could build [a full SQL-on-Hadoop] system for less than $25 to $50 million over five years.” (Some of those new technologies, by the way, will have a chance to state their cases during a Structure: Data panel on March 21 that’s all about Hadoop as the next-generation business intelligence platform.)
Greenplum, by contrast, rebuilt its entire R&D team to focus on bringing 10 years of database technology to Hadoop. “We literally have over 300 engineers working on our Hadoop platform,” Yara said. “… We’re bringing all the power of EMC and VMware behind it.”
The data warehouse is the new mainframe
Looking past his competitive boasting, though, it’s easy to see Yara’s greater point when you ask him what all this Hadoop talk means for the data warehouse business on which Greenplum was built. He points to the mainframe business that fell from its high perch decades ago but still drives billions a year in revenue. A single MPP database system is still faster on certain workloads than SQL on Hadoop, but that gap will close over time and “I do think the center of gravity will move toward HDFS,” he said.
Josh Klahr, a Pivotal HD product manager, noted the importance of being able to process all of a company’s data right in a single scalable data store rather than operating numerous systems. He pointed to one customer that’s storing a petabyte of data in Greenplum Database but wants to grow its data volume to 20 petabytes over the next few years and needs something like Hadoop to do that both financially and technically. He said Netflix’s decision to store all its data in Amazon S3 and bring analytic services to it is a good indicator of where the market is headed.
A few years ago, Yara acknowledged, embracing Hadoop as the future might have been a scary proposition. However, he said, “Now, if you don’t embrace Hadoop as the new database platform, if you’re a database vendor, that’s a grave mistake.”