
Summary:

EMC Greenplum rolled out a new Hadoop distribution that fuses the popular big data platform with its flagship MPP database technology. Co-founder Scott Yara thinks the company’s huge investment puts it in the catbird seat among Hadoop vendors.

If, like many industry watchers, you’ve been confused about EMC Greenplum’s Hadoop strategy over the past couple years, Scott Yara has a message for you: “We’re all in on Hadoop, period.”

Yara, Greenplum’s co-founder and senior vice president of products, has a not-so-coded message for his big data market competitors, too. Put simply, he doesn’t think they stand a chance against his company, and he served notice on Monday morning with the unveiling of the company’s new Pivotal HD Hadoop distribution and Project Hawq in a staged event at San Francisco’s Dogpatch Studios.

Pivotal HD is a completely re-architected Hadoop distribution that has been natively fused with Greenplum’s analytic database (that’s the Project Hawq part), but Yara thinks it’s a bigger deal than just another SQL-on-Hadoop play. In an interview last week, Yara told me that Project Hawq is the manifestation of Greenplum’s decision to sell itself to EMC in 2010, a move he thought would kickstart his company’s founding vision of becoming the leading big data platform.

Building a data platform costs money, and lots of it

But before the details, a little history. Greenplum’s flagship product is an analytic database powered by a massively parallel processing (MPP) query engine. The company had raised nearly $100 million in venture capital around this technology since launching in 2003, but doing business in the enterprise software world is hard and expensive, and Greenplum needed more money.


Yara (left) with Pivotal Labs CEO Rob Mee and Om Malik at Structure:Data 2012 (c) 2012 Pinar Ozger. pinar@pinarozger.com

“I thought it was going to take another couple hundred million dollars in investment for us to complete the technical vision we had and go to market,” Yara explained. But finding that kind of money wasn’t so easy in an investment environment where everyone was gaga over social apps like Facebook and Zynga. When EMC approached with a deal like the one it gave VMware in 2003 (essentially near-complete independence bolstered by a huge R&D and marketing budget), Greenplum couldn’t refuse.

Yara said Greenplum had known for a while that Hadoop was the key to any big data strategy going forward, but that it would take some time to build up its own technology. So, in 2011, it entered into a reseller agreement with Hadoop startup MapR to offer a premium product to appease enterprise customers while Greenplum’s engineers got to work on what would become Pivotal HD. That deal with MapR is still in place, but it’s no longer the focal point of Greenplum’s Hadoop strategy.

Big investment, big aspirations

The technology inside Pivotal HD is what companies should come to expect from a Hadoop distribution, Yara explained. It’s essentially the Greenplum Database with its POSIX file system ripped out and replaced by the Hadoop Distributed File System. Whatever users can do on Greenplum’s flagship database, they can do on Pivotal HD, only they can run Hadoop MapReduce jobs and house an HBase database, too.


And when SQL-like features become an important part of Hadoop because it’s so broadly installed that users are now seeking out broader utility, “that’s when the bar gets raised in terms of the amount of capability that’s required,” Yara said. He said Pivotal HD includes years’ worth of investment in Hadoop cluster-management technology and professional support, too, and that they will cost half as much as what Cloudera and Hortonworks charge. It’s designed to run smoothly wherever customers want it to: on physical servers, virtual servers or even cloud servers.

Because they’re so new, he said, competitive SQL-on-Hadoop offerings such as Cloudera’s Impala can only handle about 20 percent of real-world workloads. Looking back at the capital investment in analytics and big data technologies past, things like Netezza, Teradata and Aster Data, Yara proffered, “I don’t think you could build [a full SQL-on-Hadoop] system for less than $25 to $50 million over five years.” (Some of those new technologies, by the way, will have a chance to state their cases during a Structure:Data panel on March 21 that’s all about Hadoop as the next-generation business intelligence platform.)

Greenplum, by contrast, rebuilt its entire R&D team to focus on bringing 10 years of database technology to Hadoop. “We literally have over 300 engineers working on our Hadoop platform,” Yara said. “… We’re bringing all the power of EMC and VMware behind it.”

The data warehouse is the new mainframe

Looking past his competitive boasting, though, it’s easy to see Yara’s greater point when you ask him what all this Hadoop talk means for the data warehouse business on which Greenplum was built. He points to the mainframe business that fell from its high perch decades ago but still drives billions a year in revenue. A single MPP database system is still faster on certain workloads than SQL on Hadoop, but that gap will close over time and “I do think the center of gravity will move toward HDFS,” he said.

Josh Klahr, a Pivotal HD product manager, noted the importance of being able to process all of a company’s data right in a single scalable data store rather than operating numerous systems. He pointed to one customer that’s storing a petabyte of data in Greenplum Database but wants to grow its data volume to 20 petabytes over the next few years and needs something like Hadoop to do that both financially and technically. He said Netflix’s decision to store all its data in Amazon S3 and bring analytic services to it is a good indicator of where the market is headed.

A few years ago, Yara acknowledged, embracing Hadoop as the future might have been a scary proposition. However, he said, “Now, if you don’t embrace Hadoop as the new database platform, if you’re a database vendor, that’s a grave mistake.”

  1. This is really awesome !!!

    EMC way ahead in bigdata and hadoop than competitors.

  2. This looks like a super closed version of CitusData

  3. Finally everyone coming to realize: the database / Hadoop connector is not the right approach. You have to bring the database technology directly to Hadoop, and make Hadoop your central platform.

    –> Cloudera/Impala, EMC/Project Hawk.

    The guys who pioneered this approach years ago = Hadapt. See the comments from Daniel Abadi: “The SQL-directly-on-Hadoop approach is definitely the way forward, and connectors will eventually die away” over at Monash Research: http://www.dbms2.com/2013/02/25/greenplum-hawq-pivotal-hd/

  4. Good stuff. We built and shipped V2R1 of Teradata on Unix in 1995 and it’s still the world’s fastest data warehouse. It takes many years, 100s of millions of dollars and 300-500 qualified engineering team to accomplish this feat. Point solutions are a dime a dozen, but truly RAS and scalable platforms come around once every 10 years and take huge efforts. Kudos to EMC for trying…

  5. @cameron – you’re living in your past bro..no one is going to build 5 year long EDWs anymore.

    come out of your denial..

  6. Internally there are only 4 guys working on EMC product.

