Hadoop: Bigger than SpringSource, JBoss and MySQL combined?

Rob Bearden, new CEO of Hortonworks, the Hadoop startup that spun out of Yahoo in June 2011, knows a thing or two about making open source software profitable. He was COO at both JBoss and SpringSource leading up to their acquisitions by Red Hat and VMware, respectively, for a combined total of $770 million. And he thinks Hadoop, the Apache Software Foundation project for storing and processing big, unstructured data, has an opportunity to be bigger than those two companies and open source database MySQL combined.

Enterprises want Hadoop, badly

Bearden told me during a recent call he thinks the Hadoop market will be bigger than the open source application-platform and database markets because of how much value — and net new value — Hadoop brings to companies as they begin blazing their big data trails. Analytics is big business and only getting bigger as companies start trying to analyze data from entirely new sources such as sensors, social media and web pages, and Hadoop is at the core of most such efforts. He thinks Hadoop will be a billion-dollar market in two or three years.

Bearden points to the interest in Hadoop among large enterprises as proof of his thesis. “There’s not a Fortune 500 enterprise out there that doesn’t have three to five proofs of concept around Hadoop going on right now,” he said. The Fortune 200, he told me, is “absolutely” driving interest in Hortonworks’ suite of open source products and services. The story in every customer meeting is about exponentially growing volumes of unstructured data that are dwarfing structured data volumes, and those companies want Hadoop to be the answer.

But enterprise adoption needs a unified front


Despite enterprise interest, however, it will still take a lot of work from Hortonworks and its competitors, such as Cloudera and MapR, in order for Hadoop to fulfill its moneymaking potential.

For one, Bearden said, companies pushing Hadoop distributions need to remain as true as possible to the core Hadoop code as it’s maintained within Apache. This way, users don’t have to choose which version of Hadoop to use and risk being locked in, because every product builds on the same core pieces. The easier Hadoop is to adopt, the faster companies will adopt it and the more money they’ll be willing to spend on support.

Both Hortonworks and Cloudera peddle Hadoop distributions based entirely on Apache code, although Cloudera does have proprietary management software and the two package the various components differently. They compete for customers, Bearden said, but both companies “are trying to move the ball down the field within the same boundaries.”

Bearden is not a fan of Hadoop distributions that include proprietary components within the core storage and/or processing framework. “[Cloudera CEO] Mike [Olson] and I both agree that MapR is disingenuous in its approach,” Bearden said.

MapR sells a Hadoop distribution that includes a proprietary replacement for the Hadoop Distributed File System, and EMC bases its Greenplum MR offering on MapR’s M5 product. Presumably, Bearden’s also not a fan of new offerings like those from Fujitsu, which on Monday announced a new Hadoop distribution featuring its own proprietary file system. For what it’s worth, though, both investors and some users seem to think quite highly of MapR as a faster, more reliable alternative to Hadoop.

No dissension in the Hortonworks ranks

As I reported on Feb. 17, Bearden recently took over the role of Hortonworks CEO from co-founder and now-CTO Eric Baldeschwieler. My sources characterized the move as a referendum on Baldeschwieler’s leadership and strategy, but Bearden toed the company line, telling me it was just a matter of needing to scale both the technology and the business side in a hurry. Given that need, it makes more sense to have a technologist as CTO and an experienced open-source business mind as CEO, he said.

“The original thesis was that we could take the core components of Hadoop, add some functionality, and get that out the door in an easily consumable manner,” he told me, but “it has become very clear” that Hadoop actually has evolved from being a data repository into a data platform. It can be “the center of gravity for the next generation of data architecture,” he said, but that requires a lot of work both technologically and in developing a workable business model, because it means embracing a partner-heavy strategy.

One thing Hortonworks will not do — and something Cloudera’s Olson has told me before about his company — is move up the stack to start providing analytics applications or tuning Hadoop to specific use cases such as transaction processing. Those capabilities will have to come from partners.

Of course, for Hortonworks to become a truly meaningful part of the Hadoop market, it still needs to have a generally available product. Bearden said the first version of the Hortonworks Data Platform will be available within 60 to 90 days and will be updated with incremental improvements about every 45 days thereafter. Once the platform is open to the public, Bearden said, “there will be many ways to monetize that” in the form of customer support and even more technology partnerships like those Hortonworks has in place with Microsoft and Teradata.

You can learn a lot more about the future of Hadoop at our Structure:Data conference next month in New York, where executives from Hortonworks, Cloudera and many other Hadoop vendors will take the stage to talk about just that topic.

Image courtesy of Flickr user RachScottHalls.