Can Yahoo, Cloudera and IBM Split the Hadoop Pot?

The Wall Street Journal (s nws) appears to have confirmed what we reported last month, which is that Yahoo intends to spin off its Hadoop engineering team into a separate business unit that would compete directly with Cloudera around optimizing and servicing the open-source Apache Hadoop tools. If true, the move almost certainly would be a moneymaker for Yahoo — Hadoop is white-hot right now and very few companies have as much Hadoop clout as does Yahoo — but claims of it being a billion-dollar business might be overstated.

Hadoop might be “the biggest movement in enterprise-software in years,” as Benchmark Capital’s Rob Bearden is quoted as saying in the Wall Street Journal article, but it’s not generating a whole lot of money right now. Cloudera’s Mike Olson has told me he sees Cloudera as a billion-dollar business, too, but at approximately 100 customers today, it’s a long way off from that mark. Yahoo’s plans as described would make Yahoo the third vendor operating in the same space, alongside Cloudera and IBM (s ibm), and fighting for what, right now, are only speculative customer dollars.

Assuming the Hadoop market can top a billion dollars, though, the toughest task for any Yahoo spinoff might be distinguishing itself from Cloudera. Both are active Apache Hadoop contributors, and both would offer, essentially, souped-up distributions of Apache Hadoop complemented by professional services. Technologically, the main points of innovation would have to be in proprietary tools that build on Apache Hadoop, because both companies now contribute heavily to that project and so have the same core components to work with.

Justin Borgman, co-founder and CEO of Hadoop-based database startup Hadapt, thinks both companies have their advantages. He noted that Cloudera definitely has the advantage in commercializing Hadoop, because it is already out in the field selling it, but Yahoo might have an innovation edge because of its intense internal use of the tools. Yahoo is rightly credited with contributing the majority of the code currently in Apache Hadoop, around 70 percent.

However, a couple of key former Yahoo employees — Amr Awadallah and Hadoop creator Doug Cutting — are now at Cloudera. Cloudera also employs Jeff Hammerbacher, who helped spur Facebook’s Hadoop efforts, which included creating the Hadoop data warehouse and query language Hive.

IBM also plans to offer a version of Apache Hadoop that will incorporate a variety of existing IBM scheduling, management and file-system technologies, and which is part of a larger big data strategy within IBM. Big Blue might be the dark horse in this discussion, as it has shown a willingness to invest billions in analytics research and acquisitions over the past few years. It also has been active in integrating Hadoop as a foundational element of higher-level products that target business users rather than cluster operators.

For Hadoop users and startups building tools atop Hadoop, though, more competition among distributions is only good news. Borgman noted Hadapt can utilize any Hadoop distribution that might come into the market, which means that its customers will reap the rewards of any improvements that Cloudera, Yahoo or IBM might bring to the table.
