What it means if Yahoo Hadoop spinoff doesn’t do distribution

It looks like all the speculation about how Yahoo’s Hadoop spinoff company, Hortonworks, will affect Cloudera and other companies providing Hadoop-based products might have been overblown. During a phone call earlier this week, Hortonworks CEO Eric Baldeschwieler told me the company is still figuring out its strategy around offering a Hadoop distribution, which could be good news for presumed competitors such as Cloudera.

The ambivalence appears tied to the company’s narrow focus on improving Apache Hadoop and making it the go-to distribution. Baldeschwieler said that Hortonworks’ core business model will be around offering support and services, as well as helping drive Apache to “bridge the gap between what [Hadoop] is and what it can be.” The latter goal, of course, means working hard to improve the core Apache Hadoop distribution to make it more scalable, reliable and generally flexible.

If Hortonworks doesn’t offer a distribution, it might be because it doesn’t want to waste resources. It would have to build its own distribution and then work within Apache to get any improvements built into that code, resulting in a doubling up of effort and a somewhat unnatural split of allegiances given Hortonworks’ professed support for Apache Hadoop. This is the same issue Yahoo was trying to avoid earlier this year when it discontinued its own distribution and recommitted all its efforts to Apache Hadoop. It looks now like that move was just setting the stage for the Hortonworks launch.

Already, Baldeschwieler said, a number of key features from Yahoo are slated to be included in upcoming Apache Hadoop releases. These features include a new MapReduce engine, federated storage for HDFS and a major improvement in how HBase interacts with HDFS. What all the work means, he explained, is that Apache Hadoop will be more stable, more scalable and more dynamic. In fact, he said, with the next scheduled release, developers will be able to use alternative processing frameworks besides Hadoop MapReduce.

Good news for some, bad for others

A Hortonworks focused entirely on Apache could be good news for Cloudera. In that case, Cloudera remains very much in its current position of integrating and hardening the suite of Apache Hadoop products into its own open source distribution, then selling services and management software on top of it. The big difference will be that Apache Hadoop will look a lot more appealing because it will have Hortonworks providing expert service. But Cloudera doesn’t really have to change its story.

A service-focused Hortonworks might not be so good for companies such as MapR, which are pushing proprietary or semi-proprietary Hadoop distributions. The fewer distributions there are and the more focused they are around Apache Hadoop, the less appealing outliers might look to users concerned about being locked into their vendor. Baldeschwieler says he thinks the market will be big enough for value-added distributions like the one MapR offers, but noted that Apache Hadoop has already proven itself within large enterprises and will continue to get better.

For example, he explained, Apache has been working hard to integrate some of the code that Facebook has introduced from its Hadoop deployment. At the time it announced its Hadoop distributions in May, EMC said its Community edition is based on Facebook’s code, but now Baldeschwieler has heard EMC is reconsidering that decision and might support the core Apache code instead. That hardly constitutes hard evidence, but it’s noteworthy because EMC is integrating MapR’s proprietary storage technology in its Enterprise edition release.

“What we don’t want to see happen,” Baldeschwieler said, “is the Hadoop market start to look like the Unix market in the ’80s.” The more support there is around Apache Hadoop, he explained, the less chance there is for a Unix-like lost decade of competing distributions before Linux came around in the ’90s and became the center of the non-Windows universe. He thinks Apache Hadoop is and should be the Linux of big data.

Whatever path Hortonworks takes, though, Baldeschwieler thinks all the action around Hadoop will make it very difficult for alternative technologies, such as Microsoft Dryad and LexisNexis’s HPCC Systems, to catch up. “I think they’ve got their work cut out for them if they want to compete with the Hadoop community,” he said. Because even if the companies involved are at odds, they’re still a very big community.

Feature image courtesy of Flickr user miheco.