Yahoo spinoff shakes up Hadoop market with new distro

Hortonworks, the Hadoop company that spun out from Yahoo in June, is getting into the software space after all with Tuesday’s release of the open-source Hortonworks Data Platform. This is a smart business move by Hortonworks, which will need more than name recognition to displace current Hadoop elder statesman Cloudera as well as mega-vendors such as EMC, Oracle and IBM, all of which have their own Hadoop plans in place.

When I last spoke with Hortonworks CEO Eric Baldeschwieler in July, he wasn’t even certain the company would release software, perhaps opting instead to offer only support and services, but there has been a change of heart. Last week, Baldeschwieler told me Hortonworks’ thoughts evolved as it began to understand the market better, and it now realizes a distribution is actually critical to a successful support and services business.

Hortonworks’ goal is to make use of the Apache Hadoop big-data framework ubiquitous (presumably so it has a base of customers to work with). Doing so requires productizing the core Apache code to make it an easier and more robust experience, he said. What’s more, it’s a lot easier to support those customers when you know what version they’re running and what the potential issues are with that release.

All open source, all Apache

Enter the Hortonworks Data Platform, a completely free, open-source distribution that also includes a cluster-management tool called Ambari and a storage-management service called HCatalog. The latter aims to provide easy connection between Hadoop and other data-management systems, as well as interoperability across the Hadoop-specific processing languages of Hadoop MapReduce, Hive, Pig and Streaming. Like Cloudera’s Hadoop distribution, the Hortonworks version will incorporate popular Hadoop projects such as the NoSQL HBase database, ZooKeeper, Hive and Pig on top of MapReduce and the Hadoop Distributed File System.

Additionally, the Hortonworks Data Platform will include REST APIs that should ease the task for ISVs wanting to integrate their products with it, and Ambari will eventually include REST APIs for cluster management, Baldeschwieler said.

Baldeschwieler said Hortonworks will still conduct all of its engineering efforts within Apache, which will mean more frequent Hadoop patches, updates and releases, and, consequently, a steady stream of releases from Hortonworks. One of the major criticisms of Hadoop over the years, on top of concerns over complexity and performance, has been a rather slow release cycle.

However, both Hortonworks and Cloudera are confident they can pick up the pace. They think new features around HDFS performance and scale, as well as a next-generation MapReduce that will not only perform better but also support multiple processing engines besides MapReduce, will make Apache Hadoop competitive well into the future.

Can Hortonworks and Cloudera play nice (enough)?

As I explained last month, some think Hortonworks and Cloudera, although they’re now direct rivals on both the product and services fronts, will have to keep their egos in check enough to ensure that Apache Hadoop progresses like they say it will. Otherwise, both companies could start losing ground to upstarts such as MapR, which has garnered praise for its proprietary Hadoop distribution that it claims blows away HDFS on performance, as well as to enterprise-savvy vendors EMC, Oracle and IBM.

EMC has an open-source Hadoop product that uses code contributed by Facebook but isn’t currently integrated into Apache Hadoop, as well as an enterprise version for which it distributes the MapR software as part of its distro. IBM’s main Hadoop focus appears to be BigInsights, which buries Apache Hadoop underneath a slew of other analytics and visualization tools to create a product in which Hadoop is just a component.

Baldeschwieler characterizes the relationship between Hortonworks and Cloudera as “coopetition,” adding that it’s probably a better relationship than it was in previous years, ostensibly because of their shared vision that Apache Hadoop is the best bet for storing and processing mountains of unstructured data.

“The challenge for us as a community is to make it clear to the world that Apache Hadoop … is a very mature technology,” he said, one that has been used in production by everyone from Yahoo to JPMorgan Chase. “Whitepaper differentiation is not the same as having been in production at massive scale for a number of years.”

Feature image courtesy of Hortonworks; fighting elephants image courtesy of Flickr user