Big data startup MapR is now an official corporate contributor to the Apache Hadoop project, a somewhat interesting turn of events given its corporate mission to lure users away from Apache’s Hadoop Distributed File System (HDFS). Although this might seem like an odd partnership, even more so now that EMC has announced MapR as the storage foundation for its Apache Hadoop alternative, it demonstrates the type of cooperation that I think will be necessary for Hadoop to become the IT behemoth that many think it can be.
There’s a widely held and perfectly sensible belief that Hadoop can become as important to IT departments as relational databases are, once every company grasps the importance of advanced analytics and big data strategies become ubiquitous. But that can’t happen if the various pieces of Hadoop aren’t where they need to be in terms of enterprise capabilities. What MapR appears to get is that Hadoop isn’t just the sum of its parts — the project is, in fact, all about parts. The two main parts are MapReduce and HDFS, but there also are related projects such as HBase, Hive, Pig and ZooKeeper, and they all are important in determining Hadoop’s ultimate fate.
MapR thinks HDFS is inherently broken, so it developed its own proprietary software to replace it. It’s not alone; DataStax released a Hadoop distribution that replaces HDFS with the NoSQL data store Cassandra, and other companies are peddling HDFS replacements, too.
But being a plug-and-play alternative for HDFS is a pointless crusade if the rest of the Hadoop pieces, especially MapReduce, aren’t up to par as well. That is why, as MapR Chief Application Architect Ted Dunning explains in a blog post on the news, the company is contributing fixes for ZooKeeper, Pig, Flume and Hive, with HBase and MapReduce contributions planned for the next few months. Everyone innovating around Hadoop, from startup Hadapt to giant EMC, should seriously consider following MapR’s lead and contributing back to Apache Hadoop, if they aren’t already.
One of the great things about Hadoop is that it’s open source, so companies can innovate within or on top of one component and still maintain compatibility with the other components that add a lot of value to the Hadoop experience. But being open source is also one of the bad things about Hadoop, because it means that release cycles and the addition of new features can sometimes lag. Right now, Cloudera, Yahoo and Facebook are shouldering much of the contribution burden (at least with regard to the more-popular projects). As important as Hadoop is to their businesses, they only have so much energy and so many resources to dedicate to Apache activities. Further, relying on so few contributors means many of the improvements to the Apache Hadoop code will be based on work that was of particular importance to one of those companies but that might not have much relevance to mainstream Hadoop users.
More contributors means more (presumably) great ideas to choose from and, ideally, more voices deciding what changes to adopt and which ones to leave alone. For individual companies, getting officially involved with Apache means that perhaps Hadoop will evolve in ways that actually benefit their products that are based upon or seeking to improve Hadoop. We’ve talked a lot about “the Hadoop wars” and whether the market can sustain so many distributions and products, but fragmentation and competition at various levels need not be detrimental as long as the core Hadoop pieces remain strong and users know they’re buying into a cohesive ecosystem regardless of which products and projects they choose to adopt.
I have approached MapR for comment and will update this post as necessary should it respond.
Image courtesy of Flickr user slideshow bob.