
Summary:

Big data startup MapR is now an official corporate contributor to the Apache Hadoop project, a somewhat interesting turn of affairs given its corporate mission to lure users away from Apache’s Hadoop Distributed File System. However, other companies commercializing Hadoop should follow its lead.


Big data startup MapR is now an official corporate contributor to the Apache Hadoop project, a somewhat interesting turn of affairs given its corporate mission to lure users away from Apache’s Hadoop Distributed File System. Although this might seem like an odd partnership — even more so now after EMC announced MapR as the storage foundation for its Apache Hadoop alternative — it demonstrates the type of cooperation that I think will be necessary for Hadoop to become the IT behemoth that many think it can be.

There’s a widely held and perfectly sensible belief that Hadoop can become as important to IT departments as relational databases are, once every company grasps the importance of advanced analytics and big data strategies become ubiquitous. But that can’t happen if the various pieces of Hadoop aren’t where they need to be in terms of enterprise capabilities. What MapR appears to get is that Hadoop isn’t just the sum of its parts — the project is, in fact, all about parts. The two main parts are MapReduce and HDFS, but there also are related projects such as HBase, Hive, Pig and ZooKeeper, and they all are important in determining Hadoop’s ultimate fate.
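To make the parts concrete, here is the canonical word-count job written against the stock Apache MapReduce API, with MapReduce doing the computation and HDFS (or any compatible file system) holding the input and output. This is a minimal sketch of my own, not MapR's code; class names are illustrative, and the exact API varies slightly between Hadoop releases:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class); // safe here: summing is associative
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Notice that nothing in the job itself says "HDFS": input and output are just paths, which is part of what makes the file system underneath swappable in the first place.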

MapR thinks HDFS is inherently broken, so it developed its own proprietary software to replace it. It’s not alone; DataStax released a Hadoop distribution that replaces HDFS with the NoSQL data store Cassandra, and other companies are peddling HDFS replacements, too.
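What makes such swaps plausible is that Hadoop applications talk to storage through the abstract org.apache.hadoop.fs.FileSystem class, and the concrete implementation is selected by the URI scheme in the cluster configuration rather than hard-coded. Here is a minimal sketch of a storage-agnostic client; treat the commented-out maprfs:// URI as an assumption about how a replacement registers itself, since exact scheme names and configuration keys vary by vendor and Hadoop version:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListDirectory {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // The default file system comes from core-site.xml, e.g.:
    //   hdfs://namenode:8020/   for stock HDFS
    //   maprfs:///              (assumed scheme for a MapR-style replacement)
    // Uncommenting the next line would override it without touching client code:
    // conf.set("fs.default.name", "maprfs:///");
    FileSystem fs = FileSystem.get(conf);

    // The same calls work regardless of which implementation was loaded.
    for (FileStatus status : fs.listStatus(new Path(args[0]))) {
      System.out.println(status.getPath() + "\t" + status.getLen());
    }
  }
}
```

Because clients only ever see this interface, a vendor that implements it faithfully can slot in underneath MapReduce, HBase and the rest without breaking them, which is exactly the compatibility bet MapR and DataStax are making.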

But being a plug-and-play alternative to HDFS is a pointless crusade if the rest of the Hadoop pieces — especially MapReduce — aren’t up to par as well. That’s why, as MapR Chief Application Architect Ted Dunning explains in a blog post on the news, the company is contributing fixes for ZooKeeper, Pig, Flume and Hive, with HBase and MapReduce contributions planned for the next few months. Everyone innovating around Hadoop — from startup Hadapt to giant EMC — should seriously consider following MapR’s lead and getting involved with contributing back to Apache Hadoop, if they aren’t already.

One of the great things about Hadoop is that it’s open source, so companies can innovate within or on top of one component and still maintain compatibility with the other components that add a lot of value to the Hadoop experience. But being open source is also one of the bad things about Hadoop, because it means that release cycles and the addition of new features can sometimes lag. Right now, Cloudera, Yahoo and Facebook are shouldering much of the contribution burden (at least with regard to the more-popular projects). As important as Hadoop is to their businesses, they only have so much energy and so many resources to dedicate to Apache activities. Further, this means many of the improvements to the Apache Hadoop code will be based on work that was of particular importance to one of those companies but that might not have much relevance to mainstream Hadoop users.

More contributors mean more (presumably) great ideas to choose from and, ideally, more voices deciding which changes to adopt and which to leave alone. For individual companies, getting officially involved with Apache means Hadoop might evolve in ways that actually benefit the products they have built on top of Hadoop or designed to improve it. We’ve talked a lot about “the Hadoop wars” and whether the market can sustain so many distributions and products, but fragmentation and competition at various levels need not be detrimental as long as the core Hadoop pieces remain strong and users know they’re buying into a cohesive ecosystem regardless of which products and projects they choose to adopt.

I have approached MapR for comment and will update this post as necessary should it respond.

Image courtesy of Flickr user slideshow bob.

Comments:

  1. MapR wakes up and smells the coffee… Open source needs to be open. Fact is HDFS is working in a bulletproof way because of the community of developers. Yeah maybe it needs to go faster on the features side but bulletproof is what customers want and HDFS is out there now working – and getting better because of the Hadoop community.

  2. I’ve used HDFS in a production setting, and many words come to mind; none of them is “bulletproof” or anything like that. I have lost data due to HDFS bugs on many occasions. Furthermore, if you want to run HBase, you must use a non-official ASF version (Cloudera release, etc.) to get durability. Not to mention the lack of baked-in namenode reliability.

    Also, there are many hidden bombs in HDFS, such as the per-node block limit, the NN memory limit, the GC limits in NN and datanode. I’ve been using this for 2 years, and things are not getting better at any appreciable rate.

    1. Ryan, good points… that’s fair, and I’m sure some of the contributors will pile on to your comments… There is a bigger picture… OK, maybe “bulletproof” was a bit over the top, but look at the success of an early, community-supported open source project that has made it to production as rapidly as HDFS has. It’s built for a new use case, one that many are moving towards. Alternatively, the other option to “far from bulletproof” is “vaporware.”

      1. The MapR announcements are very short on details, but having actually tested the product, I can say that the benefits are amazing: MapR is both (a) much faster than Hadoop and (b) built on an amazing architecture that does not have the above-mentioned problems that HDFS has. It’s really tough because an outsider would see just papers, but there is some real, substantial code, and lots of testing, behind their statements. I would highly consider building a real-time store on top of MapR.

  3. Officially, there is only one distribution of Hadoop: Apache Hadoop. Everything else is a derivative work and cannot call itself “a distribution of Apache Hadoop”. Regardless of the merits of MapR, EMC’s MapReduce stack, etc, they are not Hadoop and should not be confused with it.

    The Hadoop team is clarifying this, and it’s something you should note in your coverage: whatever these things coming out are, they aren’t Hadoop, they can’t be called that, and they can’t claim 100% compatibility, given that even Apache Hadoop has regressions.

    see: http://wiki.apache.org/hadoop/Defining%20Hadoop

    SteveL (hadoop committer, etc).
