
Hadoop is the talk of the town when it comes to big data, but it’s not without faults that have some users begging for an alternative. Like many open source projects, it’s relatively unpolished and often requires a great deal of learning and much strenuous customization to make it ready for production deployments. These issues are nothing new, but they’re coming to light more frequently as Hadoop usage picks up across a broad spectrum of companies.

Two examples surfaced last week, when social-media analytics service Backtype bemoaned the hidden dark side of Hadoop on its tech blog, and when Basho, which offers MapReduce as a query feature of its Riak NoSQL database, discussed six reasons why MapReduce is a difficult tool. (MapReduce and Hadoop, of course, are not one and the same, but Hadoop MapReduce is one of the two core Apache Hadoop projects, alongside the HDFS file system, and some of the difficulties of MapReduce programming are universal across implementations.)

Backtype’s Nathan Marz went as far as to write of Hadoop: “It’s sloppily implemented and requires all sorts of arcane knowledge to operate it. We would be the first to try out a replacement for Hadoop if a viable alternative existed.” Basho was a bit more forgiving, concluding that “MapReduce is hard, but we’re not going shopping just yet.” Not that either really has a choice, though. When it comes to parallel processing of large amounts of unstructured data, especially where cost, open source code and maturity are concerns, MapReduce and, more specifically, Hadoop are about the best options going.
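For a concrete sense of that learning curve, here is roughly what the canonical word-count job looks like when written against Hadoop’s Java MapReduce API. This is a minimal sketch modeled on the word-count example from the Apache Hadoop tutorial, with illustrative class names and command-line input/output paths, not a production-ready job:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The "hello world" of MapReduce: count word occurrences across a corpus.
public class WordCount {

  // Map phase: emit (word, 1) for every token in every input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each distinct word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: wire the job together and point it at HDFS input/output paths.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The actual logic amounts to a few lines inside map() and reduce(); the rest is framework ceremony, and that overhead only grows once a real analysis has to be decomposed into a chain of MapReduce jobs.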

It’s this situation, a still-difficult toolset that has to serve a growing number of not-necessarily-savvy users, that has led to a recent onslaught of Hadoop startups and projects aiming to make it easier to get results out of Hadoop. We’ve covered the issue ad nauseam in both blog posts and reports (sub req’d), but one more short recap won’t hurt. Essentially, commercial Hadoop pioneer Cloudera continues to harden and integrate the whole suite of Apache Hadoop projects into a single, enterprise-grade distribution, but it’s being joined by a large number of vendors trying to abstract Hadoop and/or MapReduce functions behind a variety of targeted products. By mid-May, at least two more options will have been announced.

Hopefully, this new collection of products will help ease the learning curve and improve upon some of Hadoop’s present faults, but one of Hadoop’s greatest attributes is its open source nature. That’s why I think the real spike in Hadoop innovation has to come from within Apache, not just from software vendors whose improvements will come at a cost. Companies like Cloudera, Yahoo and Facebook are leading the charge right now, but there’s plenty of room for more contributors with their own unique fixes. That includes, as many commenters made clear, everyday users such as Backtype.

Comments

  1. Yes, you are totally right: Hadoop and MapReduce are different beasts. Hadoop is an Apache umbrella for several projects: ZooKeeper, MapReduce, HBase and HDFS are the major ones. MapReduce is an over-simplified distributed data-processing paradigm reinvented by Google back in the early 2000s.

    Hadoop MapReduce is sloppy, no doubt, but the world is still waiting for a less sloppy alternative.

  2. This could prove useful for companies dealing with Hadoop…
    http://www.pentaho.com/products/hadoop/apache/?fotm=y
