The fight for Hadoop dominance is officially on. The unveiling of Yahoo’s Hadoop spinoff Hortonworks will undoubtedly be the talk of today’s Hadoop Summit, but it’s not the only game in town. In fact, while Hortonworks is busy answering questions about its product strategy, Cloudera and MapR will demonstrate new versions of their distributions overflowing with bells and whistles.
I wrote yesterday about the importance of new tools designed to improve the Hadoop experience at a level above the distribution layer, but the distribution — the underlying code base that defines Hadoop’s core architecture and capabilities — is still king. Apache Hadoop is a set of open-source tools designed to enable the storage and processing of large amounts of unstructured data across a cluster of servers. Chief among those tools are Hadoop MapReduce and the Hadoop Distributed File System (HDFS), but there are numerous related ones, including Hive, Pig, HBase and ZooKeeper.
Most vendors try to differentiate their Hadoop distributions through how they handle MapReduce and HDFS. Some tweak the core Apache features and architectures, while others replace one component (generally HDFS) altogether.
EMC and IDC released their Digital Universe study this week, estimating that we’ll create 1.8 zettabytes of data this year and that data growth is outpacing Moore’s Law. Now that we’ve realized there’s value in all that information, we’re eager to capture, analyze and use it, and that requires more and better big-data technology. As this diagram from Karmasphere illustrates, Hadoop is a very large part of the big-data stack, which means we’re just getting started.
So many distributions, so little time
Cloudera. Cloudera, whose CDH was the first commercial Hadoop distribution, takes the full complement of available open-source components and integrates them into an enterprise-grade product. Its value lies less in “improving” Hadoop than in making everything from Hadoop MapReduce to its own Sqoop (SQL to Hadoop) tool work well together out of the box.
Cloudera actually released CDH version 3.5 recently, but today it added a set of new features to its Cloudera Enterprise product, a suite of management tools designed to make it easier to operate CDH clusters. The coolest has to be something called SCM Express, which makes getting started with Hadoop easier. Cloudera’s Charles Zedlewski explained that SCM Express is a free tool that lets users provision and launch a Hadoop cluster of up to 50 nodes “in about six clicks.”
MapR. Cloudera has plenty of company, however, including the brand-new MapR. The startup released its first two products today: a free Hadoop distribution called M3 and a paid distribution called M5. MapR takes the Cloudera approach of integrating the entire spectrum of Hadoop tools into its distribution and including management functionality, but it has also made a number of significant changes to the MapReduce and HDFS components to improve performance.
MapR’s Jack Norris says the result is “probably the most comprehensive distribution,” which performs two to five times faster than the standard Apache Hadoop. A majority of MapR’s changes are to the storage layer, which it has reworked to be faster, easier, more reliable and more scalable than HDFS.
You can’t talk about MapR without talking about EMC, which announced last month that the Enterprise Edition of its Greenplum HD Hadoop distribution will be “powered by MapR.” Norris explained to me that the product, available later this year, will utilize MapR’s M5 version, which includes advanced storage capabilities around high availability and data protection. However, EMC’s line of Greenplum HD distributions, which also includes a free Community Edition, is actually centered around the specialized Hadoop code developed by and running within Facebook.
Of course, Hortonworks isn’t to be discounted, nor are DataStax, with its Cassandra-based Brisk distribution, and IBM, which has been promising its own Big-Blue-style Hadoop distribution for some time. But the most interesting thing about all this Hadoop activity might be its pace: As of mid-March, Cloudera stood alone as a commercial Hadoop provider. Now it has four competitors, with more likely to come.
Feature image courtesy of Flickr user Joi.