Summary:

For better or worse, Hadoop has become synonymous with big data. In just a few years it has gone from a fringe technology to the de facto standard. But is the enterprise buying into a technology whose best day has already passed?

Hadoop is everywhere. For better or worse, it has become synonymous with big data. In just a few years it has gone from a fringe technology to the de facto standard. Want to be big data, enterprise analytics, or BI compliant? You had better play well with Hadoop.

It’s therefore far from controversial to say that Hadoop is firmly planted in the enterprise as the big data standard and will likely remain firmly entrenched for at least another decade. But, building on some previous discussion, I’m going to go out on a limb and ask, “Is the enterprise buying into a technology whose best day has already passed?”

First, there were Google File System and Google MapReduce

To study this question we need to return to Hadoop’s inspiration – Google’s MapReduce. Confronted with a data explosion, Google engineers Jeff Dean and Sanjay Ghemawat architected (and published!) two seminal systems: the Google File System (GFS) and Google MapReduce (GMR). The former was a brilliantly pragmatic solution to exabyte-scale data management using commodity hardware. The latter was an equally brilliant implementation of a long-standing design pattern applied to massively parallel processing of said data on said commodity machines.

GMR’s brilliance was to make big data processing approachable to Google’s typical user/developer and to make it fast and fault tolerant. Simply put, it boiled data processing at scale down to the bare essentials and took care of everything else. GFS and GMR became the core of the processing engine used to crawl, analyze, and rank web pages into the giant inverted index that we all use daily at google.com. This was clearly a major advantage for Google.

Enter reverse engineering in the open source world, and, voila, Apache Hadoop — composed of the Hadoop Distributed File System (HDFS) and Hadoop MapReduce — was born in the image of GFS and GMR. Yes, Hadoop is developing into an ecosystem of projects that touch nearly all parts of data management and processing. But, at its core, it is a MapReduce system. Your code is turned into map and reduce jobs, and Hadoop runs those jobs for you.
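
To make that concrete, here is a minimal, single-process sketch of the map/reduce contract in Python. It illustrates the programming model only — the function names and the word-count task are my own, not Hadoop's actual Java API — but it shows the three phases (map, shuffle, reduce) that every Hadoop job is ultimately boiled down to.

```python
from collections import defaultdict
from itertools import chain

# Map phase: each input record (here, a line of text) is turned into (key, value) pairs.
def map_fn(line):
    for word in line.split():
        yield (word.lower(), 1)

# Shuffle phase: the framework groups every value emitted for the same key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each key's values are folded into a final result.
def reduce_fn(key, values):
    return key, sum(values)

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    mapped = chain.from_iterable(map_fn(line) for line in lines)
    counts = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
    print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

Hadoop's value is everything around this contract — partitioning the input across a cluster, running the map and reduce phases on many machines in parallel, and restarting failed tasks — but the shape of the computation stays exactly this rigid.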

Then Google evolved. Can Hadoop catch up?

Most interesting to me, however, is that GMR no longer holds such prominence in the Google stack. Just as the enterprise is locking into MapReduce, Google seems to be moving past it. In fact, many of the technologies I’m going to discuss below aren’t even new; they date back to the second half of the last decade, mere years after the seminal GMR paper was in print.

Here are technologies that I hope will ultimately seed the post-Hadoop era. While many Apache projects and commercial Hadoop distributions are actively trying to address some of the issues below via technologies and features such as HBase, Hive and Next-Generation MapReduce (aka YARN), it is my opinion that it will require new, non-MapReduce-based architectures that leverage the Hadoop core (HDFS and Zookeeper) to truly compete with Google’s technology. (A more technical exposition with published benchmarks is available at http://www.slideshare.net/mlmilleratmit/gluecon-miller-horizon.)

Percolator for incremental indexing and analysis of frequently changing datasets. Hadoop is a big machine. Once you get it up to speed it’s great at crunching your data. Get the disks spinning forward as fast as you can. However, each time you want to analyze the data (say after adding, modifying or deleting data) you have to stream over the entire dataset. If your dataset is always growing, this means your analysis time also grows without bound.

So, how does Google manage to make its search results increasingly real-time? By displacing GMR in favor of an incremental processing engine called Percolator. By dealing only with new, modified, or deleted documents and using secondary indices to efficiently catalog and query the resulting output, Google was able to dramatically decrease the time to value. As the authors of the Percolator paper write, “[C]onverting the indexing system to an incremental system … reduced the average document processing latency by a factor of 100.” This means that new content on the Web could be indexed 100 times faster than possible using the MapReduce system!
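
A rough sketch of the difference in Python, under the simplifying assumption that the "index" is just a dictionary keyed by document ID (Percolator itself is built on transactions and observers over BigTable, which this toy ignores): the batch approach pays for the whole corpus on every run, while the incremental approach pays only for the delta.

```python
def analyze(text):
    # Stand-in for real document processing (parsing, link extraction, scoring, ...).
    return {"tokens": text.split()}

# Batch, MapReduce-style: reprocess the entire corpus on every run.
# Cost grows with total corpus size, even if only one document changed.
def batch_rebuild(corpus):
    return {doc_id: analyze(text) for doc_id, text in corpus.items()}

# Incremental, Percolator-style: process only what changed since the last run.
# Cost grows with the size of the delta, not the size of the corpus.
def incremental_update(index, upserts, deletes):
    for doc_id, text in upserts.items():
        index[doc_id] = analyze(text)
    for doc_id in deletes:
        index.pop(doc_id, None)
    return index

index = batch_rebuild({"a": "hello world", "b": "big data"})
index = incremental_update(index, upserts={"c": "new page"}, deletes=["b"])
print(sorted(index))  # ['a', 'c']
```

The 100x latency improvement quoted above comes precisely from that change in what each run has to touch.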

Coming from the Large Hadron Collider (an ever-growing big data corpus), this topic is near and dear to my heart. Some datasets simply never stop growing. It is why we baked a similar approach deep into the Cloudant data layer service, it is why trigger-based processing is now available in HBase, and it is a primary reason that Twitter Storm is gaining momentum for real-time processing of stream data.

Dremel for ad hoc analytics. Google and the Hadoop ecosystem worked very hard to make MapReduce an approachable tool for ad hoc analyses. From Sawzall through Pig and Hive, many interface layers have been built. Yet, for all of the SQL-like familiarity, they ignore one fundamental reality – MapReduce (and thereby Hadoop) is purpose-built for organized data processing (jobs). It is baked from the core for workflows, not ad hoc exploration.

In stark contrast, many BI/analytics queries are fundamentally ad hoc, interactive, low-latency analyses. Not only is writing map and reduce workflows prohibitive for many analysts, but waiting minutes for jobs to start and hours for workflows to complete is not conducive to the interactive experience. Therefore, Google invented Dremel (now exposed as the BigQuery product) as a purpose-built tool to allow analysts to scan over petabytes of data in seconds to answer ad hoc queries and, presumably, power compelling visualizations.

Google’s Dremel paper says it is “capable of running aggregation queries over trillions of rows in seconds,” and the same paper notes that running identical queries in standard MapReduce is approximately 100 times slower than in Dremel. Most impressive, however, is real world data from production systems at Google, where the vast majority of Dremel queries complete in less than 10 seconds, a time well below the typical latencies of even beginning execution of a MapReduce workflow and its associated jobs.
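
To give a feel for one of the reasons a Dremel-style engine can answer ad hoc queries so quickly, here is a caricature in Python of columnar versus row-oriented storage. This is only part of the story — Dremel also relies on a nested columnar format and a serving tree that fans queries out across thousands of servers — and the field names below are invented for illustration. The point is that an ad hoc aggregation touching two fields never has to read the rest of each record.

```python
# Row-oriented layout: every query must touch whole records.
rows = [
    {"country": "US", "latency_ms": 120, "url": "/home",   "bytes": 5120},
    {"country": "DE", "latency_ms": 80,  "url": "/search", "bytes": 2048},
    {"country": "US", "latency_ms": 200, "url": "/home",   "bytes": 7168},
]

# Column-oriented layout: each field is stored (and scanned) separately.
columns = {
    "country":    ["US", "DE", "US"],
    "latency_ms": [120, 80, 200],
    "url":        ["/home", "/search", "/home"],
    "bytes":      [5120, 2048, 7168],
}

# Ad hoc aggregation: average latency per country, reading only the two columns involved.
def avg_latency_by_country(cols):
    totals, counts = {}, {}
    for country, latency in zip(cols["country"], cols["latency_ms"]):
        totals[country] = totals.get(country, 0) + latency
        counts[country] = counts.get(country, 0) + 1
    return {c: totals[c] / counts[c] for c in totals}

print(avg_latency_by_country(columns))  # {'US': 160.0, 'DE': 80.0}
```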

Interestingly, I’m not aware of any compelling open source alternatives to Dremel at the time of this writing and consider this a fantastic BI/analytics opportunity.

Pregel for analyzing graph data. Google MapReduce was purpose-built for crawling and analyzing the world’s largest graph data structure – the internet. However, certain core assumptions of MapReduce are at fundamental odds with analyzing networks of people, telecommunications equipment, documents and other graph data structures. For example, calculation of the single-source shortest path (SSSP) through a graph requires copying the graph forward to future MapReduce passes, an amazingly inefficient approach and simply untenable at scale.

Therefore, Google built Pregel, a large-scale bulk synchronous parallel (BSP) system for petabyte-scale graph processing on distributed commodity machines. The results are impressive. In contrast to Hadoop, which often causes exponential data amplification in graph processing, Pregel is able to naturally and efficiently execute graph algorithms such as SSSP or PageRank in dramatically shorter time and with significantly less complicated code. Most stunning is the published data demonstrating processing on billions of nodes with trillions of edges in mere minutes, with a near linear scaling of execution time with graph size.
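
For a feel of the "think like a vertex" BSP model that Pregel popularized, here is a toy single-source shortest path in Python. In each superstep, every vertex with pending messages takes the smallest incoming distance, keeps it if it improves on its current value, and forwards updated distances to its neighbors; computation halts when no messages remain in flight. Pregel itself partitions the vertices across machines and exchanges these messages over the network; this single-process sketch mirrors only the programming model.

```python
import math

def pregel_sssp(graph, source):
    """graph: {vertex: [(neighbor, edge_weight), ...]}"""
    distance = {v: math.inf for v in graph}
    inbox = {v: [] for v in graph}
    inbox[source] = [0]                       # the source receives distance 0 in superstep 0

    while any(inbox.values()):                # run supersteps until no messages are in flight
        outbox = {v: [] for v in graph}
        for vertex, messages in inbox.items():
            if not messages:
                continue                      # vertex is inactive this superstep
            best = min(messages)
            if best < distance[vertex]:       # shorter path found: update and notify neighbors
                distance[vertex] = best
                for neighbor, weight in graph[vertex]:
                    outbox[neighbor].append(best + weight)
        inbox = outbox

    return distance

graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 6)],
    "c": [("d", 3)],
    "d": [],
}
print(pregel_sssp(graph, "a"))  # {'a': 0, 'b': 1, 'c': 3, 'd': 6}
```

Note that the graph is loaded once and only small distance messages move between supersteps, which is exactly what avoids copying the whole graph forward between MapReduce passes.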

At the time of writing, the only viable option in the open source world is Giraph, an early Apache incubator project that leverages HDFS and Zookeeper. There’s another project called Golden Orb available on GitHub.

In summary, Hadoop is an incredible tool for large-scale data processing on clusters of commodity hardware. But if you’re working with dynamic datasets, ad hoc analytics, or graph data structures, Google’s own actions clearly demonstrate that better alternatives to the MapReduce paradigm exist. Percolator, Dremel and Pregel make an impressive trio and comprise the new canon of big data. I would be shocked if they don’t have an impact on IT similar to the one Google’s original big three of GFS, GMR, and BigTable have had.

Mike Miller (@mlmilleratmit) is chief scientist and co-founder at Cloudant, and Affiliate Professor of Particle Physics at University of Washington.

Feature image courtesy of Shutterstock user Jason Prince; evolution of the wheel image courtesy of Shutterstock user James Steidl.

  1. Nathan Milford Saturday, July 7, 2012

    I feel like an important paradigm you missed is the Event Stream Processor, like Storm (http://storm-project.net/).

    See: http://blog.sematext.com/2011/09/26/event-stream-processor-matrix

    We’ve got a small Storm cluster we’re evaluating for real-time stuff. The plan is that as logs come off our Tomcat nodes, they stream to the back end via Kafka (http://incubator.apache.org/kafka/) where they go to both Storm for real-time analysis and Hadoop for more in-depth off-line analysis and research.

    IMHO, the combo of Storm and small, dense and power efficient ARM nodes (a la http://gigaom.com/cloud/see-what-cloud-can-do-dell-unveils-arm-servers/) will be the new hotness.

    1. Nikita Ivanov Saturday, July 7, 2012

      We’ve seen similar architectures using GridGain’s In-Memory Data Platform. As streaming data comes in, it gets stored in an in-memory cluster and in parallel gets saved into HDFS. The in-memory grid is used for sub-second analysis of TBs of data, while HDFS is used as a traditional data warehouse (offline analytics)… best of both worlds.

    2. Mike Miller Sunday, July 8, 2012

      Hi, I mentioned Storm above (in the Percolator section) but didn’t have time to go into detail. We’ve watched it grow since we were rubbing elbows and trading ideas in Boston with Backtype (YCS08 as well).

  2. Michael E. Driscoll Saturday, July 7, 2012

    You stated “I’m not aware of any compelling open source alternatives to Dremel… a fantastic BI/analytics opportunity.” I would encourage you to keep an eye on Metamarkets’ Druid, which Curt Monash recently covered: http://www.dbms2.com/2012/06/16/metamarkets-druid-overview/

    I strongly agree that ad-hoc exploration is a weakness of Hadoop-based analytics workflows, and a number of players are emerging to address this pain point, not least of which is SAP’s HANA appliance. Qliktech’s strength as a BI product is precisely its ease in enabling fast, exploratory queries, although it remains a desktop product.

    In the coming decade, the disruption to the BI market will be driven by those who deliver solutions, not tools. And those solutions won’t be delivered by a big box or confined to a Windows desktop; they will be cloud-backed, web-delivered services.

    (Disclosure: I am the CEO at Metamarkets, so I admittedly have a dog in this hunt).

    1. Steve Ardire Saturday, July 7, 2012

      Hi Mike, agree, but I word-smithed your conclusion a bit ;)

      In the coming decade, the disruption to the Big Data market ( which includes next gen BI ) will be driven by those who deliver solutions. And those solutions will be cloud-backed PaaS services.

      1. Sankar Nagarajan Wednesday, July 11, 2012

        Good point. The industry will need solutions similar to what Google has built, and there is good scope for ISVs, open source and commercial alike.

  3. Sorry, Miller, you’re making the wrong comparison. All of Google’s tools are proprietary, built for its own use. However cool they are, you can’t touch them, or you had better pay good money to use them.
    Hadoop and its ecosystem are completely open source. You can install them freely inside your company on your own hardware. They may be slower, but they do the job they’re asked to do more cheaply and efficiently.
    Your comparison makes no sense at all.

    1. Mircea Strugaru Sunday, July 8, 2012

      Tim, it is naïve to think that open source is…free. The cost required to master the technology, to develop on top of it and then to maintain and patch it – that cost is pretty big and often drives companies to purchase licenses for non-open-source tools. Therefore, it IS definitely worth comparing free and non-free technologies.

      1. good point man ;)

    2. I think that doesn’t do justice to open source. It is not merely the low-cost inferior alternative. The goal to aim for is to be better than proprietary solutions, or equally good at the least, and if it is not then there is work to be done! They are equals and I do not see why you could not compare them.

  4. Mike, Interesting post. My New York Times article “Beyond Hadoop: Next-Generation Big Data Architectures”, October 2010

    http://nyti.ms/9JnDlS

    makes similar points, contrasting Hadoop with Cloudscale, MPI, BSP, Pregel, Dremel and Percolator. As co-inventor of the bulk synchronous parallel (BSP) programming model, it has been obvious to me, and to others I know at Google, Facebook and elsewhere, that BSP, as a much more flexible generalization of both MapReduce and Pregel, may be a better place to start than the very restricted and inefficient MapReduce model, at least if you want to do realtime analytics, graph analytics, or big data supercomputing, as well as the kinds of embarrassingly parallel apps that Hadoop is currently used for. There’s more on this and other related stuff at cloudscale.com, my current startup. See also the BSP topic on Quora

    http://www.quora.com/BSP

    At Cloudscale we use Lustre rather than HDFS to get Hadoop MapReduce, realtime analytics, MPI, BSP, graph processing, etc. in a unified system that can offer extremely high performance and low latency at massive scale.

    Bill

  5. You forgot GraphLab: http://graphlab.org/

    It’s for doing algorithms on graphs similar to what Giraph does, but much more efficiently. The data may have to fit in the memory of the cluster (I’m not sure about that).

  6. > At the time of writing, the only viable option in the open source world is Giraph, an early Apache incubator project that leverages HDFS and Zookeeper. There’s another project called Golden Orb available on GitHub.

    The above was mentioned in the context of Pregel/graph processing. Apache Hama is another option that needs to be considered. The 0.5 release has Google’s Pregel model implemented. One thing to note is that while Giraph is only for graph processing, Hama is a pure BSP engine on which a lot can be done besides graph processing. Also, GoldenOrb has not been active since August 2011.

    Recently, Apache Hama/Giraph have moved from Apache Incubator to Apache TLP (Top Level Project).

    Although Apache Hama/Giraph can be used for graph processing, both of them have just a few graph processing algorithms implemented. But both of them provide a simple API for writing new graph processing algorithms easily. Also, there are few, if any, instances of these frameworks in production.

    Because of the ecosystem/$$$/hype around Hadoop, many try to fit Hadoop/MR as a solution for every problem. It’s time to look for other models also.

  7. Let me see if I understand this article. Google implements MapReduce. Open-source implements MapReduce. Google uses other technologies. (Open-source, of course, already uses other technologies.) Therefore, journalist reports that open-source MapReduce is dying. Journalist “hopes” that a few specific alternatives will get traction.

    Every part of this article that I see is either tautological, or misleading.

    1. Derrick Harris Sunday, July 8, 2012

      Actually, to be clear, Mike is a technologist and a physics professor. Had I written this, you’d be correct on the journalist part. As for open source MapReduce, I think Mike’s right that it’s already yesterday’s news, but that doesn’t mean it doesn’t have a place. But as you speak with companies that already have batch Hadoop workflows in place, they’re definitely looking for what’s next, with real-time and stream processing being chief among their desires.

  8. Jonathan Hendler Sunday, July 8, 2012

    Titan is a recent entry into graph databases. It is an open source distributed graph database that can use either Cassandra or HBase as the backend:

    http://thinkaurelius.github.com/titan/

    Great presentation here:
    http://www.slideshare.net/slidarko/titan-the-rise-of-big-graph-data

  9. Mikio L. Braun Monday, July 9, 2012

    Great overview article!

    Just wanted to point out the link to your slides is broken.

    1. Mikio L. Braun Monday, July 9, 2012

      You might probably also be interested in this blog post of mine where I discuss the issue of real-time processing: http://blog.mikiobraun.de/2011/10/one-does-not-simply-scale-into-realtime-processing.html

    2. mikebroberg Monday, July 9, 2012

      If you copy & paste the URL, it works correctly.

  10. rohit sharma Monday, July 9, 2012

    Very informative – thanks.
    Also, see Cassovary for executing complex algorithms on big graphs (billions of nodes): http://engineering.twitter.com/2012/03/cassovary-big-graph-processing-library.html
