Moving Hadoop beyond MapReduce

Table of Contents

  1. Summary
  2. Evolution of Hadoop
  3. The Hadoop ecosystem
  4. The importance of integration
  5. The skills challenge
  6. Pulling it all together
  7. Key takeaways
  8. About Paul Miller

1. Summary

Backed by an extensive open source community and significant investment from startups and more established technology businesses, Hadoop has evolved into a credible platform for supporting enterprise-class analytics at scale. Apache Hadoop was originally designed to excel at running batch MapReduce jobs over a large static data set on clusters of commodity hardware. Combined with a growing collection of associated projects and products, it is increasingly capable of far more. With Apache Hadoop 2.0, released in 2013, the project introduced a clear split between management of cluster resources and processing of data. The newly introduced YARN handles resource management across the cluster, and MapReduce has become just one of several tools with which a Hadoop cluster might process and analyze data. Alongside batch processing of static data with MapReduce, Hadoop is increasingly being used to process streaming data with tools like Apache Storm, to explore data interactively with applications that incorporate Apache Tez, and to work in conjunction with powerful in-memory frameworks like Apache Spark.

These technical advances make Hadoop far more than the one-trick pony it might once have been characterized as. Parallel innovations around data governance, security, and integration are transforming the Hadoop silo of old into an effective and integral piece of the enterprise IT estate.

This report discusses the capabilities of today’s Hadoop platform and explores ways in which businesses are using it to gain real insight and add real value.

  • In 2013, with the release of Hadoop 2.x, the Apache project separated MapReduce-based data processing from the generic management of cluster resources. The new YARN module in Hadoop has since enabled a set of data processing options independent of MapReduce.
  • YARN begins to position Hadoop as a viable tool for storing and processing a growing proportion of an organization’s data assets.
  • Issues around data governance and security are key to Hadoop’s broader adoption.
  • Companies like TrueCar and Neustar are embracing Hadoop to enable and accelerate their transformations into profitable data-based organizations.