
What it really means when someone says ‘Hadoop’


Big data is among the hottest trends in IT right now, and Hadoop stands front and center in the discussion of how to implement a big data strategy. There’s just one problem that keeps cropping up: many people don’t seem to know exactly what it means when somebody says “Hadoop.”

The problem surfaced again Monday in the form of complaints over Forrester’s new report titled “Enterprise Hadoop Solution, Q1 2012.” InformationWeek spoke with a few vendors that didn’t like how their products were assessed, and database industry analyst Curt Monash says the report “compares apples, peaches, almonds, and peanuts.” I thought the same thing when I saw a copy of the report last week. They all focus on Hadoop, but Hortonworks is not Datameer is not HStreaming.

Allow me to explain. Hopefully, this provides a foundation for parsing what people talk about when they talk about Hadoop, and for differentiating one type of product from another. (And you can learn even more about Hadoop and how it’s used at our Structure: Data conference taking place next month in New York City.)

What Hadoop is

I went into this in more detail in a GigaOM Pro report published last March (sub req’d), but the long and short is that Hadoop is, at its core, an Apache Software Foundation project consisting of two primary subprojects — Hadoop MapReduce and the Hadoop Distributed File System. MapReduce is the parallel-processing engine that allows Hadoop to churn through large data sets in relatively short order. HDFS is the distributed file system that lets Hadoop scale across commodity servers and, importantly, store data on the compute nodes in order to boost performance (and potentially save money). These are the two must-have components for any Hadoop distribution.
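To make the MapReduce half of that concrete, here is a minimal sketch of the programming model, simulated locally in plain Python (this is an illustration of the concept, not the actual Hadoop API; real Hadoop runs many mapper tasks in parallel against HDFS blocks and shuffles intermediate pairs across the network between nodes):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word, like a Hadoop mapper."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group intermediate values by key (the framework does this)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word, like a Hadoop reducer."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big hadoop", "hadoop stores big data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 3, 'data': 2, 'hadoop': 2, 'stores': 1}
```

Because each map call and each reduce call is independent, the framework can scatter them across hundreds of commodity servers, which is the source of Hadoop's scalability.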

There are also a number of Apache projects related to Hadoop, often built atop either Hadoop MapReduce or HDFS. These include — but are not limited to — Hive, a SQL-like query language, and Pig, a high-level dataflow language, both of which provide data-warehouse-like capabilities to a Hadoop cluster, as well as HBase, a NoSQL database that leverages HDFS as its distributed storage engine.
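HBase's logical data model is worth a concrete picture, since it comes up again below: a sorted, sparse map from row key to column-family:qualifier to value. The following is a rough toy sketch of that model in plain Python — hypothetical names, not the real HBase client API, and it omits cell versioning and the HFiles that real HBase persists to HDFS:

```python
class ToyHBaseTable:
    """Toy model of an HBase table: row key -> {"family:qualifier": value}.
    Illustrative only; not the real HBase API."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value):
        # Insert or overwrite a single cell.
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        # Point lookup of one cell; returns None if absent.
        return self.rows.get(row_key, {}).get(column)

    def scan(self, start, stop):
        """Range scan over row keys, mimicking HBase's sorted-row scans."""
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, self.rows[key]

table = ToyHBaseTable()
table.put("row-001", "info:name", "alice")
table.put("row-002", "info:name", "bob")
print(table.get("row-001", "info:name"))  # alice
```

The sorted row keys are what make fast range scans possible, and they are also why row-key design matters so much in real HBase schemas.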

Hadoop distributions

These are packaged software products that aim to ease deployment and management of Hadoop clusters compared with simply downloading the various Apache code bases and trying to cobble together a system. Presently, Cloudera, Hortonworks, MapR and EMC all offer their own Hadoop distributions. Although each is distinctive — sometimes markedly so, as with MapR's proprietary file system — they all package a set of Hadoop projects (MapReduce, Hive, Sqoop, Pig, etc.) in a way that, in theory, makes them integrate more naturally and run both smoothly and securely.

Many Hadoop distributions integrate with various data warehouses, databases and other data-management products, with the goal of moving data between Hadoop clusters and other environments so each might process or query data stored in the other.

Hadoop management software

Just as the wording implies, Hadoop management software is designed to make it easier to manage and troubleshoot a Hadoop cluster. Such products are usually sold or offered by companies peddling Hadoop distributions, because even when commercially packaged, Hadoop is still a complex architecture and somewhat foreign to most IT personnel and products. However, third parties such as Platform Computing (now part of IBM) and Zettaset also sell software for managing Hadoop clusters, and their products are typically agnostic about which distributions they support.

But distributions and management software are all about the infrastructure and the platform. Anyone actually wanting to use Hadoop still needs to know how to write applications that leverage the underlying architecture.

Hadoop application software (or, products that use Hadoop)

The Hadoop ecosystem gets really complex when we start looking at products that exist to help developers write Hadoop applications, or that otherwise analyze data stored within Hadoop without requiring hand-written MapReduce jobs. These range from abstraction layers such as Karmasphere Analyst or IBM InfoSphere BigInsights, to Hadapt, which offers a single-platform product fusing a SQL data warehouse with a Hadoop cluster, to HStreaming, which promises real-time processing and analytics.

The one common thing among all these products, however, is that they are not Hadoop distributions, but sit atop platform software from Hortonworks, EMC or whomever. Some products that get thrown into the Hadoop fray, such as Outerthought Lily or Drawn to Scale Spire, are essentially scale-out databases built atop HBase (which itself is a separate project built atop HDFS). The image below, from Karmasphere, gives a particularly clear map of how a Hadoop environment might look.

The applications and analytics space is probably where we'll see the biggest influx of new companies, as writing Hadoop applications is still tough, but it's also how companies will actually start experiencing direct business benefits. In fact, it's these types of higher-level products that are the focal point of Accel Partners' new big data fund.

12 Responses to “What it really means when someone says ‘Hadoop’”

  1. Nice overview.

    Hadoop and Risk
    Hadoop appears to follow a. Familiar and successful silicon valley pattern that started with the creation, promotion and sales of databases, ERP systems, Yahoo-like portals, Google-like advertising engines, CRM applications and most recently Social Media communities like Twitter and Facebook.

    In each case competing vendors created their own products and educated the market about how their offerings were different and better.

    Now vendors are smarter, they first say how they use Hadoop as a critical value element and then they differentiate by highlighting their unique value.

    Hadoop is more than a product, it has become a cue for success and Risk mitigation.

    Hadoop is a successful product and in Risky times everyone wants a guarantee.

    Hadoop is a ‘success cue’ for companies with the resources to learn how and where to fit it into their business model.

    Odd, it seems that companies flock to Hadoop to avoid Risk but take on new Risk in learning, building and maintaining.

    This behaviour pattern shows the power of well-placed cues, and why cues have to be found and measured by sellers and buyers to achieve the outcome that they need rather than what's good for someone else.

    Nick @ManyCUES

  2. Derek, your Hadoop taxonomy seems more informed and accurate than Forrester’s. What’s often lost in the Hadoop discussion is the question of how data gets into a Hadoop infrastructure in the first place. A case in point – the Karmasphere infographic shows example data sources (the blue boxes) and Hadoop (the yellow box). If we’re to believe the diagram, between the blue and the yellow, a miracle happens.

    To be clear, I’m not picking on Karmasphere. Their diagram serves a purpose and is necessarily simplistic. My point is that miracles don’t happen; organizations need to be incredibly thoughtful about what data they’re going to deliver to Hadoop, where it’s coming from and how it’s connected.

    Equally important is the reality that some data (e.g., from sensor scans, wireless apps, systems monitors, etc.) travels at such high velocities that it will quickly swamp a typical Hadoop/HDFS infrastructure. For high-velocity “firehose” applications, the data tier needs a front-end cache that can ingest incoming data at very high speeds, manage that data statefully for real-time analytics, and deliver it to Hadoop in a controlled way. Products like VoltDB (I work for the company) offer an excellent solution for managing the “impedance mismatch” between high-velocity data sources and high-volume analytic infrastructures like Hadoop.

  3. Flavio Villanustre

    As you correctly point out, the current Hadoop ecosystem is quite complex, and this complexity is not without its problems, which range from the pain of ensuring that the different versions of multiple components are compatible to the overall deployment, management and operation of the system.

    The HPCC Systems platform, on the other hand, provides a cohesive and consistent architecture, covering all the functionality in your yellow and green boxes in the diagram above, without requiring any external parts. It is neither based on Hadoop nor part of the Hadoop ecosystem, but a complete replacement (and an upgrade, IMO), which doesn’t suffer from the diversification problem that I mentioned before.

  4. Infobright

    Excellent, clear overview of Hadoop and its ecosystem. Between row databases, columnar databases, NoSQL, Hadoop and even “NewSQL,” there is a lot of confusion out there.