Weekly Update

Hadoop gets more versatile, but data is still king

Hadoop, the open source big-data framework, has gradually evolved from being a shiny object in the laboratory to an increasingly serious tool finding its place in the enterprise. At Gigaom, we’ve covered Hadoop’s increasing maturity and completeness as an enterprise technology, because that’s been the story. But now the story is changing.

The change emanates from the release of Hadoop 2.0 and the rapid standardization on that release’s new resource management engine, YARN. YARN moves Hadoop out of the world of batch processing and into the interactive world. We’ve covered that too. But while the change in the story builds on that facet of YARN, it pivots rather dramatically.

It’s a YARN world, we’re just computing in it

Hadoop used to be a self-contained product. Now it’s a platform – a stage that showcases a number of other products. Whether it’s Spark or Storm or Cascading or Drill or whatever the next big thing in open source project land is, it’s going to run on YARN, it’s going to use HDFS storage and it’s probably going to gather momentum and broad acceptance pretty quickly.

That’s a pretty radical change. It moves Hadoop down the stack. It converts Hadoop from being an exposed technology that’s a developer destination to an embedded technology that will become a developer service. Suddenly, Hadoop’s role is quite similar to that of a runtime, or even of an operating system. And once that happens, Hadoop needn’t be a Big Data tool exclusively. Instead, it can just become a general-purpose platform for distributed computing and distributed storage. Hadoop can be a workflow engine, it can be a service bus, it can be an object broker, it can be a content management system.

From BD to SI?

If Hadoop does become all those things, what will that mean for Hadoop distribution vendors, like Cloudera, Hortonworks and MapR? Will they still be big data companies, or will they become OS vendors that double as systems integrators for distributed computing implementations?

I suppose you could argue that in fact these companies already are integrators: a big part of their job is picking and choosing what Hadoop-affiliated components to include in their distros and making sure they all work together. Since all of these components are open source, customers could do this work themselves. But if the Hadoop vendors do it for them, then they provide quite a valuable integration service.

Identity crisis

So if Hadoop won’t be a big data tool anymore and Hadoop vendors won’t be big data companies, then what’s really going on? Where’s the equilibrium? Where’s the center of gravity in the Hadoop ecosystem? Is Hadoop really just about an abstraction layer for distributed processing, and a bunch of grease monkey work to hook it all together? Is the Hadoop space careening toward a bland and agnostic world of general purpose computing?

It’s true that a big part of Hadoop’s success is that it essentially provides for supercomputing and robust storage on commodity hardware, rather than proprietary appliances requiring military-level spending. So the lure of Hadoop as a general-purpose computing medium is strong, because it disrupts a pretty expensive status quo in enterprise infrastructure.

Take it back home

But for Hadoop, data is still where it’s at. Data is Hadoop’s center of gravity. Data is the thread that ties the motley crew of open source projects together. Data is still the motivator, and data is still the prize. Because when you have a general purpose compute and storage framework with compelling economics, you also have a place where data accumulates, even when it’s merely a byproduct of what looks like a non-data-focused application.

Hadoop is a data collecting point, and it’s also a place where code that analyzes all of that collected data can run really well. That’s critical, because data is the lifeblood of all business software. In fact, as even Hadoop 1.0 proved, data is also the lifeblood of business itself.

When you have a compute and storage framework that embraces data’s primacy – that celebrates it, obsesses over it, exploits it, shares it and facilitates a virtuous cycle around it – then you have a winner. YARN has facilitated that win. That’s the new story.