
Summary:

The size of Hadoop deployments appears to have tripled since October, according to statistics Cloudera is sharing. If accurate, they support the assumption that Hadoop usage grows quickly once organizations wrap their heads around how to use it.


The size of Hadoop deployments appears to have tripled since October, according to statistics from Cloudera. If accurate, they support the assumption that Hadoop usage grows quickly once organizations wrap their heads around how to use it. This seems especially true in industries where better customer insights are tied directly to more revenue.

In his DBMS2 blog this morning, database expert Curt Monash quotes Cloudera VP of Customer Solutions Omer Trajman as saying that his employer counts 22 production Hadoop clusters (excluding non-Cloudera users Facebook and Yahoo) that each manage more than a petabyte of data. Additionally, Trajman said the average cluster size has grown to more than 200 nodes, roughly tripling since the company surveyed attendees at its Hadoop World conference in October 2010.

I relied in part on that Hadoop World survey in writing a GigaOM Pro report in March (subscription required). What’s interesting, if Trajman’s stats are accurate, is how fast the high end of Hadoop clusters has grown. At that point, the average cluster size was about 67 nodes, but it dropped below 40 nodes once you excluded three outliers claiming 999 or 1,000 nodes. Trajman notes that most clusters still have fewer than 30 nodes, but the growing number of large clusters is pulling the average up.
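To see how much a few very large responses can move that average, here is a minimal back-of-the-envelope sketch in Python. The node counts are hypothetical, chosen only so the arithmetic lands near the reported figures; they are not taken from the actual survey data.

```python
# Hypothetical survey responses: many small clusters plus three respondents
# claiming roughly 1,000-node clusters. All numbers are illustrative only.
typical_clusters = [35] * 88             # assumed ~35-node clusters
outlier_clusters = [999, 1000, 1000]     # the three reported outliers

all_clusters = typical_clusters + outlier_clusters
print(sum(all_clusters) / len(all_clusters))          # ~67 nodes: mean with outliers
print(sum(typical_clusters) / len(typical_clusters))  # 35 nodes: mean without them
```

Three responses out of roughly 90 are enough to nearly double the mean, which is why the with- and without-outlier figures diverge so sharply.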

Presumably, there are also more of those sub-30-node starter clusters, which will eventually grow toward the 200-node average.

This aligns with the findings of a Karmasphere survey conducted around the same time as the Hadoop World survey. In the Karmasphere survey, 32 percent of respondents expected to be managing a cluster of 10 to 49 nodes within a year, compared with 36 percent who said they were managing a cluster in that range at the time. However, 61 percent expected to be managing clusters of between 60 and 1,000-plus nodes within a year, compared with only 32 percent managing clusters of that size at the time of the survey.

Also noteworthy is Trajman’s observation that the biggest population of large clusters is in the advertising space. This shouldn’t be surprising: Using analytics to improve targeted advertising directly results in more money. We have highlighted this trend before, specifically as a part of video-advertising strategies, and, earlier this week, as a component of social-media-based ad targeting.

Of course, Facebook and Yahoo also rely heavily on Hadoop to serve the advertising programs that drive their revenues, as Google does with its own MapReduce-plus-Bigtable system, on which Hadoop is based. And they all have huge deployments spanning multiple clusters around the world; Yahoo alone now manages more than 42,000 Hadoop nodes.

The amazing thing about this purported growth is that we’re just getting started. Many see a billion-dollar-plus market for Hadoop products, but it’s nowhere near that size right now. As Hadoop best practices and technologies improve, along with the overall understanding of big data across industries, who knows how large deployments will become? Perhaps today’s SMB will be tomorrow’s Yahoo.

Feature image courtesy of Flickr user miheco


  1. Charles Zedlewski Wednesday, July 6, 2011

    We’ve been watching Apache Hadoop grow in our customers’ organizations for the past several years, and to me it looks a lot like what happened when laser printers became an office mainstay. Once it was very easy and affordable to print things, companies consumed a lot more paper.

    Same with Apache Hadoop. It’s much easier and more affordable for users to hold onto and manipulate large sets of data, and so they do. That’s one of the main reasons why, from the very beginning, we opted to price our support offering on a per-node basis rather than per terabyte, which was and still is the industry norm for analytic databases. We wanted to work with this impulse to retain and create more data, not fight against it. If you look at improvements in hard-disk density and at how more nodes have moved to a 12-drive configuration, the total cost per TB under management for data in an Apache Hadoop system has dropped ~80% over the past three years. It’s pretty exceptional for an enterprise system to deliver that kind of value for money.
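That ~80 percent figure is consistent with rough arithmetic. The sketch below uses entirely hypothetical numbers (node prices, drive counts and drive capacities are assumptions for illustration, not Cloudera’s data) to show how denser disks and a 12-drive configuration pull down the cost per TB:

```python
# All figures below are hypothetical, for illustration only.
old_node_cost, old_drives, old_tb_per_drive = 4000, 4, 1    # assumed node ~3 years earlier
new_node_cost, new_drives, new_tb_per_drive = 5000, 12, 3   # assumed 12-drive node, denser disks

old_cost_per_tb = old_node_cost / (old_drives * old_tb_per_drive)   # $1,000 per TB
new_cost_per_tb = new_node_cost / (new_drives * new_tb_per_drive)   # ~$139 per TB

drop = 1 - new_cost_per_tb / old_cost_per_tb
print(f"Cost per TB drop: {drop:.0%}")   # ~86%, in the same ballpark as the quoted ~80%
```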
