‘My Hadoop is bigger than yours’

Money has turned the Hadoop community, once united under the Apache banner and the cuddly stuffed-toy-elephant logo, into something resembling a frat house: Everyone’s under the same roof, but there’s plenty of machismo to go around. I’m not sure it makes for good business, but it sure does make for good theater.

I was alerted to the latest round of Hadoop one-upmanship last week (hat tip to Alex Popescu of myNoSQL), when HPCC Systems (a company pushing an alternative framework to Hadoop, actually) claimed it had set a Terasort benchmark record. It said it had done so using a far smaller cluster than what was used in the previous record-setting test by SGI running Cloudera’s Hadoop distribution on SGI hardware.

The SGI benchmark was announced in October, and drew the attention of MapR Co-Founder and CTO M.C. Srivas a few weeks later. In a November blog post, Srivas questioned the validity of the SGI test because it only used a 100GB data set to run a benchmark designed to churn through a terabyte. By his reasoning, that’s problematic in large part because the test system had enough memory to fit the entire data set, meaning it didn’t have to deal with the slowdown caused by accessing data stored on disk.

Srivas claimed MapR had actually run the benchmark in the same time as SGI using half the hardware, only it did so using a terabyte data set. I have to assume Srivas isn’t too impressed with the HPCC record test, then. It was accomplished using even less hardware than MapR claims to have used — four nodes as opposed to 10 nodes — but it also was on a 100GB data set.

In early October, I wrote about a war of words (and graphs) that erupted between Cloudera and Hortonworks over which company contributed the most code to the Apache Hadoop project. Their frenemy status (or “coopetition,” as Hortonworks CEO Eric Baldeschwieler put it) is understandable, but of questionable benefit. As one developer pointed out in my post, the companies might be better off trying to improve Apache Hadoop rather than sniping back and forth.

Datameer Founder and CEO Stefan Groschupf — the man responsible for this post’s feature image and headline — agrees. He wrote on his blog in mid-October:

So now we find our partners and friends sparring over whose contribution is bigger than the others. Frankly, this is all surprising to me since we have so much more work to do to move Hadoop forward. Don’t get me wrong, we love Hadoop for what it is but we all can agree that the code is still a work in progress, monolithic, difficult to test and concepts like inversion of control do not exist… I could go on for a while.

All the companies involved claim to be doing well enough and certainly have raised a lot of venture capital, but Microsoft vs. Google, or Oracle vs. IBM, this is not. That being said, the prospect of luring large enterprise customers and growing to the size of any of those companies makes all this back-and-forth between Hadoop vendors a little more understandable.

Cloudera, Hortonworks, MapR and their peers aren’t just promoting an open-source project anymore; they’re pushing products, so I don’t see the boasts slowing down any time soon as Hadoop gets even more big-business attention.