Cloudera released version 3.0 of its distribution of Apache Hadoop (CDH3) Tuesday, which the company calls the realization of its vision to offer an enterprise-ready Hadoop distribution. CDH3 is a big reason why, despite a recent spate of Hadoop-based big data products either on the market or about to be there, Cloudera says it isn’t sweating all the new competition. Another is that Cloudera doesn’t think competitive vendors have what it takes to cut into Cloudera’s business. Both reasons have some merit, but can Cloudera really afford to laugh off a group of alternative products that’s spreading like wildfire?
Cloudera VP of Product Charles Zedlewski says CDH3 is far superior to any other Hadoop distribution because it already integrates a large suite of tools for improving the Hadoop experience, including Hive, Pig, Oozie and the HBase NoSQL database. Even users who have downloaded the Apache Hadoop distribution, where the majority of these tools originate, must manually configure and integrate them with the core Hadoop MapReduce and HDFS components, but CDH3 takes care of all that work.
In addition, he explained, CDH3 supports more platforms than ever before, including the Amazon Web Services and Rackspace clouds, as well as a greater number of operating systems. Essentially, Zedlewski said, it’s a matter of offering a complete Hadoop system rather than a better package with which to compile Hadoop tools.
That’s all great, but, as I explained last month after we learned about several new and forthcoming Hadoop-based products from startups and large vendors alike, Cloudera isn’t operating in a vacuum. In fact, some would argue it’s actually operating with a faulty core, because the Apache Hadoop architecture includes single nodes for both Hadoop MapReduce and HDFS that handle all traffic and assign tasks, resulting in inherent concerns about performance bottlenecks and single points of failure. These concerns have sparked third-party alternatives that try to ameliorate the problems, including Appistry’s CloudIQ Storage Hadoop Edition product, which is a wholly distributed file system intended to replace HDFS, and MapR, a stealth-mode startup that appears to be offering a similar product.
Zedlewski isn’t impressed. Offering a scathing review of Appistry’s product (which, by the way, is only one offering of Appistry’s portfolio) — actually, of all HDFS alternatives — Zedlewski compared it to the ill-fated ParaScale file system. ParaScale was a VC darling a few years ago, but funding ran out last year and the company undertook an assignment for the benefit of creditors (essentially, a state-law asset sale that’s an alternative to federal bankruptcy). Cloudera passed on those assets, Zedlewski noted, adding that the unlikely prospect of Appistry’s product as an alternative to HDFS will be proven “when we get another offer to buy those assets.”
He’s not too worried about IBM, either, saying, “every Red Hat needs a SUSE.” In Zedlewski’s analogy, of course, Cloudera is Red Hat and IBM is SUSE (Novell’s less-successful attempt at a commercial Linux OS). But for IBM even to fill that role, he explained, it would first need to get active in the Apache Hadoop community and push Cloudera by actually helping make Hadoop better. At this point, though, IBM only offers a distribution of Hadoop that includes Apache MapReduce and IBM’s GPFS as the file system, a far cry from all the tools baked into CDH3. And, Zedlewski added, he has yet to find an IBM employee who has ever contributed to Apache Hadoop.
On this point, there might be little room for debate. IBM has acknowledged to me during previous discussions that it hasn’t yet ironed out how it wants to get involved with Apache with regard to Hadoop. Cloudera, for its part, is among the leading Apache Hadoop contributors, along with Yahoo and Facebook. As for the criticism of IBM’s Hadoop distribution, it might be warranted for the time being, but, as I explain in a recent report, IBM is adamant that Hadoop is just one component of a larger big data strategy and its productized Hadoop distribution will be full of enterprise-grade management features.
Zedlewski thinks Cloudera can be the leader in the proprietary Hadoop-management software realm, too, despite the promise of competition from IBM and, now, Platform Computing. Although he acknowledges the pool of users for the company’s core (and free) product, CDH3, is a factor larger than the pool of paid Cloudera Enterprise users, he said the company has had no problem attracting customers for Cloudera Enterprise. Just over a year into selling Cloudera Enterprise, Zedlewski says it accounts for more than half of Cloudera’s revenue, which is rare for an open source company.
He said this activity around Hadoop is just validation of Cloudera’s vision, and it doesn’t change the fact that Apache Hadoop and the related projects also continue to evolve. Yahoo already is pushing suggestions for eliminating some performance issues, for example, and HBase is gaining popularity rapidly, thanks in large part to Facebook’s extensive use. In fact, Zedlewski said, Cloudera’s paying customers use HBase “at least 30 percent” of the time. Only CDH currently includes the latest and greatest in HBase developments.
As I’ve stated before, Cloudera has plenty of reasons to be confident, but I think the strong words from Zedlewski and other Cloudera executives about its competitors suggest that it’s actually feeling the heat more than it’s letting on. And even if it really isn’t sweating yet, the fire is about to get a lot hotter, so we’ll see for how long it keeps its cool.
Image courtesy of Mick Borroff.