
Summary:

It turns out that “big data” isn’t just a buzzword, but a legitimate concern for companies across the board. Their interest in the tools to take advantage of the opportunity for data analysis has sparked a land grab among software vendors centered around Hadoop.

fighting elephants

It turns out “big data” isn’t just a buzzword, but a legitimate concern for companies across the board. Their interest in the tools to take advantage of the opportunity for analysis of all this data has sparked a land grab among established vendors and startups alike. The action is centered around Hadoop, the flagship technology for storing and processing large amounts of unstructured data.

Since Yahoo open-sourced Hadoop a few years ago, the primary options for organizations wanting to take advantage of the product have been the open-source Apache Hadoop distribution, the Cloudera distribution of Hadoop, and Amazon Web Services’ Elastic MapReduce service. That will change soon, as everyone from EMC and IBM to database startups like Hadapt and DataStax gets into the business of selling Hadoop-based technologies and services.

So far, Cloudera, which provides commercial support for its open-source distribution, as well as its own proprietary Hadoop-cluster management software, has been the only company to truly capitalize on Hadoop financially. Arguably, its success is to blame for the stiff competition it’s about to face for companies’ Hadoop attention and dollars.

Too Many Distributions

Cloudera, a private company, hasn’t released any financial details, but Wednesday at Structure: Big Data, VP of Engineering Amr Awadallah mentioned during a panel that Cloudera has more than 80 customers running Hadoop in production. The company has technology partnerships across the data world, including with leading data warehouse, BI and database vendors. Cloudera has also raised $36 million from investors since launching in 2009. It appears the other software companies have noticed all the activity around Cloudera and want in on some of the action.

IBM already has a Hadoop business that includes its own distribution, which it says is better suited for commercial users than the open-source Apache Hadoop distribution, though both IBM’s and Cloudera’s distributions are based on the Apache distribution. IBM’s offering also provides an application called InfoSphere BigSheets, which hides the complexities of Hadoop underneath a variety of advanced analytics, BI and visualization tools. Based on a few sources I spoke with at Structure: Big Data, and after reading into an advertisement in the conference program, it looks like EMC is getting into the game. The ad hints that EMC will announce a Hadoop product involving its new Greenplum database on May 9; it read, “05.09.11: EMC Greenplum. Apache Hadoop.” Also at the event, two independent sources suggested members of Yahoo’s Hadoop team will be spinning off their own separate business, and there is speculation this move is somehow tied into EMC’s Hadoop plans.

IBM isn’t to be taken lightly, nor is EMC on its own, but a Yahoo spinout would be a potentially market-changing situation given the Hadoop know-how within Yahoo, which has contributed the majority of the code now included in Apache Hadoop. During a panel at Structure: Big Data, Yahoo’s VP of Cloud Architecture Todd Papaioannou quipped to Cloudera’s Awadallah that Yahoo will keep innovating on Hadoop and everyone could keep reselling it. Papaioannou declined to comment on the rumors of a Hadoop spinout, but did tell me via email, “I think Apache Hadoop will remain the go-to place to get access to new improvements and innovation in the core Hadoop platform. That’s exactly why we announced our ‘double down’ strategy and the work we are doing on the next generation of both Map Reduce and HDFS.”

Death by a Thousand Startups

It’s not only large vendors that Cloudera will have to fight off; its real threat is death by a thousand startups and ISVs. At Structure: Big Data, NoSQL startup DataStax announced its own open-source Hadoop distribution based on the NoSQL database Cassandra, which provides a replacement for the Hadoop Distributed File System (HDFS). DataStax says this gives users the ability to process data and feed it back to applications at extremely low latencies, which Cloudera can’t offer because Apache Hadoop environments currently reside on separate infrastructure from application servers and databases. Om wrote earlier about MapR, a startup focused on improving the performance and reliability of HDFS. Appistry is already addressing this with its own wholly distributed HDFS alternative.

Launches weren’t over yet. Another database startup called Hadapt officially launched Wednesday with a product that melds the HDFS-based HBase database with traditional RDBMS capabilities. HBase is an Apache Hadoop subproject heavily used by Facebook, and included as part of Cloudera’s Hadoop distribution. And next Tuesday, high-performance computing pioneer Platform Computing — which has a presence in many large financial data centers and 10 of the top 20 Fortune companies — will be announcing an analytics offering that applies its current cluster- and grid-management capabilities to MapReduce workloads. As noted above, management tools are where Cloudera actually makes money selling software as opposed to services.

There are several commercial alternatives to Apache Hadoop MapReduce, as well. Pervasive Software’s DataRush software is designed for writing big data workflows that take full advantage of multi-core processors. And Cascading, an open-source data-processing API, sits atop MapReduce. A startup called Concurrent offers commercial support and services for Cascading. Amazon Web Services offers a cloud-based Hadoop service called Elastic MapReduce, which spares users the cost of buying their own gear on which to run Hadoop workloads.

Confused? Here’s a round-up of currently available Hadoop distributions:

Full-on distributions

  • Apache Hadoop
  • Cloudera’s Distribution including Apache Hadoop (that’s the official name)
  • IBM Distribution of Apache Hadoop
  • DataStax Brisk
  • Amazon Elastic MapReduce

HDFS alternatives

  • MapR
  • Appistry CloudIQ Storage Hadoop Edition
  • IBM General Parallel File System (GPFS)
  • CloudStore

Hadoop MapReduce alternatives

  • Pervasive DataRush
  • Cascading
  • Hive (an Apache subproject, included in Cloudera’s distribution)
  • Pig (a Yahoo-developed language, included in Cloudera’s distribution)

Cloudera Isn’t Flinching — Yet

Even with all this competition, however, it’s unclear whether Cloudera actually feels its iron grip on the commercial Hadoop world slipping away. CEO Mike Olson thinks a rich ecosystem of Hadoop companies is necessary if it’s to grow into a multi-billion-dollar business like he thinks it can, but he sees most of that activity taking place up the stack from the foundational distribution layer where Cloudera operates. He said via email, “I believe there’s an enormous opportunity for smart companies, and even open-source projects, to build a new generation of data analysis tools on top of that platform.”

His colleague Awadallah was slightly less politic in his response when asked specifically about the DataStax distribution, stating in a video interview with my colleague Stacey Higginbotham Wednesday that he thinks DataStax’s distribution is a “big mistake,” and he doesn’t think the company can yet back up its claims of Hadoop support. He added that a better alternative to trying to reinvent the wheel in terms of Hadoop support and stability would have been for DataStax to keep its focus on Cassandra and partner with Cloudera on the Hadoop integration.

Cloudera has plenty of reason to be confident, actually. Among its ranks are Hadoop creator Doug Cutting and former Yahoo colleague Awadallah, as well as Chief Scientist Jeff Hammerbacher, who previously led Facebook’s massive data efforts, and Vertica veteran Omer Trajman. Olson himself is the former CEO of SleepyCat Software, which distributed the open-source Berkeley DB database before Oracle bought the company in 2006. Or, as Awadallah put it, “[W]e have the muscle to be able to back up our words with execution.” Further, as long as Facebook and Yahoo continue contributing their webscale-driven (and proven) enhancements back to Apache Hadoop, Cloudera has plenty of fuel to feed its evolution. Facebook, for example, is responsible for the popular Hive query language that gives Hadoop users a SQL-like experience many prefer to MapReduce, and, as noted above, Yahoo is currently pushing for a next-generation architecture that addresses some known performance bottlenecks with Apache Hadoop.

But the threat is real. Cloudera has partnerships with many analytics vendors, but none of the companies mentioned here are operating up the stack from Cloudera. They’re all addressing the foundational HDFS, Hadoop MapReduce and cluster-management areas where Cloudera presently does business (although IBM and EMC are operating up the stack with analytics software, too). With so many options available — and with Apache Hadoop code open to anyone who wants to use it — every vendor with aspirations of making big money in Hadoop is going to have to work extra hard to convince users they’re adding value worth paying for.

Image courtesy of Flickr user NileGuide.com.

  1. Great roundup, Om. Cloudera has a strong lead on Hadoop. The big picture is that there is plenty of opportunity in terms of market share above the stack. So I don’t think a war is necessary in this business; there is tons of space to make money for all the companies that you mentioned.

    1. Thanks, John. The piece is by Derrick Harris. I can’t take the credit for the job well done.

  2. Your reference to IBM’s GPFS – might you mean “General”, rather than “Global”? I was unable to find any references to “Global Parallel File System”, but did find many with “General…”.

    http://www.clintsherwood.com/GPFS.pdf

  3. Derrick really gets it. Reflected everything that I heard at Structure and then some. Great round up.

  4. Derrick, love the Flickr photo you picked :)

  5. You can have alternative filesystems to HDFS, but how close do they come in $/TB for reliable storage? The strength of HDFS is not just its scale, but the cost of that scale: EMC will find that hard to compete against. Every terabyte stored in HDFS is a terabyte that EMC or IBM GPFS don’t get you paying for in hundreds of dollars.

  6. Good overview, however I strongly disagree with the term Hadoop War.

    - Hadoop is not a finished product; it never made it to Release 1. It basically stalled at 0.20.2 or 0.21.0, depending on who you ask.
    The reason is that Hadoop is an extremely complex product. There are so many open bugs that, even if they were solved, it would be a huge effort to test the whole system again and verify everything still works.
    No one will do this right now.
    - There is high demand for a solution like Hadoop, just because of the growth of the internet. And, Hadoop being a complex product, users need support and guidance to use it.
    It’s just a question of supply and demand that companies like Cloudera and DataStax are coming up. I would not say this is a war. We have Red Hat Linux because there was a demand for a stable Linux with support. Nobody calls this a war …

    1. Derrick Harris Monday, March 28, 2011

      Joachim,

      I think you cut off the Red Hat analogy too soon. If Cloudera is Red Hat for Hadoop, one might call IBM the Novell of Hadoop, and DataStax the Ubuntu of Hadoop. Point being, demand is creating a number of suppliers all vying to be the OS, if you will, of Hadoop environments.

  7. Pig and Hive are both Apache projects. They (and Cascading) build on top of Hadoop MapReduce to provide higher level abstractions rather than being alternatives. 75% of MapReduce jobs at Yahoo are in fact Pig jobs and 90% of MapReduce jobs at Facebook are Hive jobs.

    1. Derrick Harris Monday, March 28, 2011

      Duly noted. I call them alternatives because they give users an alternative option to writing directly in MapReduce. To the degree they simplify the experience for certain users, they might minimize the importance of any given MapReduce implementation or proprietary query language.

  8. Derrick — great stuff on this article. It’s awesome to see Cloudera get this sort of attention. Big Data should be the focus of technology shops, vc’s, and entrepreneurs as we move into the future; it’s just a bigger problem than people recognize. Kudos to Amr & Jeff @ Cloudera for building a great team and having the vision to see how “big” Big Data will be in the years to come.

  9. Just to be clear, Yahoo! never opensourced Hadoop. Hadoop was started by Doug Cutting and Mike Cafarella as part of the Apache Nutch project (an Apache Lucene related project) to help solve scaling out large web-crawling jobs and it was done well before Doug ever joined Yahoo!

  10. Charles Papir Tuesday, March 29, 2011

    One of the most effective and longest-lived “Big Data” companies I have ever come across is 1010data LLC (1010data.com), and no one has ever mentioned them in any of their articles. The depth of experience and markets is impressive.

