
Summary:

Hadoop is a very valuable tool, but it’s far from perfect. While Apache, Cloudera, EMC, MapR and Yahoo focus on core architectural issues, there is a group of vendors trying to make Hadoop a more fulfilling experience by focusing on business-level concerns such as applications and utilization.


Hadoop is a very valuable tool, but it’s far from perfect. One potential concern for businesses whose primary products don’t end in .com is that it was built with the web in mind. That means it was designed for massively scaled architectures and to handle petabytes of data, and that implementing and managing it can require a team of Ph.D.-level engineers.

For companies that don’t operate at webscale, certain aspects of Hadoop look like overkill, and getting it to address their specific needs looks like a Herculean chore. For these companies, Hadoop just has to be useful.

While vendors including Cloudera, EMC, MapR and Yahoo, as well as the Apache Software Foundation, are working (some might say fighting) to shape distribution-level Hadoop concerns such as the core MapReduce and file-system architectures, vendors up the stack are focusing on business-level concerns, often without regard for what’s running underneath. Two of them — Platform Computing and Pervasive Software — released new products this week.

The new Platform MapReduce product aims to bring the company’s expertise in distributed-systems management to Hadoop clusters. I detailed Platform’s MapReduce vision in March, and the new product is a first step toward delivering on that vision. Platform supports numerous MapReduce and file-system distributions, but it’s the application-level management features that make it shine.

Probably chief among them is the ability to maximize utilization of Hadoop hardware by running multiple applications on the same cluster. With advanced scheduling features and, the company claims, “10,000 priority levels,” numerous applications can run simultaneously while Platform’s software determines where each one runs and which can access the shared data at any given time.
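
Platform hasn’t published the internals of that scheduler, but the general idea is easy to picture. Below is a toy Python sketch of priority-based dispatch on a shared pool of compute slots; the job names, priorities and slot counts are made up, and none of this reflects Platform’s actual code.

```python
import heapq

# Toy illustration of priority-based scheduling of multiple jobs on one
# shared cluster. Job names, priorities and slot counts are hypothetical;
# this is not Platform's scheduler, just the general shape of the idea.

class SharedCluster:
    def __init__(self, total_slots):
        self.free_slots = total_slots
        self.queue = []  # heap ordered by negated priority, so higher priority runs first

    def submit(self, name, priority, slots_needed):
        heapq.heappush(self.queue, (-priority, name, slots_needed))

    def dispatch(self):
        """Start queued jobs, highest priority first, until slots run out."""
        started = []
        while self.queue and self.queue[0][2] <= self.free_slots:
            neg_priority, name, slots = heapq.heappop(self.queue)
            self.free_slots -= slots
            started.append((name, -neg_priority, slots))
        return started

cluster = SharedCluster(total_slots=100)
cluster.submit("nightly-etl", priority=9000, slots_needed=60)
cluster.submit("ad-hoc-query", priority=200, slots_needed=30)
cluster.submit("experiment", priority=50, slots_needed=30)

for name, priority, slots in cluster.dispatch():
    print(f"running {name} (priority {priority}) on {slots} slots")
```

A real cluster manager would also have to handle preemption, fairness and data locality, which is where the hard engineering lives.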

Pervasive, a data-integration specialist that has turned its attention to Hadoop, takes a different approach to catering to enterprise customers. For a while now, it has been selling a product called DataRush, which lets users write MapReduce workflows optimized for multicore processors, resulting in faster performance on fewer nodes. Yesterday, it announced early access for a product called TurboRush, which brings those same capabilities to Apache Hive, the Facebook-created tool for bringing traditional SQL database and data warehouse features to Hadoop clusters.
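
To make that single-node-parallelism premise concrete, here is a minimal Python sketch of the general idea behind DataRush-style multicore processing: split a map-style job across all the cores of one machine and merge the results, rather than spreading it across more servers. The word-count workload and chunking scheme are hypothetical, and this uses the standard multiprocessing module, not Pervasive’s engine.

```python
import os
from collections import Counter
from multiprocessing import Pool

# Minimal sketch of the multicore premise: use every core on one node for
# the map phase instead of adding more nodes. Plain Python multiprocessing,
# not Pervasive's engine; the word-count workload is made up for illustration.

def count_words(lines):
    """Map step: count word occurrences in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def chunked(items, n_chunks):
    """Split a list into roughly equal chunks, one per worker."""
    size = max(1, len(items) // n_chunks)
    return [items[i:i + size] for i in range(0, len(items), size)]

if __name__ == "__main__":
    n_cores = os.cpu_count() or 1
    lines = ["big data on a small cluster", "more cores fewer nodes"] * 10000
    with Pool(n_cores) as pool:  # one worker process per core
        partials = pool.map(count_words, chunked(lines, n_cores))
    total = sum(partials, Counter())  # reduce step: merge the per-core counts
    print(total.most_common(3))
```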

The premise behind DataRush and TurboRush is compelling because although it’s fascinating to talk about Hadoop clusters at Yahoo and Facebook that span tens of thousands of nodes and store petabytes of data, most businesses don’t have those webscale needs. As I noted in a recent GigaOM Pro report on Hadoop (subscription required), the average cluster size among Hadoop users appears to be less than 50 nodes and nowhere near petabytes of data. There’s no use buying, managing or paying the power bill for more servers than necessary if a relatively small number of multicore machines could do the trick.

With Yahoo’s Hadoop Summit taking place tomorrow, we’re destined to see lots of news about new distributions, tools and services. And although distributions and architectural innovations rightfully steal much of the spotlight, the work higher up the stack matters too. Neither Platform, Pervasive nor any other vendor has all the answers when it comes to Hadoop, but the good news is that whatever your specific concern, someone is likely either selling or working on a solution to it.

Image courtesy of Flickr user MrFarber.


  1. Vladimir Rodionov Thursday, June 30, 2011

    Hadoop is definitely overkill for anything less than 50-100 nodes (for 95% of potential customers). Who needs a fault-tolerant job scheduler or additional data-integrity support (CRC on all file I/O operations) if you have a 10-15 node cluster?
