
Summary:

It’s neither easy nor glamorous — data scientists get all the love — but making sure your Hadoop cluster is properly configured and applications are running optimally is necessary, especially as applications move into production. Here are five tools to help you do it.


Before you can get into the fun part of actually processing and analyzing big data with Hadoop, you have to configure, deploy and manage your cluster. It’s neither easy nor glamorous — data scientists get all the love — but it is necessary. Here are five tools (not from commercial distribution providers such as Cloudera or MapR) to help you do it.

Apache Ambari

Apache Ambari is an open source project for monitoring, administration and lifecycle management for Hadoop. It’s also the project that Hortonworks has chosen as the management component for the Hortonworks Data Platform. Ambari works with Hadoop MapReduce, HDFS, HBase, Pig, Hive, HCatalog and ZooKeeper.

Apache Mesos

Apache Mesos is a cluster manager that lets users run multiple Hadoop jobs, or other high-performance applications, on the same cluster at the same time. According to Twitter Open Source Manager Chris Aniszczyk, Mesos “runs on hundreds of production machines and makes it easier to execute jobs that do everything from running services to handling our analytics workload.”

Platform MapReduce

Platform MapReduce is high-performance computing expert Platform Computing’s entry into the big data space. It’s a runtime environment that supports a variety of MapReduce applications and file systems, not just those directly associated with Hadoop, and is tuned for enterprise-class performance and reliability. Platform, now part of IBM, built a respectable business managing clusters for large financial services institutions.

StackIQ Rocks+ Big Data

StackIQ Rocks+ Big Data is a commercial distribution of the Rocks cluster management software that the company has beefed up to also support Apache Hadoop. Rocks+ supports the Apache, Cloudera, Hortonworks and MapR distributions, and handles the entire process from configuring bare metal servers to managing an operational Hadoop cluster.

Zettaset Orchestrator

Zettaset Orchestrator is an end-to-end Hadoop management product that supports multiple Hadoop distributions. Zettaset touts Orchestrator’s UI-based experience and its ability to handle what the company calls MAAPS — management, availability, automation, provisioning and security. At least one large company, Zions Bancorporation, is a Zettaset customer.

If there are more Hadoop management tools floating around, please let me know in the comments.

Feature image courtesy of Shutterstock user .shock.

  1. Apache Ambari isn’t even ready for consumption, and you didn’t even bother checking out Cloudera Manager, which is the most widely used one out there.

    1. Derrick Harris Friday, May 18, 2012

      Yes, but, as I noted, I didn’t include distro-specific tools. If you’re using CDH, I assume you’ve looked at Cloudera Manager, too. Same for MapR.

    2. Not that you’re a bitter Cloudera employee or anything… right?

  2. Prasun Sinha Friday, May 18, 2012

    Ankush, from Impetus Technologies, is another Hadoop Cluster Management tool.

  3. Michael Shaler Saturday, May 19, 2012

    DataStax Enterprise eliminates the Hadoop name node as a single point of failure.

  4. I recommend checking out OceanSync (http://www.oceansync.com) for Hadoop and Data management.

  5. Robert J. Berger Monday, May 21, 2012

    Check out Ironfan, an open source project from Infochimps. It’s a layer on top of Opscode Chef for orchestrating the deployment and lifecycle of clusters. https://github.com/infochimps-labs/ironfan/wiki I’ll be giving a lightning talk on Ironfan at HBaseCon2012 tomorrow (5/22/2012): http://www.hbasecon.com/sessions/lightning-talk-orchestrating-clusters-with-ironfan-and-chef/

  6. Jo Maitland Monday, May 21, 2012

    The name node issue no longer exists if you are running Apache Hadoop 0.23.2 or higher. Fixing the name node single point of failure is no longer a differentiator when you can get it in the open source version.

  7. Hadoop is new and raw. It needs many pieces to make it a complete solution. The skills required to bring all these technologies together are mind-boggling and create a huge burden on IT. There are very few solutions out there that fit the bill. One promising technology (as an alternative to Hadoop) is the HPCC Systems platform, a completely integrated solution – ETL + Data Mining + Data Delivery. The people at LexisNexis have been using this platform for more than 10 years and have built a very successful business around it. So it seems to be battle-tested and enterprise-ready. For more, visit hpccsystems.com

