Analyst Report: Scaling Hadoop clusters: the role of cluster management


From Facebook to Johns Hopkins University and from Alcatel-Lucent to Procter & Gamble, organizations are coping with the challenge of processing unprecedented volumes of data.

Open-source tools such as Hadoop have emerged to address this requirement for data analysis, typically scaling in parallel across hundreds or thousands of computers grouped together into clusters.

For the data scientists working with the data, these powerful clusters provide the capabilities needed to manipulate data and extract insight. But for the system administrators laboring away behind the scenes, creating, managing and maintaining such large groups of computers pose serious challenges. Operating systems must be installed, patched and monitored across the entire cluster. Application software must be installed and its dependencies resolved. Data must flow, and the whole edifice must present a single, simple face to end users who have no need or wish to comprehend the complexity beneath the surface.

It’s possible to manage these clusters manually, but the process is labor-intensive and prone to error. So IT managers (and companies like IBM and Dell, which sell the hardware they need) are increasingly turning to cluster-management solutions capable of automating a wide range of tasks associated with cluster creation, management and maintenance.

Note: The views and opinions expressed herein are solely those of Paul Miller and GigaOM Pro. StackIQ underwrote this report, but the final determination of content remained at all times with Paul Miller and GigaOM Pro.

Table of Contents

  1. Summary
  2. The challenge of scale
  3. An introduction to Hadoop
  4. Ensuring efficient infrastructure
    1. Scale up or scale out?
    2. Making clusters fit for purpose
  5. The role of cluster management
    1. Deployment
    2. Operation
    3. Maintenance
    4. Rocks
  6. Alternatives to Rocks
    1. Chef and Puppet
    2. Platform
    3. Dell Crowbar
    4. Apache Ambari
  7. Integrating cluster management with Hadoop
  8. Key takeaways
    1. Business considerations
  9. About Paul Miller
  10. About GigaOM Pro
