Scaling Hadoop clusters: the role of cluster management

1 Summary

From Facebook to Johns Hopkins University and from Alcatel-Lucent to Procter & Gamble, organizations are coping with the challenge of processing unprecedented volumes of data.

Open-source tools such as Hadoop have emerged to meet this demand for large-scale data analysis, typically scaling in parallel across hundreds or thousands of computers grouped into clusters.

For the data scientists working with the data, these powerful clusters provide the capabilities needed to manipulate data and extract insight. But for the system administrators laboring away behind the scenes, creating, managing and maintaining such large groups of computers pose serious challenges. Operating systems must be installed, patched and monitored across the entire cluster. Application software must be installed and its dependencies resolved. Data must flow, and the whole edifice must present a single — and simple — face to end users with no need or wish to comprehend the complexity beneath the surface.

It’s possible to manage these clusters manually, but the process is labor-intensive and prone to error. So IT managers (and companies such as IBM and Dell, which sell them the hardware) are increasingly turning to cluster-management solutions capable of automating a wide range of tasks associated with cluster creation, management and maintenance.
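To make the scale of the manual approach concrete, the loop below is a minimal sketch of the kind of per-node routine an administrator would otherwise repeat by hand across hundreds of machines. The hostnames and the patch command are illustrative assumptions, not taken from any specific deployment; the script only prints the commands it would issue rather than connecting to real hosts.

```shell
#!/bin/sh
# Hypothetical node list; a real cluster might have hundreds or
# thousands of entries, typically read from an inventory file.
HOSTS="node001 node002 node003"

for host in $HOSTS; do
  # In a live cluster this line would actually run, e.g.:
  #   ssh "$host" 'sudo yum -y update'
  # Here we only echo the command to show the repetitive pattern
  # that cluster-management tools automate and track.
  echo "ssh $host 'sudo yum -y update'"
done
```

Multiplied across every task mentioned above (OS patching, software installation, monitoring configuration), this per-node repetition is exactly what automated cluster-management solutions replace with a single declarative action.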

Note: The views and opinions expressed herein are solely those of Paul Miller and GigaOM Pro. StackIQ underwrote this report, but the final determination of content remained at all times with Paul Miller and GigaOM Pro.