Seven signs your hair is on fire: The challenges of scaling Hadoop


Every Hadoop implementation encounters the occasional crisis, including moments when the folks running Hadoop feel like their hair is on fire. Sometimes it happens before you get to production, which can cause organizations to throw the Hadoop baby out with the bathwater. Often, these moments occur after the first production launch, which means you have a “success disaster” on your hands (although it will probably feel more like disaster than success).

Implementing and scaling Hadoop is enormously complicated. However, if you learn to recognize problems early, you can prevent your hair (and your Hadoop implementation) from igniting. Here are some signs of danger, along with lessons we’ve learned for heading them off.

Danger sign 1: You never get to production

Moving from proof of concept (POC) to production is a significant step for big data workloads. Scaling Hadoop jobs is fraught with challenges. Sometimes large jobs just won’t finish. A job that ran in testing won’t run at production scale. Data can also be an issue: the POC often uses unrealistically small or uniform datasets.

Before you go into production, perform realistic scale and stress testing. Such testing will exercise the scalability and fault-tolerance of your applications. It will also help you develop a model for capacity planning so you can stay ahead of the curve.

Danger sign 2: You start missing deadlines

Your first application made it into production. Congratulations! Initially, you easily hit your SLAs, but as use of the Hadoop cluster grows, the run times become unpredictable. At first deadlines are missed sporadically, so the problem is ignored. Over time it gets worse, until a crisis emerges.

Don’t wait for a crisis to take action. As comfortable margins start to erode, add capacity or optimize your applications to keep pace. Adjust your capacity-planning model, paying particular attention to worst-case performance, so that it matches what you’re seeing in reality.
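One way to spot an eroding margin before it becomes a crisis is to track the trend in worst-case run times. The following is a minimal sketch in Python, not a prescribed method: it assumes you record the slowest run of a job each day and that the drift is roughly linear. The function name and the sample numbers are hypothetical.

```python
def days_until_sla_breach(worst_runtimes_min, sla_min):
    """Estimate how many days remain before the run-time trend crosses the SLA.

    worst_runtimes_min: worst-case run time (minutes) for each consecutive day.
    sla_min: the deadline, in minutes.
    Returns None when the fitted trend is flat or improving.
    Assumption: a straight-line (least-squares) fit is a fair model of the drift.
    """
    n = len(worst_runtimes_min)
    if n < 2:
        raise ValueError("need at least two days of history")

    # Ordinary least-squares fit of run time against day index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(worst_runtimes_min) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(worst_runtimes_min))
    slope = sxy / sxx
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    # Day index at which the fitted line reaches the SLA, relative to today.
    breach_day = (sla_min - intercept) / slope
    return max(0.0, breach_day - (n - 1))


if __name__ == "__main__":
    # Hypothetical history: the job slows by about 10 minutes per day.
    history = [100, 110, 120, 130]
    print(days_until_sla_breach(history, sla_min=180))
```

A shrinking estimate is the cue to add capacity or schedule optimization work while the deadline is still being met.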

Danger sign 3: You start telling people they can’t keep all that data

Another symptom of impending crisis is shrinking data retention windows. Initially, you hoped to keep 13 months of data for year-over-year analysis. Because of space constraints, you find yourself cutting that number. At some point, you lose the ability to do the type of big data analysis that justified your Hadoop investment to begin with.

Shrinking retention windows are the storage equivalent of missed deadlines. The dynamic is also the same: a margin that initially seems comfortable becomes “just enough” and then “not enough.” Act early. As margins erode, revisit your capacity models to see why your predictions didn’t hold, and adjust to better track what’s happening.
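To make the storage margin concrete, here is a rough back-of-the-envelope model in Python. It is a sketch under stated assumptions: HDFS stores three replicas of each block by default, and the 25 percent reserve for intermediate results and headroom is an illustrative figure, not a recommendation. The function name and cluster sizes are hypothetical.

```python
def retention_days(raw_capacity_tb, daily_ingest_tb,
                   replication=3, reserve_fraction=0.25):
    """Return how many days of data fit on the cluster.

    raw_capacity_tb: total raw disk across all data nodes.
    daily_ingest_tb: new data per day, before replication.
    replication: copies kept of each block (3 is the HDFS default).
    reserve_fraction: share of raw disk held back for scratch space
        and headroom (an assumed figure for illustration).
    """
    if daily_ingest_tb <= 0:
        raise ValueError("daily ingest must be positive")
    usable_tb = raw_capacity_tb * (1 - reserve_fraction) / replication
    return usable_tb / daily_ingest_tb


if __name__ == "__main__":
    thirteen_months = 396  # roughly 13 months, in days

    # Hypothetical cluster: 1,200 TB raw, ingesting 0.5 TB per day.
    print(retention_days(1200, 0.5) >= thirteen_months)

    # Same cluster after daily ingest doubles to 1 TB per day.
    print(retention_days(1200, 1.0) >= thirteen_months)
```

In this example, doubling the ingest rate cuts the window from 600 days to 300, which drops below the 13 months needed for year-over-year analysis. Rerunning a model like this as ingest grows shows the shortfall coming well before anyone has to delete data.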


Danger sign 4: Your data scientists are starved

An over-utilized Hadoop cluster can stifle innovation. There’s not enough compute capacity for data scientists to launch large jobs. There’s insufficient space for them to store large, intermediate results.

Capacity planning routinely omits or underestimates the needs of data scientists. That omission, compounded with inadequate planning for production work, means the needs of data scientists often become marginalized. Be sure your planning includes data scientists’ requirements, and act early when you see signs that capacity is falling short.

Danger sign 5: Your data scientists are reading Stack Overflow

In the early days of your Hadoop implementation, your ops team and data scientists worked hand in hand. If the data scientists ran into problems, the ops team would jump in to help. But as your Hadoop implementation became successful, the stresses of maintaining and growing it consumed your operations team. Your data scientists now troubleshoot Hadoop themselves, often by trawling through questions posted to Stack Overflow.

As Hadoop expands and becomes more mission critical, the effort to maintain it increases. If you want to keep your data scientists focused on data science (and off Stack Overflow), you may have to revisit the size of your Hadoop operations team.

Danger sign 6: It starts getting really, really hot

Your hair might not be on fire, but your data center could be! When servers are provisioned for power, there’s often an assumption they won’t run at capacity. But a large Hadoop job can red-line racks of machines for hours, blowing under-provisioned circuits. (Similar problems arise on the cooling side.) Make sure your Hadoop cluster can run at full power for extended periods of time.

Danger sign 7: You get sticker shock

The number one “success disaster” with Infrastructure-as-a-Service (IaaS) deployments of Hadoop (such as those on AWS) is out-of-control spending. You suddenly find yourself with a bill that is three times last month’s, blowing your budget.

Capacity planning is as important for IaaS-based Hadoop implementations as it is for on-premises ones, though the goal is managing costs rather than managing capacity. But good capacity planning is just the start. If you plan on growing an IaaS-based Hadoop implementation, expect to invest heavily in systems to track and optimize costs, as Netflix has done.

Smooth(er) Hadoop scaling

Hadoop plans typically underestimate the effort required to keep a Hadoop cluster running smoothly. It’s an understandable miscalculation. With classic enterprise applications, the initial implementation effort is orders of magnitude larger than ongoing maintenance and support. People assume Hadoop follows a similar pattern, but it doesn’t. Hadoop gets harder to maintain as it scales, and it requires a lot of work from your ops team.

Good capacity planning is essential to promote sanity. That means not only having a good capacity model, but updating it as it starts to diverge from reality (and it will). Don’t support innovation as an afterthought: provide data scientists with a guaranteed level of support. Adding capacity is not always the answer: managing usage is equally important. Get your users (and the business owners driving them) to plan time to optimize their jobs between bursts of new development. Just a bit of optimization can significantly reduce your ongoing costs.

Raymie Stata is the founder and CEO of Altiscale, a Hadoop-as-a-Service firm. He was previously chief technical officer at Yahoo, where he helped set the company’s open source strategy and initiated its participation in the Apache Hadoop project. Follow him on Twitter @rstata or @altiscale.


Peter Fretty

Unfortunately, this applies to far too many organizations today. As a result, a lot of businesses struggle to achieve the benefits they anticipated, at least at the level they hoped. One overarching reason is the lack of a solid strategy, something that was echoed in a recent SAS survey. Although organizations have the goals in place, they do not spend enough time building the foundation first.

Peter Fretty


Our VP of engineering just recently had an article published on Wired Innovation Insights that takes a closer look at why Hadoop clusters at scale behave much differently than general data center servers. I’d highly recommend the article to anyone interested in this subject.

Avoiding the Pain Threshold: Cluster Complexity Demands Automation

Janos Matyas

At SequenceIQ we had the same problems, and decided to create and open source an SLA-policy-based autoscaling project for Hadoop YARN clusters.

The project supports autoscaling for cloud deployments (AWS, Azure, GCC) using Cloudbreak, our open source, Docker-based, cloud-agnostic Hadoop-as-a-Service API. Once our contributions to Hadoop, YARN and Ambari are released (2.6 and 1.7), it will support ‘static’ Hadoop clusters as well.
