Every Hadoop implementation encounters the occasional crisis, including moments when the folks running Hadoop feel like their hair is on fire. Sometimes it happens before you get to production, which can cause organizations to throw the Hadoop baby out with the bathwater. Often, these moments occur after the first production launch, which means you have a “success disaster” on your hands (although it will probably feel more like disaster than success).
Implementing and scaling Hadoop is enormously complicated. However, if you learn to recognize problems early, you can prevent your hair (and your Hadoop implementation) from igniting. Here are some signs of danger, along with lessons we’ve learned for heading them off.
Danger sign 1: You never get to production
Moving from proof of concept (POC) to production is a significant step for big data workloads. Scaling Hadoop jobs is fraught with challenges. Sometimes large jobs just won’t finish. A job that ran in testing won’t run at production scale. Data can also be an issue: the POC often uses unrealistically small or uniform datasets.
Before you go into production, perform realistic scale and stress testing. Such testing will exercise the scalability and fault tolerance of your applications. It will also help you develop a model for capacity planning so you can stay ahead of the curve.
Danger sign 2: You start missing deadlines
Your first application made it into production. Congratulations! Initially, you easily hit your SLAs, but as use of the Hadoop cluster grows, the run times become unpredictable. At first deadlines are missed sporadically, so the problem is ignored. Over time it gets worse, until a crisis emerges.
Don’t wait for a crisis to take action. As comfortable margins start to erode, add capacity or optimize your applications to keep pace. Adjust your capacity-planning model, paying particular attention to worst-case performance, so that it matches what you’re seeing in reality.
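One way to make an eroding margin visible is to track it against worst-case run times rather than averages. The sketch below is illustrative only: the function names, the 120-minute deadline, the sample run times, and the 20 percent floor are all assumptions, not figures from any real cluster.

```python
def sla_margin(run_minutes, deadline_minutes):
    """Fraction of the deadline left after the worst (longest) observed run."""
    worst_case = max(run_minutes)
    return (deadline_minutes - worst_case) / deadline_minutes


def margin_is_eroding(run_minutes, deadline_minutes, min_margin=0.2):
    """True when the worst-case margin has dropped below the chosen floor."""
    return sla_margin(run_minutes, deadline_minutes) < min_margin


# Hypothetical nightly job with a 120-minute deadline.
early_runs = [60, 70, 90]     # worst case leaves a 25% margin
recent_runs = [85, 95, 110]   # worst case leaves roughly an 8% margin

print(margin_is_eroding(early_runs, 120))   # False
print(margin_is_eroding(recent_runs, 120))  # True
```

Using the maximum instead of the mean is deliberate: a job that usually finishes early but occasionally runs long is exactly the one that starts missing deadlines sporadically.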
Danger sign 3: You start telling people they can’t keep all that data
Another symptom of impending crisis is shrinking data retention windows. Initially, you hoped to keep 13 months of data for year-over-year analysis. Because of space constraints, you find yourself cutting that number. At some point, you lose the ability to do the type of big data analysis that justified your Hadoop investment to begin with.
Shrinking retention windows are the storage equivalent of missed deadlines. The dynamic is also the same: a margin that initially seems comfortable becomes “just enough” and then “not enough.” Act early. As margins erode, revisit your capacity models to see why your predictions didn’t hold, and adjust to better track what’s happening.
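The same kind of check works for storage. As a rough sketch, the retention window a cluster can support follows from raw capacity, daily ingest, and the replication factor. The numbers here are hypothetical, the 25 percent headroom is an assumption, and the model ignores compression and intermediate job output; the replication factor of 3 is the HDFS default.

```python
def retention_days(raw_capacity_tb, daily_ingest_tb, replication=3, headroom=0.25):
    """Whole days of data that fit, keeping a share of raw capacity free."""
    usable_tb = raw_capacity_tb * (1 - headroom)
    stored_per_day_tb = daily_ingest_tb * replication
    return int(usable_tb // stored_per_day_tb)


TARGET_DAYS = 396  # roughly 13 months, for year-over-year analysis

# Hypothetical cluster with 1,200 TB of raw disk.
print(retention_days(1200, 0.5))  # 600
print(retention_days(1200, 1.0))  # 300

print(retention_days(1200, 1.0) >= TARGET_DAYS)  # False
```

In this example, a doubling of daily ingest is enough to push a comfortable 600-day window below the 13-month target.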
Danger sign 4: Your data scientists are starved
An over-utilized Hadoop cluster can stifle innovation. There’s not enough compute capacity for data scientists to launch large jobs. There’s insufficient space for them to store large, intermediate results.
Capacity planning routinely omits or underestimates the needs of data scientists. That omission, compounded by inadequate planning for production work, means the needs of data scientists often become marginalized. Be sure your planning includes data scientists’ requirements, and act early when you see signs that capacity is falling short.
Danger sign 5: Your data scientists are reading Stack Overflow
In the early days of your Hadoop implementation, your ops team and data scientists worked hand in hand. If the data scientists ran into problems, the ops team would jump in to help. But as your Hadoop implementation became successful, the stresses of maintaining and growing it consumed your operations team. Your data scientists now troubleshoot Hadoop themselves, often by trawling through questions posted to Stack Overflow.
As Hadoop expands and becomes more mission-critical, the effort to maintain it increases. If you want to keep your data scientists focused on data science (and off Stack Overflow), you may have to revisit the size of your Hadoop operations team.
Danger sign 6: It starts getting really, really hot
Your hair might not be on fire, but your data center could be! When servers are provisioned for power, there’s often an assumption they won’t run at capacity. But a large Hadoop job can red-line racks of machines for hours, blowing under-provisioned circuits. (Similar problems arise on the cooling side.) Make sure your Hadoop cluster can run at full power for extended periods of time.
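A back-of-the-envelope check for the power side might look like the following. Every figure is hypothetical, and the 80 percent continuous-load derating is an assumption borrowed from common electrical practice; your data center’s actual limits may differ.

```python
def rack_fits_circuit(servers, watts_per_server, circuit_watts, derate=0.8):
    """True if the rack's total draw stays within the derated circuit capacity."""
    return servers * watts_per_server <= circuit_watts * derate


# Hypothetical rack: 20 servers on a circuit rated for 8,000 W.
print(rack_fits_circuit(20, 250, 8000))  # True: sized for typical draw
print(rack_fits_circuit(20, 350, 8000))  # False: every server at peak draw
```

The rack passes when sized for typical draw but fails once every server runs flat out, which is the condition a large Hadoop job can sustain for hours.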
Danger sign 7: You get sticker shock
The number one “success disaster” with Infrastructure-as-a-Service (IaaS) deployments of Hadoop (such as those on AWS) is out-of-control spending. You suddenly find yourself with a bill that is three times last month’s, blowing your budget.
Capacity planning is as important for IaaS-based Hadoop implementations as it is for on-premises ones, not for managing capacity but for managing costs. But good capacity planning is just the start. If you plan on growing an IaaS-based Hadoop implementation, expect to invest heavily in systems to track and optimize costs, as Netflix has done.
Smooth(er) Hadoop scaling
Hadoop plans typically underestimate the effort required to keep a Hadoop cluster running smoothly. It’s an understandable miscalculation. With classic enterprise applications, the initial implementation effort is orders of magnitude larger than ongoing maintenance and support. People assume Hadoop follows a similar pattern, but it doesn’t. Hadoop gets harder to maintain as it scales, and it requires a lot of work from your ops team.
Good capacity planning is essential to promote sanity. That means not only having a good capacity model, but updating it as it starts to diverge from reality (and it will). Don’t support innovation as an afterthought: provide data scientists with a guaranteed level of support. Adding capacity is not always the answer: managing usage is equally important. Get your users (and the business owners driving them) to plan time to optimize their jobs between bursts of new development. Just a bit of optimization can significantly reduce your ongoing costs.