How to deal with cloud failure: Live, learn, fix, repeat


Like it or not, sweeping software bugs are just part and parcel of operating the largest computing systems the world has ever seen. On Monday night, Amazon Web Services published a detailed post-mortem of its latest cloud outage, which struck on Friday night as massive thunderstorms knocked out power to one of the company's east coast data centers. The data center's backup-generator failure was just the catalyst, however; the real damage was done by a handful of latent software bugs that surfaced as the system attempted to restore itself.

Although AWS is already working on fixes at all levels, this won't be the last cloud computing outage we see, either from AWS or from its competitors.

On Monday, I spoke with Geoff Arnold, an industry consultant and entrepreneur-in-residence at U.S. Venture Partners whose past includes a tenure as Distinguished Engineer at Sun Microsystems and work building and managing distributed systems for Amazon, Huawei and, most recently, Yahoo. We spoke before AWS released the details of what caused the outage, but he suspected the real issue was more about bugs and less about a power failure. His take: "As we gain more experience [building globally distributed systems], we encounter failure domains that we haven't hit before."

By and large, that’s just the price of doing business in the cloud. It’s a constant cycle of living and learning from your mistakes.

The reality, he said, is that engineers know the various components of their systems are going to fail, and they design around the known fallibilities. But when you're building some of the largest computing systems ever assembled, you're bound to run into problems you haven't planned for or didn't even know existed. Testing against every possible problem across hundreds of thousands of servers and multiple data centers is neither easy nor, in practice, feasible.
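A minimal sketch of what "designing around the known fallibilities" can look like in practice: retrying a transient failure with exponential backoff and jitter. This is an illustration only, not anything from Amazon's systems; `flaky_service`, `call_with_retries` and all parameters here are hypothetical.

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.05):
    """Retry a flaky operation, the standard defense against *known*
    transient failure modes like a dropped network connection."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; escalate the failure
            # Back off exponentially, with jitter so that many clients
            # retrying at once don't hammer the service in lockstep.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Simulate a service that fails twice, then recovers.
calls = {"n": 0}
def flaky_service():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network blip")
    return "ok"

result = call_with_retries(flaky_service)  # succeeds on the third attempt
```

The point of the article, of course, is that this kind of code only covers the failures you anticipated; the latent bugs are in the interactions nobody designed for.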

Even Google, whose state-of-the-art web infrastructure lets it advertise no planned downtime for its cloud services and claim more than 99.9 percent uptime for Gmail, can't test its massive systems to the nth degree. As Google Research Director Peter Norvig explained during a keynote in August 2011, there's a line beyond which it becomes too expensive to keep testing, although where that line falls varies by company. In April, Gmail suffered an outage that left up to 35 million users temporarily without access to their email.

Amazon’s first-mover disadvantage

As for AWS specifically, Arnold thinks it's still among the most resilient cloud platforms around. To some degree, though, it's paying the price for being the biggest and most advanced cloud available, and for having a stable of high-profile customers that make news when they go down. "[I]t's certainly the case that Amazon does have some first-adopter disadvantages here," Arnold said.

Geoff Arnold

For example, he explained, while AWS has had to re-architect various pieces of its platform at relatively high levels of effort and expense, its smaller, often newer, competitors are learning from its mistakes without having to live them firsthand. Additionally, he said, AWS is somewhat limited in its options for high availability because it tries to keep its prices low for basic services such as computing and storage. "You can throw dollars at the problem" and engineer around faults in more expensive ways, he said, but those costs will likely get passed on to customers.

And while the fashionable thing for competitors to do when AWS crashes is to pile on with potshots, Arnold isn't convinced they — especially the open source alternatives — would fare any better operating at Amazon's scale. "Frankly, none of them seems to be as robust, mature and well thought through as Amazon's [cloud]," he said. "I think OpenStack is going to be much less stable than anything Amazon has produced."

A big reason for this is the open source development model that accepts contributions from large ecosystems of developers and that must account for the needs of all the big names attaching their strategies to a given project. “In some sense,” Arnold said, “I think Amazon has an advantage over the open source alternatives because [it] only [has] to answer to one boss — and that’s Jeff Bezos.”

No end in sight

Given the relative youth of cloud-scale systems such as Amazon’s, the logical question is whether they’ll ever evolve to a stage where availability isn’t an issue. Arnold says the answer is “not likely,” but there is some help on the way thanks to software-defined networks.

One of the major problems with cloud platforms is that "they have large numbers of components interacting in ways that were not necessarily designed together," he explained, making it very difficult to predict what will go wrong. For example, virtual servers are designed in isolation from storage systems, which are designed in isolation from load balancers, but they're all sewn together and running over the same network in the end. Arnold thinks an SDN-style top-down network management approach, where everything is provisioned holistically, could help resolve this particular issue, but that's still probably five years away.

When I asked whether there will be a time when we’ve figured out how to handle distributed systems of a particular size, Arnold replied, “I used to think so, but I’m getting more cynical now.”

One reason is that software engineers are always going to devise new ingenious ways to integrate systems together, causing entirely new problems to arise. Another is that the increasingly distributed nature of systems-engineering teams — especially those developing open source code — makes it easier to add bugs into systems and harder to catch them (see, e.g., the leap second bug that made its way back into the Linux kernel earlier this year). “We’ll continue to surprise ourselves by introducing failures,” he said.

Feature image courtesy of Shutterstock user asharkyu.



By pure logic, any physical failure-handling subsystem can itself fail. Therefore, we can never say we have a fully failure-proof system.


Mr. Arnold's comments about OpenStack clearly reveal his lack of understanding of the dynamics. He shouldn't be talking about topics he has no clue about. It doesn't matter that his resume includes colossal failures like Sun Microsystems and Yahoo; he has absolutely no clue how open source projects work. Period.

Ravi Thakur

AWS – which doesn't claim to be a Tier 4 data center, nor has it proven itself to be 100 percent failure-proof – has always advocated that customers design for failure, so it's important to take that advice to heart. To be prepared for a future Amazon outage, it's critical to design your platform to have a backup option, so customers' businesses aren't affected. We were actually one of the first enterprise companies on private beta for Amazon's EC2 cloud, and having followed Amazon's advice, we were prepared and our downtime was just minutes, while everyone else was down for hours.
– Ravi Thakur, VP of Services & Support, Coupa
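The "backup option" Thakur describes can be sketched in a few lines: try a prioritized list of endpoints and fall through to the next when one is down. This is a hypothetical illustration of the design-for-failure idea, not Coupa's or Amazon's actual architecture; `primary`, `backup` and the region name are made up.

```python
def call_with_failover(endpoints, request):
    """Try each endpoint in priority order; an outage at the primary
    simply shifts traffic to the next endpoint in the list."""
    last_error = None
    for endpoint in endpoints:
        try:
            return endpoint(request)
        except ConnectionError as err:
            last_error = err  # this endpoint is down; try the next one
    raise RuntimeError("all endpoints failed") from last_error

# Hypothetical scenario: the primary region is down, the backup is healthy.
def primary(req):
    raise ConnectionError("us-east-1 unavailable")

def backup(req):
    return f"handled {req} in backup region"

response = call_with_failover([primary, backup], "checkout")
```

The hard part in practice isn't this loop; it's keeping the backup's data fresh enough that failing over doesn't lose state.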

Elaad Teuerstein

Here’s Nati Shalom’s take on that Amazon outage:
He uses the blow that Heroku took to make a point about surviving failures.

From the blog: “Failures are inevitable, and often happen when and where we least expect them to. Instead of trying to prevent failure from happening we should design our systems to cope with failure. …
…The question that comes out of this experience IMO is not necessarily how to deal with failures (those lessons are as old as the mainframe or even older), but rather — why are we failing to implement the lessons?”


"As we gain more experience [building globally distributed systems], we encounter failure domains that we haven't hit before." ——- Really? No one ever had a generator or telephone line or circuit or equipment FAILURE!? REALLY?!?!? —— These people are AMATEURS…. Redundancy, redundancy, redundancy, and if you then think it's good enough … DO IT AGAIN!! — It's just going to happen again…

Charles T. Betz

This is one of the most intelligent articles I’ve read on this topic. The cycle of failure is inevitable as we build, stabilize, commoditize, and then conceive of some ambitious next step we hope will give us competitive advantage.

Shaleen Shah

I have to agree with your insights here – and doing business in the cloud won't even happen without electricity… makes you appreciate those who are working hard to provide us with power, 24/7, all year. I wonder if someone can come up with a sustainable solution to manage this risk. Anyway, have a happy 4th and thanks for sharing this post.
