
What Amazon and Its Customers Must Learn From Last Week’s Outage


A lot has already been written, here on GigaOM and elsewhere, about the fault that knocked out part of Amazon’s Northern Virginia data center last week. Opinion is divided: One side blames Amazon for its technical failings; the other holds the cloud giant’s customers responsible for their bad judgment. But the question of where the primary responsibility lies is less black and white than either camp suggests. Let’s examine each side.

Today, Amazon released the company’s initial assessment of what went wrong. An error in a change to the network configuration inside the Northern Virginia data center routed large volumes of primary network traffic onto the lower-capacity network reserved for Amazon’s Elastic Block Store (EBS). Under the unexpectedly high load, volumes within a single Availability Zone lost their connections to the network, to one another and, most importantly, to their redundant backup instances, or replicas. When network connectivity was restored, all of the affected volumes simultaneously attempted to use it to locate and resynchronize with their replicas. The resulting overload even affected EBS users in other Availability Zones, something that was not supposed to happen.
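To see why simultaneous resynchronization is so damaging, consider a minimal simulation. This is only a sketch under stated assumptions, not a model of Amazon's actual EBS logic: the function names, the 1,000 volumes, the delays and the 100 ms time slots are all hypothetical. It compares volumes that all retry on the same fixed schedule with volumes that wait a random, growing delay (exponential backoff with jitter, a common mitigation that Amazon's assessment as summarized here does not mention).

```python
import random
from collections import Counter

def retry_load(num_volumes, attempts, base_delay_ms=1000, slot_ms=100,
               jitter=False, seed=0):
    """Count retry attempts per time slot (hypothetical model, not EBS code)."""
    rng = random.Random(seed)
    load = Counter()
    for _ in range(num_volumes):
        t = 0.0
        for attempt in range(attempts):
            if jitter:
                # Random wait up to an exponentially growing cap.
                t += rng.uniform(0, base_delay_ms * (2 ** attempt))
            else:
                # Every volume waits exactly the same fixed delay.
                t += base_delay_ms
            load[int(t // slot_ms)] += 1
    return load

def peak(load):
    """Largest number of attempts landing in any single time slot."""
    return max(load.values())

synchronized = retry_load(1000, attempts=5)
spread_out = retry_load(1000, attempts=5, jitter=True)
print("peak attempts per slot, synchronized:", peak(synchronized))
print("peak attempts per slot, with jitter: ", peak(spread_out))
```

In the synchronized case every volume hits the network in the same slot, so the peak equals the number of volumes. With jitter, the same total number of attempts is spread over many slots and the peak drops sharply.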

As Derrick Harris noted on Friday,

“If we think about the AWS network as a highway system, the effects of the outage were like those of a traffic accident. Not only did it result in a standstill on that road, but it also backed up traffic on the onramps and slowed down traffic on other roads as drivers looked for alternate routes. The accident is contained to one road, but the effects are felt on nearby and connected roads, too.”

In that respect, there doesn’t seem to be much room for doubt that Amazon is at least partly responsible, and that the failure should never have cascaded as far or as long as it did.

But we’re also seeing reports from Amazon customers who managed to operate relatively unscathed throughout the problem period. Customers such as Twilio, Bizo, Mashery and Engine Yard designed their systems with an understanding of both the power and the limitations of a commodity service like Amazon’s. As InfoWorld columnist David Linthicum notes, “You have to plan and create architectures that can work around the loss of major components to protect your own services.” Foursquare, Reddit, Quora and hundreds more suffered greatly because of failings in Amazon’s data center. Might their suffering have been lessened if they’d planned ahead in a similar way to Engine Yard or Twilio?
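What "working around the loss of major components" might look like can be sketched in a few lines. This is an illustration only, and not a description of how Twilio, Engine Yard or any other customer actually built its systems: the zone names and the health-check function are hypothetical. The idea is to keep an ordered list of places a service can run, including one outside the affected region, and to pick the first that responds.

```python
def first_healthy(zones, is_healthy):
    """Return the first zone that passes the health check.

    `zones` is an ordered list of hypothetical zone names; `is_healthy`
    is a caller-supplied function standing in for a real health probe.
    """
    for zone in zones:
        if is_healthy(zone):
            return zone
    raise RuntimeError("no healthy zone available")

# Hypothetical preference order: two zones in one region, then a second region.
ZONES = ["region-1-zone-a", "region-1-zone-b", "region-2-zone-a"]

# Simulate an outage that takes out every zone in region 1.
down = {"region-1-zone-a", "region-1-zone-b"}
print(first_healthy(ZONES, lambda zone: zone not in down))
```

Because last week's overload spilled into other Availability Zones, the fallback list in this sketch deliberately includes a zone in a different region. That extra redundancy costs more, which is exactly the trade-off customers have to weigh.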

Still, whatever mistakes Amazon’s customers may have made, and however they pinched pennies to cut cloud services down to the cheapest, least fault-tolerant configuration they could get away with, the initial fault must lie with Amazon. Even poorly architected customer systems would not have been pushed to failure if Amazon’s underlying infrastructure had continued to perform as expected. Maybe some long-term good can come from short-term pain and embarrassment.

For more thoughts on Amazon’s recent outage, and the lessons that Amazon and its customers must learn, see my latest Weekly Update for GigaOM Pro (subscription required).

Image courtesy of flickr user hans.gerwitz

2 Responses to “What Amazon and Its Customers Must Learn From Last Week’s Outage”

  1. Vanderdecken

    What happened here is very similar to something that happened in the early days of email servers.

    The great vacation reply flood.

    Two emailers went on vacation at about the same time, each setting an automatic “out of office” vacation reply. When their final emails arrived at each other’s servers, each server sent back an out-of-office reply, which in turn generated an out-of-office reply to that message, and so on.

    My point here is that the rush to newer technologies and solutions often comes at the expense of experience. Every new kid programmer on the block believes they have the perfect solution to the problem. The reality is that they often do not know the history of real-world problems, and they tend to shuffle experienced programmers to the door.

    Amazon suffered from too little thinking and too much bravado in a “what could possibly go wrong” scenario.

    It will happen again unless Amazon rethinks their strategies to include the past.

    Next it will be Redmond’s Cloud crashing hard and I mean down for a month with data losses, from their reliance on “It can’t happen here” style of thinking.

    Bigger is not always better. Centralization of data is not always the best plan. Cheaper is not always faster, nor is faster always cheaper.