

Updated @ 1:14 p.m. on June 10, 2009: Looks like a small portion of Amazon Web Services’ Elastic Compute Cloud service went down for an extended period this evening, though the extent of the problem was not immediately clear. Amazon explained on its AWS status web site that it had lost connectivity in a part of its network because of a “localized power issue.” Apparently the cause of the problem was a lightning strike. The impact was confined to a single availability zone in the United States. The problem started at around 6:40 p.m. PST and was fixed roughly three hours later.

This is not the first time Amazon’s web services have taken a nose-dive. Last year in July, the S3 service went offline, causing widespread problems. It also suffered an outage in February 2008. But by and large, Amazon has had a good record with its services as far as uptime is concerned.

That said, one can’t overstate that as more and more companies come to depend on Amazon’s web services, the company has very little room for error. Today’s incident also shows the fragility of the “cloud” in general: outside forces, such as a single lightning strike, can knock out service. These issues have been a common source of aggravation for hosting providers in the past as well.

Update: The company made further clarifications and said that it wasn’t so much an outage as a localized problem. After speaking with me, a spokesperson sent the following emailed statement outlining the minor nature of the problem. I have updated the post to include Amazon’s statement and the latest information.

This type of incident is exactly why EC2 offers Multiple Availability Zones to our developers. When a developer chooses to run an application in Multiple Availability Zones, the application will be fault tolerant against exactly this type of event and will keep performing. Additionally, during the issue last night, any user running in just one zone could also choose to re-launch their degraded instances.

This particular issue last night was not a generalized service issue and was limited to a small percentage of compute instances in one EC2 Availability Zone within the US Region, due to a lightning storm.  There was no impact to other AWS services.

We provide several different mechanisms that enable customers to build fault-tolerant applications that continue to run smoothly in exactly this type of situation. First, they can (and many did last night) immediately launch new instances in the same Availability Zone.

Second, this type of incident is exactly why EC2 offers Availability Zones to our customers.  When a developer chooses to run an application in multiple Availability Zones, the application should keep performing smoothly as the instances in the second zone are unaffected by events occurring in the first.
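The statement above describes the core idea: spread instances across Availability Zones so that a single-zone failure leaves the rest of the application running. A minimal sketch of that placement logic, in plain Python (the zone names and function names here are illustrative assumptions, not part of any AWS API):

```python
# Illustrative sketch of multi-Availability-Zone placement.
# Zone names are examples; the functions are hypothetical helpers,
# not AWS API calls.

def spread_instances(count, zones):
    """Assign `count` instances round-robin across the given zones."""
    return [zones[i % len(zones)] for i in range(count)]

def surviving(placements, failed_zone):
    """Instances unaffected when one zone loses connectivity."""
    return [z for z in placements if z != failed_zone]

# Six instances spread across three zones: two per zone.
placements = spread_instances(6, ["us-east-1a", "us-east-1b", "us-east-1c"])

# If one zone is knocked out (say, by a lightning strike),
# two thirds of the capacity keeps running.
remaining = surviving(placements, "us-east-1a")
```

The design trade-off is the one commenters raise below: cross-zone redundancy costs more, but it turns a zone-level event into a capacity reduction rather than an outage.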

  1. I think it’s irresponsible and alarmist to claim that EC2 “went down.” It was a “set of racks” in a “single availability zone.” The purpose of exposing the ‘availability zone’ concept to developers is to allow them to ensure their own availability even during events such as this.

    The S3 outage broke all use of S3; this was a connectivity loss to a fraction of EC2. The two cannot be compared. Everyone was perfectly able to launch new instances to replace the out-of-commission ones.

    Is an “EC2 is down” GigaOM post to be expected for every AWS status dashboard update? To suggest that EC2 as-a-whole was ‘knocked out’ or in a ‘nose-dive’ is really quite inaccurate.

  2. The trick is to be prepared for an availability zone outage. If your architecture is fault tolerant across availability zones, then you can survive a single lightning strike.

    The problem is what happens if several lightning strikes hit all availability zones. But in this case, they’ve probably deserved it…

  3. Just goes to show you, just because it’s “cloud” doesn’t mean you can ignore the tried and true. Gotta build in redundancy to your systems and not rely on a single provider for everything. Costs more but if you need the uptime, it’s the price you pay. I have a bad feeling a lot of people are going to put a lot of trust into the cloud thinking it’ll be bulletproof which is obviously not the case today.

Lightning striking??? Haven’t they installed something called a lightning conductor, which has been in use since the 1800s?

  5. Outages happen, it’s just a part of IT. As Oren and Ken both said it’s important to build redundancy into your system architecture, and the more critical your systems are the more you need to worry about it. I can say from personal experience that the EC2 (and other AWS) systems have been far more reliable than any other hosted systems that I’ve used, including internally hosted environments, and are much easier to grow.

6. Well it seems S3 is not as foolproof as we all thought! Don’t they have mirror sites in different geographical areas?

7. Even the mighty shall Cloudfail.

8. Cloud computing is such a bad departure from one of the underlying concepts of the web that makes it so important and valuable. Yes, global interconnectivity, but always with equal localization of data. I don’t believe that redundancy is a necessary byproduct of the design when efficient indexing and compression of localized networked files is utilized. The more we centralize and rely on one company’s servers, the less independent and secure our information becomes, regardless of how this simplifies things. Even the name is bad — they ought to call it SNOWBALLING.

  9. Ah yes, if it bleeds, it leads.

    Don’t worry, this will be forgotten in a month when something else goes bump in the night…

  10. The “Cloud” isn’t magic, people. You still have to architect a fault-tolerant system based on the tools that Amazon gives you. Anybody who just fires up an EC2 instance and thinks that their work is done is sadly mistaken.


Comments have been disabled for this post