Amazon's EC2 Service Suffered Some Hiccups

Updated @ 1:14 p.m. on June 10, 2009: It looks like a small portion of Amazon (s amzn) Web Services’ Elastic Compute Cloud service went down for an extended period this evening, though the extent of the problem was not clear. Amazon explained on its AWS status web site that it had lost connectivity in a part of its network because of a “localized power issue.” Apparently the cause of the problem was a lightning strike. The only zone affected was in the United States. The problem started at around 6:40 pm PST and was fixed roughly three hours later.

This is not the first time Amazon’s web services have taken a nose-dive. Last year in July, the S3 service went offline, causing widespread problems. It also suffered an outage in February 2008. But by and large, Amazon has had a good record with its services as far as uptime is concerned.

That said, one can’t underscore enough that as more and more companies start to depend on Amazon’s web services, the company has very little room for error. Today’s incident also shows the fragility of the “cloud” in general: outside forces can cause problems, and a service can be knocked out by a single lightning strike. These issues have been a common source of aggravation for hosting providers in the past as well.

Update: The company made further clarifications and said that it wasn’t so much an outage as a localized problem. After talking to me, a spokesperson emailed the following statement, which outlined the minor nature of the problem. I have updated the post to include Amazon’s statement and the latest information.

This type of incident is exactly why EC2 offers Multiple Availability Zones to our developers. When a developer chooses to run an application in Multiple Availability Zones, the application will be fault tolerant against exactly this type of event and will keep performing. Additionally, during the issue last night, any user running in just one zone could also choose to re-launch their degraded instances.

This particular issue last night was not a generalized service issue and was limited to a small percentage of compute instances in one EC2 Availability Zone within the US Region, due to a lightning storm.  There was no impact to other AWS services.

We provide several different mechanisms that enable customers to build fault-tolerant applications that continue to run smoothly in exactly this type of situation. First, they can (and many did last night) immediately launch new instances in the same Availability Zone.

Second, this type of incident is exactly why EC2 offers Availability Zones to our customers.  When a developer chooses to run an application in multiple Availability Zones, the application should keep performing smoothly as the instances in the second zone are unaffected by events occurring in the first.
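
To make the statement's point about multiple Availability Zones concrete, here is a minimal sketch in Python. It is only a toy simulation, not Amazon's API: the zone labels and the `place_instances` and `surviving_instances` helpers are hypothetical names invented for illustration. It spreads instances across zones in round-robin order and counts how many are left when one zone fails.

```python
from itertools import cycle


def place_instances(count, zones):
    """Assign `count` instances to the given zones in round-robin order."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(count)]


def surviving_instances(placement, failed_zone):
    """Return the placements that are not in the failed zone."""
    return [zone for zone in placement if zone != failed_zone]


# Hypothetical zone labels, used only for illustration.
single_zone = place_instances(4, ["zone-a"])
multi_zone = place_instances(4, ["zone-a", "zone-b"])

# If zone-a loses power, the single-zone deployment has nothing left,
# while the two-zone deployment keeps half of its instances running.
print(len(surviving_instances(single_zone, "zone-a")))  # 0
print(len(surviving_instances(multi_zone, "zone-a")))   # 2
```

In this toy model, an application running in a single zone goes dark when that zone does, while the same application spread over two zones keeps serving from the unaffected one, which is the behavior the statement describes.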