Amazon's EC2 Service Suffered Some Hiccups


Updated @ 1:14 p.m. on June 10, 2009: Looks like a small portion of Amazon Web Services’ Elastic Compute Cloud service went down for an extended period this evening, though the extent of the problem was not immediately clear. On its AWS status web site, Amazon explained that it had lost connectivity in part of its network because of a “localized power issue.” Apparently the cause was a lightning strike. Only one availability zone, in the U.S. region, appeared to be affected. The problem started at around 6:40 p.m. Pacific time and was fixed roughly three hours later.

This is not the first time Amazon’s web services have taken a nose-dive. Last July, the S3 service went offline, causing widespread problems; it also suffered an outage in February 2008. But by and large, Amazon has had a good uptime record with its services.

That said, one can’t overstate that as more and more companies come to depend on Amazon’s web services, the company has very little room for error. Today’s incident also shows the fragility of the “cloud” in general: outside forces, such as a single lightning strike, can knock out service. These issues have been a common source of aggravation for hosting providers in the past as well.

Update: The company made further clarifications and said that it wasn’t so much an outage as a localized problem. After speaking with me, a spokesperson sent the following emailed statement outlining the minor nature of the problem. I have updated the post to include Amazon’s statement and the latest information.

This type of incident is exactly why EC2 offers Multiple Availability Zones to our developers. When a developer chooses to run an application in Multiple Availability Zones, the application will be fault tolerant against exactly this type of event and will keep performing. Additionally, during the issue last night, any user running in just one zone could also choose to re-launch their degraded instances.

This particular issue last night was not a generalized service issue and was limited to a small percentage of compute instances in one EC2 Availability Zone within the US Region, due to a lightning storm.  There was no impact to other AWS services.

We provide several different mechanisms that enable customers to build fault-tolerant applications that continue to run smoothly in exactly this type of situation. First, they can (and many did last night) immediately launch new instances in the same Availability Zone.

Second, this type of incident is exactly why EC2 offers Availability Zones to our customers.  When a developer chooses to run an application in multiple Availability Zones, the application should keep performing smoothly as the instances in the second zone are unaffected by events occurring in the first.
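The practice Amazon describes above, spreading instances across Availability Zones so that the loss of one zone doesn’t take down the application, can be sketched as a small placement helper. The zone names and the boto3 call mentioned in the comment are assumptions for illustration; the round-robin placement logic is the part the statement actually describes.

```python
# Sketch: distribute N instances round-robin across Availability Zones,
# so losing any single zone leaves most of the application running.
# Zone names here are hypothetical examples; real zone lists come from
# the EC2 API (e.g. DescribeAvailabilityZones).

def plan_placement(instance_count, zones):
    """Return a {zone: count} plan spreading instances across zones."""
    if not zones:
        raise ValueError("need at least one Availability Zone")
    plan = {zone: 0 for zone in zones}
    for i in range(instance_count):
        plan[zones[i % len(zones)]] += 1
    return plan

if __name__ == "__main__":
    zones = ["us-east-1a", "us-east-1b", "us-east-1c"]  # hypothetical
    print(plan_placement(7, zones))
    # Each zone's batch would then be launched separately, e.g. with
    # boto3: ec2.run_instances(..., Placement={"AvailabilityZone": z})
```

With seven instances and three zones, no single zone holds more than three instances, so a zone-level event like last night’s leaves a majority of capacity intact.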

12 Comments

Om Malik

Guys

We have updated the post to reflect what happened. It happened late at night, and we couldn’t get more information firsthand. Anyway, AWS has outlined how small the problem was, and I have made changes to reflect that.

David,

Last time, the S3 outage was initially seen as marginal but became a bigger issue. So I went with the early reports, based on tips I was receiving from Amazon customers. Just to be clear, being alarmist is not my nature, nor how I write my posts.

Kord Campbell

I manage about 100 servers on Amazon EC2, and lost one in this latest outage. We failed over pretty quickly to a backup server. We run several such backups, and will even fail out of the cloud to our managed POPs if required.

As long as the instances you run are tied to a single point of failure – the bare metal – stuff like this is bound to occur. I didn’t have much better luck running my own boxes in the past, but the difference is now I can spin up a new instance to take its place in a matter of minutes, not hours or days. This is exactly what I did yesterday.
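The failover this commenter describes, detecting a dead instance and cutting over to a standby, can be sketched as a simple health-check routine. The `is_healthy` and `promote` callables below are hypothetical stand-ins: in practice the check might be an HTTP ping and the promotion a DNS or load-balancer update.

```python
# Sketch of the failover pattern described above: check the primary,
# and promote the first healthy standby if the primary is down.
# `is_healthy` and `promote` are hypothetical stand-ins supplied by
# the caller; this function only implements the selection logic.

def failover(primary, standbys, is_healthy, promote):
    """Return the active server, promoting a standby if needed."""
    if is_healthy(primary):
        return primary
    for candidate in standbys:
        if is_healthy(candidate):
            promote(candidate)  # e.g. repoint DNS or a load balancer
            return candidate
    raise RuntimeError("no healthy server available")
```

Run from cron or a monitoring loop, a few lines like this turn a lost instance into minutes of degraded capacity rather than hours of downtime, which matches the experience described above.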

@dirky and other naysayers – clearly you haven’t managed a significant amount of metal in the past, and dealt with the nightmare lurking within that responsibility. You’re making a blanket accusation based on ignorance and unfamiliarity with the underlying technologies. Wielded correctly, it’s a cost saver AND a lifesaver.

I’m 100% satisfied with Amazon’s service and performance, and it’s getting better every day.

Scott Trotter

The “Cloud” isn’t magic, people. You still have to architect a fault-tolerant system based on the tools that Amazon gives you. Anybody who just fires up an EC2 instance and thinks that their work is done is sadly mistaken.

Doug Mohney

Ah yes, if it bleeds, it leads.

Don’t worry, this will be forgotten in a month when something else goes bump in the night…

dirky

Cloud computing is such a bad departure from one of the underlying concepts of the web that makes it so important and valuable. Yes, global interconnectivity, but always with equal localization of data. I don’t believe that redundancy is a necessary byproduct of the design when efficient indexing and compression of localized networked files is utilized. The more we centralize and rely on one company’s servers, the less independent and secure our information becomes, regardless of how this simplifies things. Even the name is bad; they ought to call it SNOWBALLING.

David Robins

Well, it seems S3 is not as foolproof as we all thought! Don’t they have mirror sites in different geographical areas?

Anthony Eden

Outages happen, it’s just a part of IT. As Oren and Ken both said it’s important to build redundancy into your system architecture, and the more critical your systems are the more you need to worry about it. I can say from personal experience that the EC2 (and other AWS) systems have been far more reliable than any other hosted systems that I’ve used, including internally hosted environments, and are much easier to grow.

gp

Lightning striking??? Haven’t they installed something called a lightning conductor, which has been in use since the 1800s?

Ken

Just goes to show you: just because it’s “cloud” doesn’t mean you can ignore the tried and true. Gotta build redundancy into your systems and not rely on a single provider for everything. It costs more, but if you need the uptime, it’s the price you pay. I have a bad feeling a lot of people are going to put a lot of trust in the cloud thinking it’ll be bulletproof, which is obviously not the case today.

Oren

The trick is to be prepared for an availability zone outage. If your architecture is fault tolerant across availability zones, then you can survive a single lightning strike.

The problem is what happens if lightning strikes hit all availability zones at once. But in that case, they probably deserved it…

David

I think it’s irresponsible and alarmist to claim that EC2 “went down.” It was a “set of racks” in a “single availability zone.” The purpose of exposing the ‘availability zone’ concept to developers is to allow them to ensure their own availability even during events such as this.

The S3 outage broke all use of S3; this was a connectivity loss to a fraction of EC2. The two cannot be compared. Everyone was perfectly able to launch new instances to replace the out-of-commission ones.

Is an “EC2 is down” GigaOM post to be expected for every AWS status dashboard update? To suggest that EC2 as-a-whole was ‘knocked out’ or in a ‘nose-dive’ is really quite inaccurate.
