If Amazon’s Cloud Fails, Just Keep Smiling

Updated: Today, Amazon Web Services offerings, including its Elastic Compute Cloud (EC2), Relational Database Service and its Platform-as-a-Service Elastic Beanstalk, have all hit some bumps in the road, taking down a variety of popular sites such as Foursquare, Quora and Paper.li. Amazon’s status board notes that some problems in its data center serving the Eastern half of the United States had been partially resolved as of 6:30 a.m. PDT, but the services are still struggling.

We’ve discussed at length the difficulty and expense of providing dial-tone-like reliability in the cloud (the so-called five nines). As more popular web services rely on cloud platforms, and more people rely on those services, it’s only natural that failures in the data centers belonging to Google (s goog), Amazon (s amzn), Rackspace (s rax) and others will cause problems for sites, and that a lot of people will notice those problems. The prevalence of instant communication channels such as Twitter only amplifies this.
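For a sense of just how demanding "five nines" is, here's a quick back-of-the-envelope sketch (it assumes a simplified 365-day year):

```python
# Allowed annual downtime at a given availability target.
# Assumes a 365-day year for simplicity.

def allowed_downtime_minutes(availability: float, days_per_year: int = 365) -> float:
    """Return the minutes per year a service may be down at this availability."""
    minutes_per_year = days_per_year * 24 * 60
    return (1 - availability) * minutes_per_year

# "Five nines" (99.999%) leaves only about 5.26 minutes of downtime per year,
# while "three nines" (99.9%) still allows roughly 8.76 hours.
print(round(allowed_downtime_minutes(0.99999), 2))       # minutes at five nines
print(round(allowed_downtime_minutes(0.999) / 60, 2))    # hours at three nines
```

An outage lasting a few hours, like today's, would blow a five-nines budget many times over on its own.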

However, as my colleague Derrick pointed out recently when Reddit had another issue with Amazon Web Services, CloudHarmony, a cloud-monitoring and benchmarking organization, found that the Amazon EC2 data center that went down today had actually experienced zero downtime over a recent one-year study.

So if we accept that clouds go down and people will be inconvenienced, then really the only thing to do is communicate well and perhaps add a sense of humor to your message, provided the loss of your service isn’t going to reduce your audience to cursing. Gmail, I’m looking at you! Quora, for instance, provided a cute YouTube video and the following error message: “We’d point fingers, but we wouldn’t be where we are today without EC2.”

Foursquare provided a running display of information on what was happening. Much like the trend toward fun 404 pages, the habit of making light of a failure can endear your service to readers. And providing quality information that shows you know what’s wrong and are working on it is always a plus. What is refreshing about this latest outage, however, is that so far the attitude has been one of resignation to cloud services going down for a bit, as opposed to an all-out condemnation of cloud services as unreliable. Is that progress?

Update: Amazon Web Services has issued an explanation of the events that caused the outage affecting the EC2 and Elastic Block Store (EBS) services:

8:54 AM PDT We’d like to provide additional color on what we’re working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.