
Summary:

Updated: Today, Amazon’s Web Services have hit some bumps in the road, taking down a variety of popular sites such as Foursquare, Quora and Paper.li. Since clouds do fail, perhaps the best thing to do is provide information and maybe a dollop of humor.


Updated: Today, Amazon’s Web Services, including its Elastic Compute Cloud, relational database and Platform-as-a-Service Elastic Beanstalk offerings, have all hit some bumps in the road, taking down a variety of popular sites such as Foursquare, Quora and Paper.li. Amazon’s status board notes that some problems have been partially resolved in its data center serving the Eastern half of the United States as of 6:30 a.m. PDT, but the services are still struggling.

We’ve discussed at length the difficulty and expense of providing dial-tone-like reliability in the cloud (the so-called five nines). We’ve also noted that as more popular web services rely on cloud platforms, and more people rely on those services, it’s only natural that failures in the data centers belonging to Google, Amazon, Rackspace and others will cause problems for sites, and that a lot of people will notice those problems. The prevalence of instant communications, such as Twitter, only amplifies this.
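For a sense of scale, the arithmetic behind the "nines" is easy to sketch. The snippet below is a back-of-the-envelope calculation rather than anything a cloud provider publishes; the function name and the 365-day year are assumptions made purely for illustration.

```python
# Back-of-the-envelope downtime budget for "N nines" of availability.
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted at the given number of nines."""
    # N nines means an unavailable fraction of 10**-N (five nines -> 0.00001).
    return MINUTES_PER_YEAR / 10 ** nines


for n in (3, 4, 5):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime a year, while three nines allows nearly nine hours.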

However, as my colleague Derrick pointed out recently when Reddit had another issue with Amazon Web Services, a cloud-monitoring and benchmarking organization called CloudHarmony found that the Amazon EC2 data center that is down today has actually experienced zero down time throughout the duration of a recent one-year study.

So if we accept that clouds go down and people will be inconvenienced, then really the only thing to do is communicate well and perhaps add a sense of humor to your message–provided the loss of your service isn’t going to reduce your audience to cursing. Gmail, I’m looking at you! Quora, for instance, provided a cute YouTube video and the following error message, “We’d point fingers, but we wouldn’t be where we are today without EC2.”

Foursquare provided a running display of information on what was happening. Much like the trend in fun 404 pages, the habit of making light of a failure can endear your service to readers. And providing quality information that indicates you know what’s wrong and are working on it is always a plus. However, what is refreshing about this latest outage is that so far the attitude is more resigned to cloud services going down for a bit as opposed to an all out condemnation of cloud services as unreliable. Is that progress?

Update: Amazon Web Services has issued an explanation of the events that caused the outage to the EC2 and Elastic Block Store services:

8:54 AM PDT We’d like to provide additional color on what we’re working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.

  1. “a cloud-monitoring and benchmarking organization called CloudHarmony found that the Amazon EC2 data center that is down today has actually experienced zero down time throughout the duration of a recent one-year study.”

    Problem here is more than just the failure, it is the duration and the extent of such a failure.

    They should be ready to, say, within a couple hours re-distribute the lot while they fix it, as in having redundancy, an “easily” deployable way out of the blackout.

    “so far the attitude is more resigned to cloud services going down for a bit as opposed to an all out condemnation of cloud services as unreliable.”

    Excuse me, but half a day later it is still not sorted at all. It can happen at any time and will catch you off guard, and when such a thing happens you will be hit hard: one down, in this case, means everybody down. How can this be called reliable?

  2. AWS after all is a “cloud” and they have published best practices in the past about DR considerations such as backup servers in a separate EC2 “Region” so DNS can keep your apps/data available.

    Those same considerations you’d have to make if you were running your own DCs but many companies can’t afford building separate geographically dispersed DCs on their own.

    It’s one of the benefits of using a public cloud that not enough people design for.

  3. Even school sites can have problems. You can never use the net for anything critical. Usually school will give an extension, but if you rely on lecture videos for study then you are out of luck.
    I think that to accept cloud, we have to accept “my dog ate my homework” excuses.

  4. Add on to that, GigaOM stopped delivering RSS for over a week. I have to manually find the link to the mobile site.

  5. Reminds me of Nassim Taleb. Five 9s are just that, five 9s. There is always a low probability high impact event waiting to happen.

  6. …and Cloud ain’t even 5 9’s.

  7. For all customers affected by EC2 downtime, I would like to recommend ElasticHosts as an alternative cloud service (www.elastichosts.com) – we offer a 5 day free trial for our cloud servers in US or UK, which is likely enough at least to bridge the gap.

    1. Wow. That’s low. I am sure that a legit company doesn’t spam websites. Shameful.

      1. It’s a free country and sharks are everywhere.

        Customers of AWS will decide if they are happy or not with performance/price.

        Again, my view is that serious cloud architecture in AWS listens to AWS’s advice.

        AWS has different zones for a reason.
        They have different Regions for a reason.

        It’s a ‘cloud’… people need to learn to take advantage of it.

        If AWS N. VA is failing but there has been forethought to a DR site in AWS N. California, recovery and HA can be maximized.

        DR instances may not even have to be in a Running state but in a Stopped state and triggered active when needed and DNS changes are made.

  8. Cloud relies on an internet connection, the quality of which varies from country to country. That’s why it is not popular at present, since it is very inconvenient if the line breaks down. Personally, I won’t rely much on cloud storage/platforms until there is a breakthrough in internet connections in my country.

