6 Comments

Summary:

A big AWS outage late Thursday, which took down some sites for hours, shows the risk of putting too many workloads in one data center — and that is not a cloud-specific problem. Experts advise better planning of workloads to run across Amazon regions.


A significant Amazon Web Services outage, which took down popular sites including Heroku for hours late Thursday, shows the risk of putting too many workloads in one data center. While this outage occurred in Amazon’s cloud, it wasn’t just a cloud-specific problem. It shows that building in redundancy is critical — whether your app runs in your own data center or in someone else’s cloud.

In short, AWS users should make sure their workloads run across AWS regions to prevent future snafus.

As Om reported earlier, Amazon attributed the failure to a power outage affecting its US East data center in Virginia. That makes sense: US East is Amazon’s oldest and biggest data center. It suffered another major outage in April 2011 and was also beset by performance problems as Amazon rebooted thousands of EC2 instances in December.

Amazon is notoriously tight-lipped about its data centers, but in March Accenture analyst Huan Liu used his own techniques to come up with pretty impressive stats about the inner workings of AWS, including the relative size of its data centers. US East is the largest by far (see Huan Liu’s chart below).

This outage, which first showed up on the AWS service dashboard at 8:50 p.m. Pacific time and was declared resolved at 3:26 a.m. Pacific time, drew lots of headlines and posturing — cloud competitors leapt onto Twitter to state that their sites were up and running — but cloud experts warned against overreaction. (AWS VP and CTO Werner Vogels will speak at next week’s GigaOM Structure conference.)

This is a tempest in a teapot, said Carl Brooks, analyst at Tier1 Research. “AWS outages are still magnified out of proportion to their severity. It doesn’t help their credibility with the paleoconservative enterprise paranoid who will use this as an excuse to buy more absurdly overpriced IT from the usual suspects.”

In other words, take a deep breath. And make sure you design your AWS workloads to run across geographies.
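The advice above boils down to never depending on a single region. A minimal sketch of that idea in Python — the region names are real AWS regions, but the health-check hook is a hypothetical stand-in for whatever monitoring you actually run (say, an HTTP ping against a status endpoint); real deployments would typically wire this into DNS failover rather than application code:

```python
# Regions to try, in order of preference. us-east-1 — the Virginia
# data center that failed here — comes first under normal conditions.
PREFERRED_REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]

def pick_region(is_healthy, regions=PREFERRED_REGIONS):
    """Return the first region whose health check passes, or None.

    `is_healthy` is a caller-supplied callable (a placeholder for a
    real health check) taking a region name and returning a bool.
    """
    for region in regions:
        if is_healthy(region):
            return region
    return None

# Simulate Thursday night's outage: us-east-1 is down, so traffic
# should fall through to the next healthy region on the list.
down = {"us-east-1"}
active = pick_region(lambda r: r not in down)  # -> "us-west-2"
```

The point is not the ten lines of code but the prerequisite they imply: your data and application images must already exist in the secondary regions before the failover logic has anywhere to send traffic.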

  1. First, if you design to Amazon’s best practices, you would not have been impacted by any of these service interruptions. Putting all of your services in a single availability zone, let alone a single region, is something they highly recommend against. Second, corporate datacenter operators can start to criticize when they provide anything close to the level of scalability and reliability at the price point AWS offers.

  2. I remember supporting offices that could have their in-house business-critical DB down for days on end “because shush, the experts are working,” but they’d crawl up my culo if hosted mail went away for 2 hours on a Sunday. This isn’t much different. Manage expectations, people.

  3. Reblogged this on vijaya prasad and commented:
    Amazon outage: wrinkles in early cloud technology or something that happens to even the best services?

  4. Public cloud outages remind us of the limitations of public cloud. Public clouds are not always secure, elastic or cheap over the long term, and these very public ‘glitches’ underscore the fact that private cloud is best – in terms of cost, security, scalability and innovation – every time. We just wrote a blog post about this too; if anyone is interested, you can check it out at http://www.pistoncloud.com/2012/06/aws-outage-public-vs-private-cloud/

  5. We at cloudHQ have a motto: don’t put all your eggs in one basket.

    The problem is not the cloud; the problem is that people rely on a single provider for their mission-critical data.

    That is the reason we built cloudHQ.net: it is basically a synchronization and replication service — we replicate data between different cloud services.
    For example, if you use Evernote, it is prudent to have a backup of all your notes in your Dropbox as well.
    Or if you use Basecamp, why not have all your Basecamp projects in Dropbox or maybe Google Drive too?

    Anyway please check out http://cloudHQ.net/dropbox

  6. I see there are a lot of AWS defenders in the comments. While Amazon does offer failover capabilities, it is not automatic. The fact that you have to think about it at all is sort of weird, considering that one of the selling points of the cloud is that you shouldn’t have to worry about the actual infrastructure — that should be Amazon’s job to worry about for you.

    One thing that I haven’t seen discussed: how did a power outage bring down their datacenter at all? Where were the UPSes? Where were the generators? Where were the multiple redundant circuits? This brings up a big question of what Amazon’s datacenter actually looks like, and from a seven-hour outage due to power loss, it isn’t looking good! I wonder how Azure datacenters compare?

    Right now we’re using a private cloud infrastructure, and we want to invest in a public cloud for burstability, but I wouldn’t want to spend money on a cloud datacenter that doesn’t have backup power. lol.


Comments have been disabled for this post