


Massive thunderstorms notwithstanding, the fact that Amazon’s U.S. East data center went down again Friday night while other cloud services hosted in the same area kept running raises fresh questions about whether Amazon is suffering architectural glitches that go beyond acts of God. While most Amazon services were back up Saturday morning, the company was still working on provisioning the backlog for its ELB load balancers as of 5:31 p.m. Eastern time, according to the AWS dashboard.

This outage, the second this month, took down Netflix, Instagram, Pinterest, and Heroku, as Om previously reported. The storm was undoubtedly huge, leaving 1.3 million in the Washington, D.C., area without power as of Saturday afternoon, but Joyent, an Amazon rival, also hosts cloud services from an Ashburn, Virg. data center and experienced no outage, something its marketing people were quick to point out.

The implication is that Amazon, with all its talk about redundancy and availability, shouldn’t be having these issues if others are not.

Steve Zivanic, VP of marketing for Nirvanix, another rival cloud provider, said customers should simply stop defaulting to Amazon’s cloud.  “It’s becoming rather clear that the answer for [Amazon's] customers is not to try to master the AWS cloud and learn how to leverage multiple availability zones in an attempt to avoid the next outage but rather to look into a multi-vendor cloud strategy to ensure continuous business operations,” Zivanic said via email. “You can spend days, months and years trying to master AWS or you could simply do what large-scale IT organizations have been doing for decades — rely on more than one vendor.”

The fact that Amazon, like any other data center-dependent business, is not bulletproof also raises questions about why its customers don’t pursue a multi-cloud strategy or, if they’re going to rely solely on Amazon, why they put so much of their workload in one geography, a practice Amazon itself advises against. Of course, it isn’t good practice for any vendor to blame snafus on its customers.

Presumably the tech folks at Instagram, Heroku, et al. know better. Earlier this month, I asked Byron Sebastian, the Salesforce.com VP of platforms who runs the Heroku business, if Heroku was actively seeking other cloud platform partners. He said the company is always evaluating its options.

Twitter was awash in comments. Many wondered why Amazon’s data center did not cut over to generator power, while others, like Gartner analyst Lydia Leong, preferred to wait to see what part Amazon’s data center operator played in this mess.

Reached for comment Saturday afternoon, an Amazon spokeswoman reiterated that the storm caused Amazon to lose primary and backup generator power to an availability zone in its east region overnight and that service had been mostly restored. She said the company would share more details in the coming days.

Photo courtesy of Flickr user Bruce Guenter

  1. Redundancy is key. How about a cloud load balancer to seamlessly failover to another datacenter, even another AWS availability zone? http://totaluptimetech.com/solutions/cloud-load-balancing/

    1. Redundancy is key; without it, any business is impacted big time. So the business should ensure an HA solution is in place and have its data residing in 2 Availability Zones spread across the geo.

  2. Abhijeet Kumar Saturday, June 30, 2012

    I am in no position to explain why there was a power issue that affected only one availability zone in Amazon’s Virginia location; that is something I guess we will hear more about soon from Amazon. Having said that, AWS is available in other regions, which were still working fine. A lot of these posts by people who do not know the facts (i.e. they don’t work for Amazon, Netflix, Pinterest, Heroku, Instagram etc) seem to be self-serving, malicious rants without any concrete technical base. When you talk about outages, Google had an outage in April, where several Gmail accounts were affected; Apple earlier this week had an outage that affected their iCloud; Twitter had an outage too earlier this week. Why do people go so religiously batshit when they hear AWS having an outage in one availability zone in one region?

    1. Amazon gets picked on because they’re relevant. From an enterprise IaaS perspective, who really cares about any of the other providers you mentioned? AWS is the premier provider for IaaS. They have mind share not just in tech companies like Netflix, Pinterest and Instagram but in the non-tech enterprise as well.

      Try going into a company as a consultant and suggesting cloud vendors outside of AWS, and chances are they will ask you about AWS. The con for Amazon is that if events like this continue to happen, potential enterprise customers will get spooked off. Right or wrong, AWS is developing a perception for not being reliable. If Netflix can’t design a redundant solution on AWS, how will a non-tech company?

      1. Abhijeet Kumar Sunday, July 1, 2012

        I agree, as an IaaS cloud provider, especially given the fact that many smaller enterprises are only beginning to move to cloud based hosting (most tech start-ups use cloud based hosting like AWS), any AWS outage will be interesting. However, Google, Twitter etc are supposed to be pretty highly regarded when it comes to available services offered on the web. I was reading that Netflix, one of the biggies on AWS, usually didn’t get affected during most previous AWS outages. I also read they got affected only for an hour or so for this outage, which is still something. But even Google, running in its own data centers, isn’t doing way better, is it?

      2. Abhijeet Kumar Sunday, July 1, 2012

        I was reading how Twilio, one of the start-ups using AWS, avoided getting affected by previous AWS outages (which affect a single Availability Zone in a single region).

        http://www.twilio.com/engineering/2011/04/22/why-twilio-wasnt-affected-by-todays-aws-issues/

        I have read similar posts from Netflix. The fact that AWS was available in other Availability Zones in the same region and all other regions, is something that should be taken into account before going batshit on Amazon for an outage like this.

  3. Abhijeet Kumar Saturday, June 30, 2012

    Correction: It is available in all regions now at this time. What I really meant was during the issue, AWS was having an outage in one availability zone in one region, while all other regions were perfectly fine.

  4. Abhijeet Kumar Saturday, June 30, 2012

    Another correction, Google had a minor outage in June not in April, so pretty recent.

    http://www.pcworld.com/article/257177/gmail_outage_likely_hit_several_million_on_thursday.html

    1. Jack Murgia Monday, July 2, 2012

      Abhijeet Kumar: this particular outage affected the elastic load balancer service. Our company was prevented from removing instances in the affected availability zone from the load balancer, which kept sending users to downed servers. Architecting for an AZ failure did not save our site. This is probably what happened to others.

      Also, this came on the heels of events Friday AM and just a week or so ago with EBS. It has been a very bad month for Amazon. I can’t speak for all of Amazon’s customers, but it certainly has me exploring a multi-vendor or non-AWS strategy.

  5. They really hit the nail on the head here! Multi-vendor solutions provide the best redundancy. I like http://www.vmstormvps.com for my cloud hosting.

  6. welcome to the post-PC world

  7. Barb, you refer to Virginia as VA or Virginia, not Virg.

  8. I Am OnDemand Sunday, July 1, 2012

    Great amount of High availability architectures and best practices – check at http://www.newvem.com/topic/resources

  9. The reality is this – in order to operate in the cloud, you need to be able to understand that failure is not only an option – it’s a promise. This is the reason why Microsoft chose ClearDB to run MySQL on Windows Azure. It’s also the reason why Heroku and AppFog customers who use ClearDB didn’t notice any database level disruption in the last two outages. Check out our site for more info: http://bit.ly/Lc6XKp

    Disclaimer: I work for ClearDB

  10. Harish Ganesan Sunday, July 1, 2012

    Overcoming Outages in AWS : High Availability Architecture Patterns
    http://harish11g.blogspot.in/2012/06/aws-high-availability-outage.html

