
Summary:

Once again Amazon has experienced significant problems at its big US East data center. The snafu has taken down Foursquare, Reddit, Heroku and other popular websites.

Amazon Web Services (photo: Flickr/Will Merydith)

Updated: Here we go again. Problems with Amazon’s Elastic Block Storage (EBS) service have brought down Foursquare, Reddit, Heroku, and other popular websites. Once again, Amazon’s U.S. East data center in Virginia is ground zero for these issues, just as it was last June when there were two significant outages.  

Here’s the lowdown from the AWS status page:

10:38 AM PDT We are currently investigating degraded performance for a small number of EBS volumes in a single Availability Zone in the US-EAST-1 Region.
11:11 AM PDT We can confirm degraded performance for a small number of EBS volumes in a single Availability Zone in the US-EAST-1 Region. Instances using affected EBS volumes will also experience degraded performance.
11:26 AM PDT We are currently experiencing degraded performance for EBS volumes in a single Availability Zone in the US-EAST-1 Region. New launches for EBS backed instances are failing and instances using affected EBS volumes will experience degraded performance.
12:32 PM PDT We are working on recovering the impacted EBS volumes in a single Availability Zone in the US-EAST-1 Region.
1:02 PM PDT We continue to work to resolve the issue affecting EBS volumes in a single availability zone in the US-EAST-1 region. The AWS Management Console for EC2 indicates which availability zone is impaired. EC2 instances and EBS volumes outside of this availability zone are operating normally. Customers can launch replacement instances in the unaffected availability zones but may experience elevated launch latencies or receive ResourceLimitExceeded errors on their API calls, which are being issued to manage load on the system during recovery. Customers receiving this error can retry failed requests.
2:20 PM PDT We’ve now restored performance for about half of the volumes that experienced issues. Instances that were attached to these recovered volumes are recovering. We’re continuing to work on restoring availability and performance for the volumes that are still degraded.

We also want to add some detail around what customers using ELB may have experienced. Customers with ELBs running in only the affected Availability Zone may be experiencing elevated error rates and customers may not be able to create new ELBs in the affected Availability Zone. For customers with multi-AZ ELBs, traffic was shifted away from the affected Availability Zone early in this event and they should not be seeing impact at this time.
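The status page’s advice to “retry failed requests” is worth unpacking. Below is a minimal sketch — not anything Amazon published with the update — of what that looks like using the boto3 SDK: launch a replacement instance in an unaffected zone and back off whenever the API answers with ResourceLimitExceeded. The AMI, instance type, and zone are hypothetical placeholders.

```python
# Hedged sketch: retry instance launches with exponential backoff when AWS
# sheds load during recovery. All identifiers below are placeholders.
import time

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")


def launch_replacement(ami_id, instance_type, zone, max_attempts=5):
    """Launch a replacement instance in an unaffected AZ, retrying on
    ResourceLimitExceeded as the status page suggests."""
    for attempt in range(max_attempts):
        try:
            resp = ec2.run_instances(
                ImageId=ami_id,
                InstanceType=instance_type,
                MinCount=1,
                MaxCount=1,
                Placement={"AvailabilityZone": zone},
            )
            return resp["Instances"][0]["InstanceId"]
        except ClientError as err:
            # During recovery AWS may throttle launches; wait and retry.
            if err.response["Error"]["Code"] == "ResourceLimitExceeded":
                time.sleep(2 ** attempt)
                continue
            raise
    raise RuntimeError("gave up after %d attempts" % max_attempts)


# Hypothetical usage: launch into a zone other than the impaired one.
# instance_id = launch_replacement("ami-12345678", "m1.small", "us-east-1b")
```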

Update: The company’s Relational Database Service is also balky. According to the status page:
11:03 AM PDT We are currently experiencing connectivity issues and degraded performance for a small number of RDS DB Instances in a single Availability Zone in the US-EAST-1 Region.
11:45 AM PDT A number of Amazon RDS DB Instances in a single Availability Zone in the US-EAST-1 Region are experiencing connectivity issues or degraded performance. New instance create requests in the affected Availability Zone are experiencing elevated latencies. We are investigating the root cause.
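For what it’s worth, RDS has a built-in answer to exactly this kind of single-zone trouble: a Multi-AZ instance keeps a synchronous standby in another Availability Zone and fails over to it automatically. Here is a minimal, hypothetical sketch with boto3; every identifier, size, and credential below is a placeholder.

```python
# Hedged sketch: create an RDS instance with Multi-AZ enabled so a standby
# in another Availability Zone can take over during an event like this one.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="example-db",   # placeholder name
    Engine="mysql",
    DBInstanceClass="db.m1.small",       # placeholder class
    AllocatedStorage=20,                 # GiB
    MasterUsername="admin",
    MasterUserPassword="change-me",      # placeholder; use a secret store
    MultiAZ=True,                        # provision a synchronous standby
)
```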
The US East data center is Amazon’s oldest and largest. Amazon, by virtue of its dominance as a public cloud provider, gets a ton of press when it experiences outages. Last week at Structure: Europe, Amazon CTO Werner Vogels said the company sees no reluctance among businesses — even risk-averse European businesses — to put workloads on Amazon infrastructure. Still, outages like this are likely to increase skittishness among potential customers who have not yet deployed on public cloud infrastructure.
It also remains something of a mystery why so many tech-savvy companies (hello, Heroku?) deploy so much of their work at that particular US East site when Amazon itself recommends deploying across geographies and availability zones.
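That recommendation isn’t hard to follow in code. Here is a minimal, hypothetical boto3 sketch of spreading instances across every available zone in a region and fronting them with a multi-AZ load balancer; the AMI, instance type, and load balancer name are placeholders, and the classic ELB calls are used purely for illustration.

```python
# Hedged sketch: one instance per available zone, behind a load balancer
# registered in every zone, so a single impaired zone doesn't take the
# site down. All identifiers are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
elb = boto3.client("elb", region_name="us-east-1")

# Find the zones that are currently usable in the region.
zones = [
    z["ZoneName"]
    for z in ec2.describe_availability_zones()["AvailabilityZones"]
    if z["State"] == "available"
]

# Launch one web server per zone instead of piling everything into one zone.
instance_ids = []
for zone in zones:
    resp = ec2.run_instances(
        ImageId="ami-12345678",          # placeholder AMI
        InstanceType="m1.small",         # placeholder type
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
    instance_ids.append(resp["Instances"][0]["InstanceId"])

# A load balancer registered in every zone lets traffic shift away from a
# zone that goes bad, as AWS describes for multi-AZ ELBs above.
elb.create_load_balancer(
    LoadBalancerName="example-web-lb",   # placeholder name
    Listeners=[{"Protocol": "HTTP", "LoadBalancerPort": 80, "InstancePort": 80}],
    AvailabilityZones=zones,
)
elb.register_instances_with_load_balancer(
    LoadBalancerName="example-web-lb",
    Instances=[{"InstanceId": i} for i in instance_ids],
)
```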

Stay tuned for updates here.

  1. It’s interesting that the US East data centre seems to suffer most from outages. It reminds me of the seemingly constant problems Rackspace had in their older Dallas FW data centre before they completed major upgrade works.

    At least it encourages people to build redundancy across facilities and shows that “this cloud thing” isn’t the single answer to redundancy – you still have to architect your systems to handle failover at multiple levels!

  2. And it’s not just the US of A… I went to purchase a deal in Panama, and even http://OfertaSimple.com was down too. Half the world is served through Amazon.

  3. I use Quora almost daily, and as part of my role as chief evangelist at Newvem.com I have followed the AWS cloud very closely for at least the last 3 years.

    I noticed one very interesting thing – contrary to past AWS outages (which I covered on my blog – http://www.iamondemand.com/post/5005375142/amazon-outage-is-it-a-story-of-a-conspiracy), Quora didn’t go down in this one. It seems someone there learned a lesson and changed their DR architecture to fit the principle of shared responsibility and the concept that the cloud will always fail – you must be prepared for that. Learn more here: http://gigaom.com/cloud/amazon-outages-lessons-learned/

  4. robertovalerio Monday, October 22, 2012

    @geckoboard is down as well. It does not affect our business but I am missing my management dashboards…!

