
Summary:

For the second time in less than a month, Amazon’s Northern Virginia data center has suffered an outage, affecting many popular services such as Instagram, Pinterest and Netflix. Amazon previously suffered an outage in its Northern Virginia facilities on June 14, 2012.

For the second time in less than a month, Amazon’s Northern Virginia data center has suffered an outage that is affecting many popular services. Amazon’s status dashboard shows that Elastic Compute Cloud, ElastiCache, Elastic MapReduce and the Relational Database Service have been down for over an hour. Amazon is blaming the outage on what it describes as “a power event.”

Dominion Virginia Power, which supplies electricity to many data centers in the Virginia region, says that severe storms in the region have disrupted power supplies:

A line of severe storms packing winds of up to 80 mph has caused extensive damage and power outages in Virginia. Dominion Virginia Power crews are assessing damages and will be restoring power where safe to do so. We appreciate your patience during this restoration process. Additional details will be provided as they become available.

The outage is affecting sites such as Instagram, Pinterest and Netflix. Heroku, a platform provider to many startups and mobile apps, has been affected as well.

Here are the latest status updates from Amazon’s dashboard:

8:21 PM PDT We are investigating connectivity issues for a number of instances in the US-EAST-1 Region.
8:31 PM PDT We are investigating elevated errors rates for APIs in the US-EAST-1 (Northern Virginia) region, as well as connectivity issues to instances in a single availability zone.
8:40 PM PDT We can confirm that a large number of instances in a single Availability Zone have lost power due to electrical storms in the area. We are actively working to restore power.
8:49 PM PDT Power has been restored to the impacted Availability Zone and we are working to bring impacted instances and volumes back online.
9:20 PM PDT We are continuing to work to bring the instances and volumes back online. In addition, EC2 and EBS APIs are currently experiencing elevated error rates.
9:24 PM PDT We can confirm that a large number of RDS instances are impaired. We are actively working on recovering them.
9:54 PM PDT EC2 and EBS APIs are once again operating normally. We are continuing to recover impacted instances and volumes.

Update: The Washington Post reported that 1.5 million customers in the DC area were without power Saturday morning, including nearly 500,000 in Northern Virginia. (Comparing the storm’s damage to that of a hurricane, Dominion VA Power tweeted Saturday morning that it had restored power to about 90,000 customers but that full restoration would be a multi-day effort.) Some of the AWS ripple effects from Friday night’s massive storm have been resolved (Netflix and Pinterest are back), but some services remain down, including Instagram. Heroku continues to report problems.

At 5 a.m. PDT, AWS was still reporting power-related issues with Elastic Compute Cloud, Elastic Beanstalk and the Relational Database Service.


  1. The mocking tone is not really appropriate in this instance. The DC area suffered a massive storm that left 2 million people without power across five states. Winds of 60 to 80 mph. Trees down everywhere. I think this is bigger than you imply.

    1. Stating facts, Geoff, not mocking. Sorry if it read that way.

  2. Ah. I see you revised the story and toned it down.

    1. Sorry, typing on an iPad – sometimes can be hard ;-)

  3. Starting today, STOP typing on iPad…

    Actually we are boycotting APPLE…

  4. The CTOs and sysadmins for all these companies should be fired Monday morning… Let’s not blame the weather but instead blame the fact that these companies allowed a “failure point” in one data center to affect their entire chain of service instead of using “high availability” best practices.

    1. I don’t think there is a company on the planet that has invested more in scalability and reliability in the cloud than Netflix. If any of those guys were fired, I’d hire them right away. It is easy to sit on a couch and throw stones. Netflix, Instagram and Pinterest have very complex distributed architectures, and none of us should be judging them. They have accomplished amazing things with small staffs. My company has survived every AWS outage since we deployed on the cloud in 2009, and I think our choice to stay away from RDS is one of the main reasons why. As much as we like the automation of RDS, it seems to go down with every outage, and when the database is down, life is not good. We have had various servers go down, but our database never has. That is the main reason we have never missed a transaction. We are not any smarter than the Netflix guys. We have less complexity (due to much less traffic), and we chose to manage the database ourselves because the risks outweighed the benefits of automation.

    2. In many cases it may be a calculated risk: the cost of adding more nines of uptime is easily weighed against lost profits, and none of these Instagrams are things we cannot live without.

  5. Rod Boothby Friday, June 29, 2012

    Aren’t high quality data centers supposed to have back-up power generators? Massive storms aren’t actually a fair excuse. For example, the guys at the Switch Supernap have more power generation capacity than they have standard power available. http://www.switchnap.com/pages/all-things-switch/the-supernaps.php

    1. They do have backup power. The entire data center did not lose power. Some instances lost power, which to me means some of the servers were knocked offline. That is very different from an entire data center losing power. Only 2 of our 50+ servers were affected, and we never missed a beat.

      1. Thanks for sharing, Mike. It is a valuable piece of information, and hopefully Amazon will outline the extent of the damage soon.

  6. Now it’s up. I can access Pinterest now. Location, Philippines. :)

  7. Now more than ever! – Newvem – KnowYourCloud! #AWS outage http://goo.gl/XFqmz.

  8. Today, Newvem will publish usage tips and recommendations on how to prevent and protect against AWS outages. Go here to read more – http://goo.gl/XFqmz

  9. Reblogged this on Billy Moses.
