Summary:

Streaming media powerhouse Netflix says its experience with Amazon Web Services outages led to best practices and technology that can insulate Netflix — and potentially other companies — from the impact of weather-related and other events.

As data centers struggle to fend off or repair the effects of superstorm Sandy, Netflix says lessons it learned from past Amazon Web Services outages helped it dodge a bullet last week when an availability zone in Amazon’s US East data center complex went down. Other companies that have been impacted by cloud outages might be able to apply these lessons as well.

Netflix first noted issues with AWS US East at 8:30 a.m. EDT last Monday. The problem showed up as a network issue rather than a problem with Amazon’s Elastic Block Store (EBS) service, which caused initial confusion, according to a Netflix blog post titled “A Post-mortem of October 22, 2012 AWS degradation.” As Netflix Reliability Architect Jeremy Edberg and Director of Cloud Solutions Ariel Tseitlin wrote:

“When we were able to narrow down the network issue to a single zone, Amazon was also able to confirm that the degradation was limited to a single Availability Zone. Once we learned the impact was isolated to one AZ, we began evacuating the affected zone.”

Because Netflix had experience from other single-zone outages, it had already run evacuation drills and was able to evacuate the affected zone with minimal disruption.

Netflix’s Asgard technology helped in this effort. Asgard, which Netflix open sourced last summer, is a web interface (once known as Netflix Application Console) that engineers use to deploy code changes and manage resources on Amazon. According to Netflix, the technology lets engineers track multiple AWS components — Amazon Machine Images (AMIs), EC2 instances, etc. — used by their applications and manage them more efficiently than Amazon’s own console allows.

Here’s how Edberg and Tseitlin described the effectiveness of Asgard in their post:

“In the past we identified zone evacuations as a good way of solving problems isolated to a single zone and as such have made it easy in Asgard to do this with a few clicks per application. That preparation came in handy on Monday when we were able to evacuate the troubled zone in just 20 minutes and completely restore service to all customers.”

Also because of past experience, Netflix said it makes sure all of its software, including its Cassandra clusters, runs in three Availability Zones.
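To make the idea of a zone evacuation concrete, here is a minimal sketch in Python. It is not Asgard's code or Netflix's actual procedure: the function name `evacuate_zone` and the instance counts are hypothetical, and a real evacuation would go through AWS APIs rather than a dictionary. It only illustrates the bookkeeping of emptying one zone and spreading its capacity over the remaining ones.

```python
from typing import Dict


def evacuate_zone(capacity: Dict[str, int], bad_zone: str) -> Dict[str, int]:
    """Return a new zone -> instance-count plan with bad_zone emptied.

    Hypothetical helper for illustration only. The instances that were
    running in bad_zone are spread as evenly as possible across the
    remaining zones, so total capacity stays the same.
    """
    if bad_zone not in capacity:
        raise KeyError(f"unknown zone: {bad_zone}")

    # Sort the healthy zones so the result is deterministic.
    healthy = sorted(zone for zone in capacity if zone != bad_zone)
    if not healthy:
        raise ValueError("no healthy zone left to absorb the load")

    share, extra = divmod(capacity[bad_zone], len(healthy))

    plan = {bad_zone: 0}
    for i, zone in enumerate(healthy):
        # The first `extra` zones take one additional instance each.
        plan[zone] = capacity[zone] + share + (1 if i < extra else 0)
    return plan


if __name__ == "__main__":
    # Assumed example: one application spread evenly over three zones.
    before = {"us-east-1a": 10, "us-east-1b": 10, "us-east-1c": 10}
    print(evacuate_zone(before, "us-east-1b"))
```

With three evenly loaded zones, evacuating one leaves each of the other two carrying 50 percent more instances than before, which is why this approach depends on having spare capacity available in the surviving zones.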

So far, AWS US East has not exhibited problems associated with the massive superstorm. But given past issues — and the fact that other east coast data centers have crashed — companies that rely on Amazon’s cloud are (or should be) scrambling to find ways to mitigate outages and performance problems.

For example, Heroku, the Salesforce.com-owned Platform-as-a-Service provider that depends heavily on Amazon’s US East location, is pushing new features ahead of schedule to make it easier for applications to scale out as needed. Last week, Heroku released a “Follower” feature to its Heroku Postgres database service that replicates production databases. On Tuesday, the company noted on its status site that it’s prematurely exposing an alpha version of the feature targeting cross-region replication to Amazon’s US West region specifically to handle potential outages resulting from Sandy.

  1. Cloud Training Tuesday, October 30, 2012

    What’s happening with Hurricane Sandy really makes you appreciate how vital having multiple availability zones is

  2. Here is a helpful public site for real-time health and availability of Amazon Web Services in all regions: http://www.systemswatch.com, great to know if they are having issues minute to minute.

  3. Hi Barb, nice insight. At Ilesfay (cloud-based replication startup) we’ve never gone down even though we’ve been using AWS (all regions) since 2009. Even companies much smaller than Netflix can avoid cloud risks by following some best practices. FYI: Here’s Ilesfay’s set of principles for building resilient cloud applications: http://www.ilesfay.com/cms/default.asp?iID=GFILMJ

  4. The simple answer is to use multiple AZs and regions. However, that requires not only more money for running duplicate instances but also tools to manage instances and data across zones and regions as if they were different data centres. Netflix have Asgard to do this.

  5. Chuckle. Assguard.

    1. ah, i can’t believe i missed this Beavis & butthead implication of Asgard! thanks for reminding me!

  6. Hi guys, despite the fact that they “isolated” the issue and worked towards fixing and improving, the experience was so bad that we have decided to cancel our accounts with Netflix.
    The user experience still needs to be improved.

  7. @luis. are you referring to problems during the various amazon glitches or in general?

  8. Antonio Menoyo Friday, November 2, 2012

    I love Netflix. It is cool and my 2 year old loves it even more.
    A very good affordable service and you can view it on TV, on your iPad, iPhone. Great stuff!!!
