Once again, Netflix shows how to avoid a cloud meltdown

As data centers struggle to fend off or repair the effects of superstorm Sandy, Netflix says lessons it learned from past Amazon Web Services outages helped it dodge a bullet last week when an availability zone in Amazon’s US East data center complex went down. Other companies that have been impacted by cloud outages might be able to apply these lessons as well.

Netflix first noted issues with AWS US East at 8:30 a.m. EDT last Monday. The problem initially showed up as a network issue rather than a problem with Amazon’s Elastic Block Store (EBS) service, which caused some confusion at first, according to a blog post titled “A Post-mortem of October 22, 2012 AWS degradation.” As Netflix Reliability Architect Jeremy Edberg and Director of Cloud Solutions Ariel Tseitlin wrote:

“When we were able to narrow down the network issue to a single zone, Amazon was also able to confirm that the degradation was limited to a single Availability Zone. Once we learned the impact was isolated to one AZ, we began evacuating the affected zone.”

Because Netflix had experience from other single-zone outages, it had run evacuation drills and was able to do this with minimal disruption.

Netflix’s Asgard technology helped in this effort. Asgard, which Netflix open sourced last summer, is a web interface (once known as Netflix Application Console) that engineers use to deploy code changes and manage resources on Amazon. According to Netflix, the technology lets engineers track multiple AWS components — Amazon Machine Images (AMIs), EC2 instances, etc. — used by their applications and manage them more efficiently than Amazon’s own console allows.

Here’s how Edberg and Tseitlin described the effectiveness of Asgard in their post:

“In the past we identified zone evacuations as a good way of solving problems isolated to a single zone and as such have made it easy in Asgard to do this with a few clicks per application. That preparation came in handy on Monday when we were able to evacuate the troubled zone in just 20 minutes and completely restore service to all customers.”

Also because of past experience, Netflix said it makes sure all of its software, including its Cassandra clusters, runs in three Availability Zones.
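To make the idea of a zone evacuation concrete, here is a minimal Python sketch of the capacity arithmetic involved: drain one zone to zero and spread its instances over the zones that remain. This is an illustration only, not how Asgard or Netflix actually implements it. The function name `evacuate_zone` and the zone labels are hypothetical, and the sketch makes no AWS calls.

```python
from typing import Dict


def evacuate_zone(capacity: Dict[str, int], bad_zone: str) -> Dict[str, int]:
    """Return a new per-zone instance plan with bad_zone drained to zero.

    Hypothetical helper for illustration: the total instance count is
    preserved by spreading the drained instances as evenly as possible
    over the remaining zones.
    """
    if bad_zone not in capacity:
        raise KeyError(f"unknown zone: {bad_zone}")
    healthy = sorted(zone for zone in capacity if zone != bad_zone)
    if not healthy:
        raise ValueError("no healthy zones left to absorb capacity")

    share, remainder = divmod(capacity[bad_zone], len(healthy))
    plan = {bad_zone: 0}
    for i, zone in enumerate(healthy):
        # The first `remainder` zones take one extra instance each.
        plan[zone] = capacity[zone] + share + (1 if i < remainder else 0)
    return plan


if __name__ == "__main__":
    # Zone labels are placeholders, not taken from the Netflix post.
    before = {"zone-a": 10, "zone-b": 10, "zone-c": 10}
    print(evacuate_zone(before, "zone-b"))
```

With three zones of 10 instances each, evacuating one zone leaves the other two at 15 apiece. That is why running in three zones matters: each surviving zone absorbs only half of the displaced load.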

So far, AWS US East has not exhibited problems associated with the massive superstorm. But given past issues — and the fact that other East Coast data centers have crashed — companies that rely on Amazon’s cloud are (or should be) scrambling to find ways to mitigate outages and performance problems.

For example, Heroku, the Salesforce.com-owned Platform-as-a-Service provider that depends heavily on Amazon’s US East location, is pushing new features ahead of schedule to make it easier for applications to scale out as needed. Last week, Heroku released a “Follower” feature to its Heroku Postgres database service that replicates production databases. On Tuesday, the company noted on its status site that it’s prematurely exposing an alpha version of the feature targeting cross-region replication to Amazon’s US West region specifically to handle potential outages resulting from Sandy.
