Summary:

Last week’s AWS outage might have outsmarted Netflix’s Chaos Monkeys, but the content-distribution giant isn’t about to turn its back on cloud computing. The outage was a relatively small blip in what has been better overall availability since the company moved entirely to the cloud.

Last week’s Amazon Web Services outage might have outsmarted Netflix’s Chaos Monkeys, but the content-distribution giant isn’t about to turn its back on cloud computing. According to a Friday blog post from the Netflix cloud team, the outage (which started with a generator failure and resulted in a cascading bug that took down AWS’s Elastic Load Balancer feature) exposed some flaws in Netflix’s operations both within and beyond its control, but it was a relatively small blip in what has been better overall availability since the company made the move entirely to the cloud.

Amazon was at fault for the control plane backlog that resulted from the outage and prevented customers from failing over into Availability Zones not affected by the generator failure. However, Netflix’s Greg Orzell and Ariel Tseitlin write, the outage also highlighted some problems with Netflix’s own load-balancing architecture that ended up compounding the problem by “essentially caus[ing] gridlock inside most of our services as they tried to traverse our middle-tier.” Netflix is working to fix this problem.

Still, they note, Netflix has had better overall uptime since moving to the cloud and is “still bullish on the cloud.” In part, that’s because Netflix has been able to architect its cloud-based services to be resilient even when AWS fails. Some of those decisions proved wise last week:

  • Regional isolation contained the problem to users being served out of the US-EAST region.  Our European members were unaffected.
  • Cassandra, our distributed cloud persistence store which is distributed across all zones and regions, dealt with the loss of one third of its regional nodes without any loss of data or availability.
Not all of Netflix’s best-laid plans worked as intended, though, not even its oft-touted Chaos Monkeys:
  • Chaos Gorilla, the Simian Army member tasked with simulating the loss of an availability zone, was built for exactly this purpose.  This outage highlighted the need for additional tools and use cases for both Chaos Gorilla and other parts of the Simian Army.

Considering that Netflix bet the farm on AWS as its infrastructure platform for the foreseeable future, it had better be bullish on the cloud. Of course, that bet also comes with perks, as Netflix is one of AWS’s banner customers. Orzell and Tseitlin note that Netflix is working closely with AWS “on eliminating single points of failure that can cause region-wide outages and isolating the failures of individual zones.”

As distributed systems expert Geoff Arnold explained to me earlier this week, though, solving the problem entirely is likely a pipe dream. As cloud systems grow more and more complex, new and unforeseen problems will keep popping up as the old ones get fixed. If Netflix and AWS can fix the present issue, maybe it will be a while until they have to undertake this effort again on something new.

Feature image courtesy of Shutterstock user Daniel Hebert.

  1. Keith Townsend Friday, July 6, 2012

    Reblogged this on Virtualized Geek and commented:
    I’m grateful for the transparency of Netflix. This helps the overall cloud community understand the complexity of building redundant architectures on public cloud resources.

  2. Ann Gregory Friday, July 6, 2012

    In the end Netflix will win. Cable TV Co’s are losing customers because of bad customer service and high fees. You can watch regular TV online with just a web connection. The one I use is from the TVDevo website. It’s all online and reliable. Tons of shows both live and on-demand. For movies online I use Netflix. Between these 2 services I no longer need cable TV.

  3. Considering the “outage” affected 7% of virtual instances in a single facility that was part of a single AZ with multiple facilities, I’m pretty sure everyone else is going to get over it too. Juuust a hunch.

  4. Jeff Schneider Saturday, July 7, 2012

    Derick,
    It would be interesting to see how Amazon’s competitors would have done if they had a power outage. Would their recovery-time have been as good? Do the other cloud providers offer the same type of resilience in their as-a-Service offerings?
    Jeff

  5. Greg Arnette Saturday, July 7, 2012

    Commentary about Risk Management Assessment in the cloud – The Cheap Cloud versus The Reliable Cloud – apps make the difference – http://www.gregarnette.com/blog/2012/07/the-cheap-cloud-versus-the-reliable-cloud/
