Last week’s Amazon Web Services outage might have outsmarted Netflix’s Chaos Monkeys, but the content-distribution giant isn’t about to turn its back on cloud computing. The outage started with a generator failure and resulted in a cascading bug that took down AWS’s Elastic Load Balancer feature. According to a Friday blog post from the Netflix cloud team, it exposed flaws in Netflix’s operations, some within the company’s control and some beyond it. Even so, it was a relatively small blip in what has been better overall availability since the company moved entirely to the cloud.
Amazon was at fault for the control-plane backlog that prevented customers from failing over into Availability Zones not affected by the generator failure. However, Netflix’s Greg Orzell and Ariel Tseitlin write, the outage also highlighted some problems with Netflix’s own load-balancing architecture that ended up compounding the problem by “essentially caus[ing] gridlock inside most of our services as they tried to traverse our middle-tier.” Netflix is working to fix this problem.
Still, they note, Netflix has had better overall uptime since moving to the cloud and is “still bullish on the cloud.” In part, that’s because Netflix has been able to architect its cloud-based services to be resilient even when AWS fails. According to the post, some of those decisions proved wise last week:
- Regional isolation contained the problem to users being served out of the US-EAST region. Our European members were unaffected.
- Cassandra, our distributed cloud persistence store which is distributed across all zones and regions, dealt with the loss of one third of its regional nodes without any loss of data or availability.
- Chaos Gorilla, the Simian Army member tasked with simulating the loss of an availability zone, was built for exactly this purpose. This outage highlighted the need for additional tools and use cases for both Chaos Gorilla and other parts of the Simian Army.
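The Cassandra point is easier to see with a toy model. The sketch below is not Netflix’s configuration, which the post excerpted here does not spell out. It assumes three zones, one replica of each key per zone (a replication factor of 3) and quorum reads and writes; the zone names and helper functions are hypothetical. Under those assumptions, losing one zone removes a third of the nodes but still leaves two of three replicas, which is enough for a quorum.

```python
# Toy model of zone-aware replication with quorum reads/writes.
# Assumptions (not from the Netflix post): 3 zones, replication factor 3,
# one replica per zone, and quorum consistency. All names are hypothetical.

ZONES = ["zone-a", "zone-b", "zone-c"]
REPLICATION_FACTOR = 3


def place_replicas(key, zones=ZONES, rf=REPLICATION_FACTOR):
    """Pick rf zones for a key, starting from a deterministic offset."""
    start = sum(key.encode("utf-8")) % len(zones)
    return [zones[(start + i) % len(zones)] for i in range(rf)]


def quorum(rf=REPLICATION_FACTOR):
    """Smallest number of replicas that forms a majority of rf."""
    return rf // 2 + 1


def is_available(key, failed_zones, zones=ZONES, rf=REPLICATION_FACTOR):
    """A key stays readable and writable while a quorum of replicas is live."""
    live = [z for z in place_replicas(key, zones, rf) if z not in failed_zones]
    return len(live) >= quorum(rf)


if __name__ == "__main__":
    # Losing one zone: 2 of 3 replicas remain, so a quorum is still reachable.
    print(is_available("member:42", {"zone-a"}))
    # Losing two zones: only 1 replica remains, so quorum operations fail.
    print(is_available("member:42", {"zone-a", "zone-b"}))
```

In this model, a tool like Chaos Gorilla amounts to calling `is_available` with a whole zone in `failed_zones` and checking that every key still answers.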
Considering that Netflix bet the farm on AWS as its infrastructure platform for the foreseeable future, it had better be bullish on the cloud. Of course, that bet also comes with perks, as Netflix is one of AWS’s banner customers. Orzell and Tseitlin note that Netflix is working closely with AWS “on eliminating single points of failure that can cause region-wide outages and isolating the failures of individual zones.”
As distributed systems expert Geoff Arnold explained to me earlier this week, though, solving the problem entirely is likely a pipe dream. As cloud systems grow more complex, new and unforeseen problems will keep popping up as the old ones get fixed. If Netflix and AWS can fix the present issue, it may be a while before they have to undertake this effort again for something new.