Netflix clearly has learned from its Christmas Eve outage, which involved the failure of Amazon Web Services’ Elastic Load Balancing service, and has created a tool called Isthmus to solve the problem.
In a Friday blog post, Netflix’s Ruslan Meshenberg explained that Isthmus manages Elastic Load Balancing services in multiple regions in order to keep latency low for users in the event that ELB goes down in one region. On Christmas Eve, the issue was that state data got deleted, causing problems for the control plane tasked with managing load-balancer configuration and bringing down some ELB load balancers. Now that sort of error could be less likely to affect Netflix service.
Isthmus builds on the newly open-sourced Zuul tool for managing the Netflix API and Elastic Load Balancer services that my colleague Barb Darrow wrote about earlier this week. “Zuul is at the core of the Isthmus setup — it forwards all of user traffic and establishes the bridge (or an Isthmus) between 2 AWS regions,” Meshenberg wrote.
Now, Netflix can split user traffic evenly between a US-East region and a US-West region. It’s testing to see if production-scale traffic can be shifted to one region in the event of another outage.
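To make the idea concrete, here is a minimal Python sketch of the routing behavior described above: traffic splits roughly evenly between two regions, and all of it shifts to the surviving region if one goes down. This is not Netflix's code and does not reflect how Zuul or Isthmus is actually implemented; the region names, the `pick_region` function and the hash-based split are assumptions made purely for illustration.

```python
import hashlib
from typing import AbstractSet

# Hypothetical region labels, chosen for illustration only.
REGIONS = ["us-east", "us-west"]


def pick_region(user_id: str, healthy: AbstractSet[str]) -> str:
    """Pick a region for a user, considering only regions marked healthy.

    With both regions healthy, a stable hash of the user ID splits users
    roughly in half. With one region down, every user lands on the other.
    """
    candidates = [region for region in REGIONS if region in healthy]
    if not candidates:
        raise RuntimeError("no healthy region available")
    # A stable hash keeps a given user in the same region between requests.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[index]
```

Calling `pick_region("user-42", {"us-east", "us-west"})` returns one of the two regions and always the same one for that user, while `pick_region("user-42", {"us-west"})` always returns `"us-west"`. A real deployment would also need health checks and enough spare capacity in each region to absorb the other's load, which is presumably what Netflix's production-scale testing is meant to verify.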
Netflix constructs a lot of tools to make applications running on AWS work better, and Zuul and Isthmus are just the latest. Netflix Cloud Architect Adrian Cockcroft will discuss some of them at our Structure conference next week, as well as the Open Connect content-delivery network Netflix built using its own custom hardware.
Presumably, once Netflix makes Isthmus available as an open-source service, other AWS customers could adopt it and finagle it to fit their own deployments. Then again, AWS might adjust itself accordingly. One would think that would be in Amazon’s best interest as it strives to gain more enterprise customers.