Netflix is fighting back against outages in the Amazon Web Services cloud computing platform with a new architecture that has its streaming-video service running live across multiple regions of the AWS cloud and balancing user traffic between them. It is, to quote Ron Burgundy, kind of a big deal.
It’s big because availability is a big deal, and the cloud hasn’t always delivered the availability some might have expected. The term “cloud computing” evokes images of workloads floating across a heavenly blanket of clouds in the sky without a care in the world, but the technical reality has traditionally been more like a handful of clouds scattered across the sky. Your servers are running in one of those clouds: If it dissipates, so does your application.
AWS has rolled out numerous features and architectural strategies (such as regions and availability zones) to mitigate the effects of outages by letting users send their workloads to other virtual servers, but its efforts aren’t foolproof. Last Christmas Eve, for example, a massive outage took down a boatload of applications and websites, among them Netflix, which usually escapes such outages with minimal damage. Netflix first tried to solve this problem by designing its streaming service to fail over from one AWS region to another if something went wrong.
In a blog post on Monday, Netflix engineers Ruslan Meshenberg, Naresh Gopalani and Luke Kosewski explained how the company is now running its streaming service simultaneously across AWS’s US East-1 (in Virginia) and US West-2 (in Oregon) and balancing user traffic across them. Normally, traffic is routed to the region that’s geographically closer to the user (a roughly 50-50 split), but if a whole region goes down, as happened on Christmas Eve, Netflix can route all traffic to the healthy region.
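To make that routing rule concrete, here is a minimal sketch in Python. It is not Netflix’s implementation, which the blog post describes only at a high level; the `Region` class, the `pick_region` function and the approximate coordinates are all assumptions made for illustration. The idea is simply to send each user to the nearest region that is still healthy.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Region:
    name: str
    lat: float
    lon: float


# Approximate coordinates, assumed for illustration only.
REGIONS = [
    Region("us-east-1", 38.9, -77.4),   # Northern Virginia
    Region("us-west-2", 45.8, -119.7),  # Oregon
]


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = (sin(dlat / 2) ** 2
         + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))


def pick_region(user_lat, user_lon, healthy):
    """Return the name of the closest region that is currently healthy.

    `healthy` is a set of region names; how health is determined is
    outside the scope of this sketch.
    """
    candidates = [r for r in REGIONS if r.name in healthy]
    if not candidates:
        raise RuntimeError("no healthy region available")
    closest = min(
        candidates,
        key=lambda r: distance_km(user_lat, user_lon, r.lat, r.lon),
    )
    return closest.name
```

With both regions healthy, a viewer in New York (roughly 40.7, -74.0) would land on `us-east-1`; remove `us-east-1` from the healthy set and the same viewer is sent to `us-west-2`, which is the all-traffic-to-one-region case described above.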
It’s only natural that Netflix would be the company to develop such a strategy for achieving maximum cloud uptime, since it has (presumably) been AWS’s heaviest and most-innovative customer for a couple of years, but this task required the company to rethink even some of its most-renowned software creations. Its Asgard cloud-management tool, for example, was designed only for single-region workloads, so Netflix created a new tool called Mimir to address these new cross-country ones.
Even its vaunted Simian Army, a set of tools for testing Netflix’s resilience to various types of cloud failures, had to be supplemented. Netflix developed two new monkeys: Chaos Kong, which tests cross-region failover scenarios, and Split-Brain, which simulates a cut connection between its two AWS regions. The company actually got a chance to test out cross-region failover in real life during a recent outage, and it went off largely without a hitch, according to the blog post authors.
However, as impressive as Netflix’s engineering might be, it sits on the cutting edge of how companies are architecting their cloud resources and is by no means attainable by the majority of cloud users. That means there’s a big opportunity for cloud providers to automate this type of capability for their users in an effort to push cloud computing closer to that dream of geographically aware and resilient applications. Google and AWS, at least, are already working on it, but they appear to be a ways off from where Netflix is at this point.