Summary:

Netflix is now running its streaming service live across two regions of the Amazon Web Services cloud platform, an architectural decision that should avoid a nasty service disruption like the one that struck last Christmas Eve.

Source: Netflix

Netflix is fighting back against outages in the Amazon Web Services cloud computing platform with a new architecture that runs its streaming-video service live across multiple regions of the AWS cloud and balances user traffic between them. It is, to quote Ron Burgundy, kind of a big deal.

It’s big because availability matters, and the cloud hasn’t always delivered it as some might have expected. The term “cloud computing” evokes images of workloads floating across a heavenly blanket of clouds in the sky without a care in the world, but the technical reality has traditionally been more like a handful of clouds scattered across the sky. Your servers are running in one of those clouds: If it dissipates, so does your application.

AWS has rolled out numerous features and architectural strategies (such as regions and availability zones) to mitigate the effects of outages by letting users send their workloads to other virtual servers, but its efforts aren’t foolproof. Last Christmas Eve, for example, a massive outage took down a boatload of applications and websites, the often minimally affected Netflix among them. Netflix first tried to solve this problem by designing its streaming service to fail over from one AWS region to another if something went wrong.

In a blog post on Monday, Netflix engineers Ruslan Meshenberg, Naresh Gopalani and Luke Kosewski explained how the company is now running its streaming service simultaneously across AWS’s US East-1 (in Virginia) and US West-2 (in Oregon) and balancing user traffic across them. Normally, traffic is routed to the region that’s geographically closer to the user (a roughly 50-50 split), but if a whole region goes down, as happened on Christmas Eve, Netflix can route all traffic to the healthy region.
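
For readers who think in code, the routing rule described above can be sketched in a few lines of Python. This is only an illustration of the idea (prefer the nearer region, fall back to any healthy one), not Netflix's implementation; the `Region` class, the `pick_region` function and the lowercase region identifiers are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Region:
    """Hypothetical record for one cloud region and its health status."""
    name: str
    healthy: bool = True


def pick_region(preferred: str, regions: Dict[str, Region]) -> str:
    """Return the user's geographically preferred region if it is healthy,
    otherwise the first healthy region found (assumed failover rule)."""
    if regions[preferred].healthy:
        return preferred
    for name, region in regions.items():
        if region.healthy:
            return name
    # Every region is down: there is nowhere left to send traffic.
    raise RuntimeError("no healthy region available")
```

With both regions healthy, a user nearer to Virginia would get `"us-east-1"` back; mark that region unhealthy and the same call returns `"us-west-2"`, which is the "route all traffic to the healthy region" behavior in miniature.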

Source: Netflix

It’s only natural that Netflix would be the company to develop such a strategy for achieving maximum cloud uptime, given that it has (presumably) been AWS’s heaviest and most innovative customer for a couple of years, but this task required the company to rethink even some of its most renowned software creations. Its Asgard cloud-management tool, for example, was designed only for single-region workloads, so Netflix created a new tool called Mimir to address these new cross-country ones.

Even its vaunted Simian Army, a set of tools for testing Netflix’s resilience to various types of cloud failures, had to be supplemented. Netflix developed a new monkey called Chaos Kong for testing cross-region failover scenarios, as well as Split-Brain, which simulates a cut connection between its two AWS regions. The company actually got a chance to test out cross-region failover in real life during a recent outage, and it went off largely without a hitch, according to the blog post authors.

Source Netflix

However, as impressive as Netflix’s engineering might be, its approach is on the cutting edge of how companies are architecting their cloud resources and is by no means attainable by the majority of cloud users. That means there’s a big opportunity for cloud providers to automate this type of capability for their users in an effort to push cloud computing closer to that dream of geographically aware and resilient applications. Google and AWS, at least, are already working on it, but they appear to be a way off from where Netflix is at this point.

  1. This seems like a natural path of evolution that any piece of software is going to go through in the future. In the past, software was restricted to confined environments because of budget and hardware complexities. With the rise in virtualization and on-demand infrastructure, programmers can segregate their workloads and put them where they work best.

    The initial version of “cloud” was just the spark; now we’re starting to see what the real capabilities are.

  2. adrian cockcroft Monday, December 2, 2013

    Two points: the code that implements active-active is all open source, so anyone can build an application architected this way. The key projects are Apache Cassandra and NetflixOSS Denominator, Zuul and EVCache. The second point is that Netflix is not only protecting against regional AWS outages, it is protecting against bugs and misconfigurations in its own code that result from rapid deployment of new features. The outage shown is an example of a self-inflicted problem.

    1. Derrick Harris Monday, December 2, 2013

      Adrian, thanks for the comment. I didn’t notice anything specific to the active-active architecture on the OSS page — is that coming or is it all pretty much reworked existing code?

  3. They’re just now don’t this?

    1. doing*

  4. *Please run grammar check*

  5. Interesting to contrast this load balancing with Verizon’s push to far fewer IP interconnection points with other providers. It simply doesn’t make sense, not only as far as 1-way content is concerned, but more importantly 2-way HD collaborative applications. Imagine 20 2-5mbs HD sessions from 5 towns in a 300sq mile region all being transited 500 miles away and back. Totally inefficient. What are they thinking? Oh, yeah, it’s 1913 all over again; only bigger. The FCC should address this issue next week at its IP transition hearings.

  6. The title is misleading – it’s not about streaming media traffic, it’s about traffic to the UI/portal/CDN-redirector.

  7. Where is the innovation here? We were doing this 20 years ago between servers but didn’t call it “cloud”….

  8. Sriram Gopalakrishnan Wednesday, December 4, 2013

    Isn’t there an inherent flaw in this architecture? This is exactly what causes large scale blackouts. There are times when it makes sense to redirect traffic to another region and there are times when it makes sense to restrict the outage. I hope the actual implementation is smarter than what this article makes it look like.

  9. Editor in chief: find “availability” and “required the company” and remove redundant words/phrase.

