
Summary:

Both Amazon Web Services and Netflix, its most prominent customer, have released details on the AWS outage that took down Netflix’s streaming service on Christmas Eve. AWS attributes the issue primarily to human error. Netflix just wants to avoid this situation again, whatever the cause.

Amazon Web Services has issued a postmortem of its Christmas Eve cloud computing outage, which took many services, most notably Netflix, offline for a portion of the night. The cause, according to AWS: a developer accidentally deleted Elastic Load Balancer (ELB) state data in Amazon’s US-East region that the service’s control plane needs in order to manage load balancers in that region.

All told, the outage (which began at 12:24 p.m. PT) lasted 23 hours and 41 minutes and, at its peak, crippled 6.8 percent of load balancers in the region while leaving the rest running, albeit unable to scale or be modified by users. The Elastic Load Balancer team didn’t identify the root cause for several hours, at which point it began the challenging process of restoring the state data to a point in time just before the accidental deletion. At 12:05 p.m. PT on Dec. 25, AWS announced that all affected load balancers had been restored to working order.
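
AWS hasn’t published the mechanics of that restore, but the general shape of a point-in-time recovery is simple to sketch: take the most recent backup from before the deletion, restore it, then replay any legitimate changes recorded between that backup and the deletion. Below is a minimal illustration of the idea; the Snapshot and Change types are hypothetical stand-ins, not anything AWS has described.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable

@dataclass
class Snapshot:
    taken_at: datetime
    state: Dict[str, dict]      # load-balancer configs keyed by name (hypothetical)

@dataclass
class Change:
    applied_at: datetime
    lb_name: str
    config: dict

def restore_to_point_in_time(snapshots: Iterable[Snapshot],
                             changes: Iterable[Change],
                             deleted_at: datetime) -> Dict[str, dict]:
    """Rebuild the state as it looked just before the accidental deletion."""
    # 1. Start from the latest backup taken strictly before the deletion.
    base = max((s for s in snapshots if s.taken_at < deleted_at),
               key=lambda s: s.taken_at)
    state = dict(base.state)
    # 2. Replay legitimate changes made between that backup and the deletion.
    for change in sorted(changes, key=lambda c: c.applied_at):
        if base.taken_at < change.applied_at < deleted_at:
            state[change.lb_name] = change.config
    return state
```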

AWS says it has taken multiple steps to ensure this situation doesn’t repeat itself, or at least can be resolved faster should something similar occur. The first, and likely easiest, fix was to apply stricter access controls to production data of the type that was deleted. According to the AWS report, such data is normally tightly locked down, but the company “had authorized additional [Elastic Load Balancer] access for a small number of developers to allow them to execute operational processes that are currently being automated.”
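
The report doesn’t say how that lockdown works in practice, but the underlying pattern is familiar: standing developer access goes away, and destructive operations on production data require a short-lived, auditable grant tied to a change ticket. Here is a rough sketch under those assumptions; the AccessGrant store and delete_state_data helper are invented for illustration and are not AWS APIs.

```python
from datetime import datetime, timedelta

class AccessDenied(Exception):
    pass

class AccessGrant:
    """A short-lived, auditable permission for one destructive operation."""
    def __init__(self, user: str, operation: str, ticket: str,
                 ttl: timedelta = timedelta(hours=1)):
        self.user = user
        self.operation = operation
        self.ticket = ticket                      # change-management reference
        self.expires_at = datetime.utcnow() + ttl

    def allows(self, user: str, operation: str) -> bool:
        return (user == self.user
                and operation == self.operation
                and datetime.utcnow() < self.expires_at)

def delete_state_data(user: str, region: str, grants: list) -> None:
    """Refuse to touch production state without an explicit, unexpired grant."""
    operation = f"elb:delete-state:{region}"      # hypothetical permission name
    grant = next((g for g in grants if g.allows(user, operation)), None)
    if grant is None:
        raise AccessDenied(f"{user} has no active grant for {operation}")
    # A real implementation would write an audit record and perform the deletion here.
    print(f"[audit] {user} ran {operation} under ticket {grant.ticket}")
```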

On the technological side, the company had the following to say:

We have also modified our data recovery process to reflect the learning we went through in this event. We are confident that we could recover ELB state data in a similar event significantly faster (if necessary) for any future operational event. We will also incorporate our learning from this event into our service architecture. We believe that we can reprogram our ELB control plane workflows to more thoughtfully reconcile the central service data with the current load balancer state. This would allow the service to recover automatically from logical data loss or corruption without needing manual data restoration.
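
That last idea, reconciling the central service data with the load balancers’ actual state, is essentially a convergence loop: periodically compare the control plane’s record with what is really running and repair whichever side is wrong. A toy sketch of the pattern follows, with in-memory stand-ins since AWS hasn’t described its internals.

```python
import time

# In-memory stand-ins for the central store and the running fleet (both hypothetical).
CENTRAL_STORE: dict = {}                                  # what the control plane thinks exists
RUNNING_FLEET: dict = {"web-lb": {"listener_port": 443}}  # what is actually serving traffic

def reconcile_once() -> None:
    """Bring the central record and the running fleet back into agreement."""
    for lb_name, running_config in RUNNING_FLEET.items():
        if lb_name not in CENTRAL_STORE:
            # The central record was lost or corrupted: rebuild it from the
            # running load balancer rather than tearing the LB down to match it.
            CENTRAL_STORE[lb_name] = dict(running_config)
        elif CENTRAL_STORE[lb_name] != running_config:
            # Record and reality disagree: push the recorded config back out.
            RUNNING_FLEET[lb_name] = dict(CENTRAL_STORE[lb_name])

def control_loop(interval_seconds: int = 60, iterations: int = 3) -> None:
    for _ in range(iterations):       # a real control plane would run this forever
        reconcile_once()
        time.sleep(interval_seconds)

if __name__ == "__main__":
    reconcile_once()
    print(CENTRAL_STORE)   # {'web-lb': {'listener_port': 443}}: the lost record heals itself
```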

More shocking than the AWS outage itself (outages have happened before and will almost certainly happen again) is that the Christmas Eve incident actually took down Netflix, which is often cited as the most advanced AWS user around. Netflix has a host of homegrown tools built specifically to monitor, manage and add reliability to its AWS-based infrastructure. There’s a reason even President Obama’s tech team relied on the company’s best practices to keep its campaign applications up and running during election crunch time.

Adrian Cockcroft (center) at Structure 2012. (c) 2012 Pinar Ozger, pinar@pinarozger.com

In a blog post on Monday, Netflix cloud guru Adrian Cockcroft acknowledged the outage’s effects on the company’s streaming service and explained how it affected different devices in different ways. Cockcroft also offered a mea culpa of sorts, explaining that while Netflix has an impressive track record when AWS outages are confined to individual Availability Zones, challenges remain when an outage affects a significant portion of an AWS region, as this one did.

Indeed, Netflix recently survived an October outage, but was hit by a July outage in which a cascading bug spread across Availability Zones in the US-East region. “We are working on ways of extending our resiliency to handle partial or complete regional outages,” Cockcroft wrote.

However, he cautioned, figuring out how to do it correctly will take some work given the complexity of cloud computing infrastructure:

We have plans to work on this in 2013. It is an interesting and hard problem to solve, since there is a lot more data that will need to be replicated over a wide area and the systems involved in switching traffic between regions must be extremely reliable and capable of avoiding cascading overload failures. Naive approaches could have the downside of being more expensive, more complex and cause new problems that might make the service less reliable.
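
Cockcroft doesn’t spell out a design, but the failure he is most wary of, cascading overload when traffic shifts between regions, suggests one obvious guardrail: only fail over to a region that demonstrably has headroom for the extra load. The sketch below illustrates that decision logic with made-up region names and capacity figures; it is not how Netflix actually does it.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Region:
    name: str
    healthy: bool
    current_rps: float      # requests per second being served right now
    capacity_rps: float     # sustained capacity for the region

def choose_failover_target(failed: Region, candidates: List[Region],
                           headroom_factor: float = 0.8) -> Optional[Region]:
    """Pick a region that can absorb the failed region's traffic without tipping over.

    Returning None (keep serving degraded from the failed region) is deliberately
    preferred over overloading a healthy region and cascading the failure.
    """
    for region in candidates:
        if not region.healthy or region.name == failed.name:
            continue
        projected = region.current_rps + failed.current_rps
        if projected <= region.capacity_rps * headroom_factor:
            return region
    return None

# Made-up numbers: us-east-1 is down, us-west-2 has headroom, eu-west-1 does not.
us_east = Region("us-east-1", healthy=False, current_rps=50_000, capacity_rps=120_000)
us_west = Region("us-west-2", healthy=True, current_rps=40_000, capacity_rps=120_000)
eu_west = Region("eu-west-1", healthy=True, current_rps=90_000, capacity_rps=110_000)

target = choose_failover_target(us_east, [eu_west, us_west])
print(target.name if target else "no safe target")        # -> us-west-2
```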

At this point, though, if anyone can figure out how to build reliable cross-region services on Amazon’s cloud platform, it’s probably Netflix. And AWS and other cloud providers will certainly undertake their own work to improve reliability across their global data centers, making themselves all the more appealing to potential customers.

We’ll have to wait and see how the latest in a string of 2012 AWS outages affects CIO sentiment toward the cloud, or whether, like Cockcroft, they’ll take the view that “it is still early days for cloud innovation” and there’s plenty of time to fix these difficult problems.

Feature image courtesy of Shutterstock user Zastolskiy Victor.

Comments:

  1. Google App Engine really has much better uptime reliability than AWS. Our company has migrated 90% of its cloud apps to GAE and there is absolutely no looking back.

    1. How can you compare the two? AWS has far more regional availability and global redundancy than any other IaaS cloud offering out there. AWS’ infrastructure is far superior to GAE’s as it stands today.

      1. But what’s the point of more regional availability if you have no uptime to even access it?

  2. How about more movies and newer releases for streaming instead of TV crap? As for focusing on the new year, I’m about to drop you, because the stuff you’re showing I can get off official sites without having to pay for it.

  3. Human error continues to be a leading cause of downtime for the cloud, as it historically was for the enterprise. The practices for minimizing human error have been around for many years. Unfortunately, service providers and IT teams are slow to adopt them. These service failures will continue until organizations recognize the inevitability of human error and design environments that tolerate these faults.
    – Dan Greller, http://www.dangreller.com

  4. KISS principle – rather than building a bigger, more clever mousetrap, focus on your core competences and just do what you do best.

  5. Christopher Haire Wednesday, January 2, 2013

    IMO, if I were Netflix, I would consider bringing some of the AWS components they use in-house. Load balancers are one of them… Netflix probably has a good handle on the traffic they serve at this point. It’s probably way cheaper in the long run to build their own datacenter and run most of the traffic internally, scaling out to elastic services for peak or unplanned usage.

  6. I would suggest looking at the SLAs and related Service Credits.
    If Netflix are happy with the recompense following Xmas eve, happy days. If not, I would suggest looking at their Procurement Team and process.
    Whatever the case, human error in which one person gets ‘blamed’ is unacceptable and unprofessional.

  7. AWS still not ready for prime time?

  8. David Saintloth Wednesday, January 2, 2013

    The fact remains, as admitted by Netflix itself, that if their system were more robust to cross-AZ issues this would not have impacted their customer base. The reasons it isn’t that robust, given this high-profile outage, are that a) distributing video streaming is hard and b) distributing it across multiple AZs is expensive. They are still working on a). The fact that human error happens is why AWS doesn’t offer five nines of service, but having multiple AZs insulates clients that distribute across more than two of them from outages that would otherwise significantly impact their customer base. Video streaming, which is inherently real-time, is *always* going to impact some customers even if Netflix were in all of AWS’s AZs, simply because an interrupted stream has to be picked up again and the latency is felt in real time. You can’t help that, but you can turn a several-hour outage into a few seconds of hiccup as the client fails over to other available, if not nearby, AZs. Netflix is admitting it needs a design that is more robust to those possibilities.

  9. Michelle Accardi Wednesday, January 2, 2013

    They need a strong IT management software partner to help them avoid these at all costs. As these services grow, the risk of error where things aren’t automated increases. Application performance management, infrastructure management, security and automation solutions can help, and need to be working on the back end to keep these companies from passing challenges on to their end users.

  10. The fact that cloud services can be affected by human error is something very scary.

  11. big deal, yawn…

  12. Nearly two months later and still having issues? Something else up?

