Summary:

As a result of last week’s four-day cloud computing outage, both Amazon Web Services and its users have publicly come under fire for their practices. I think both groups will take these criticisms to heart, but I don’t think we should expect anything too drastic.

During and after last week’s four-day cloud computing outage, both Amazon Web Services and its users have publicly come under fire for their practices. Customers and skeptics argue that AWS needs to be more transparent and could stand to learn a thing or two about cloud-based network storage and SLAs. Cloud supporters and many analysts argue that AWS users, and all cloud users, could stand to learn a lot about how to design their cloud application infrastructures to handle failures on their cloud provider’s end. I think both groups will take these criticisms to heart, but I don’t think we should expect anything too drastic when it comes to how they actually respond.

How AWS Might Change Its Ways

  • It will redesign Elastic Block Store. AWS has a history of continuously improving services and responding to customer demands, and this might be the biggest demand it has faced thus far. But it almost has to completely redesign EBS if the criticism from the past few months is accurate. The service has been plagued with performance problems and, according to Joyent’s Jason Hoffman, was pretty much destined for a failure of this type. According to RightScale’s Thorsten von Eicken, the EBS failure appeared to span Availability Zones, meaning a problem in one zone could affect users’ ability to use the service in other zones, even if those zones aren’t part of the problem.
  • It will offer an SLA for EBS. Another big issue is the absence of an SLA for either EBS or Relational Database Service. Because AWS offers SLAs only for EC2 and S3, both of which were technically operating fine during the outage, it has been reported that customers won’t even get service credits for their troubles. If this is indeed the case (I’ve reached out to Amazon for confirmation and will update the post if I hear back), AWS might need an SLA to keep current EBS users or attract new ones after all this negative publicity. Zencoder is offering service credits to its customers affected by this outage even though, contractually, it didn’t have to. I don’t think Amazon will resort to following Zencoder’s lead.
  • It will offer automated multi-region deployment. Multi-Availability-Zone deployment is the gold standard for high availability using Amazon’s current suite of features. But this outage affected multiple Availability Zones in the same geographic region, meaning even customers who abide by the practice may have experienced downtime. In keeping with AWS’s trend of continuous improvement, I think the service will make automated multi-region deployments an option to ensure applications and data stay up, even if an entire geographic region goes down.
  • It won’t change much else. Transparency, increased liability beyond the SLAs and better standard support will all remain at the status quo. Amazon is a large company that, unlike many startup cloud providers, doesn’t have to rely on openness to appease its customers when something goes wrong. It can be as transparent as it deems necessary because 1) it’s currently the best cloud platform available; and 2) it already has their business. Look at the now week-long Sony Network outage affecting roughly 75 million users, about which Sony stayed mum until today, when it told customers that hackers have all their data. That’s an epic failure, yet I’m not really considering abandoning my PlayStation. In large part, AWS’s relative lack of transparency throughout this outage and other, similar events might be about limiting its liability in case some disgruntled customers decide to sue. Its Customer Agreement is already nearly airtight, and for good reason (can you imagine suddenly being on the hook for lost lives because of a networking issue?), but there are openings. Better for AWS to deliver a well-thought-out explanation after the fact than say something stupid in the name of openness early on.
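
To make the multi-zone versus multi-region distinction above a little more concrete, here is a minimal sketch of the routing decision involved. It is only an illustration, not how AWS or any particular customer implements failover: the `Replica` type, the `pick_replica` helper and the region and zone labels are all hypothetical, and a real deployment would also need health checks, data replication and DNS or load-balancer changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    region: str    # hypothetical region label, e.g. "us-east"
    zone: str      # hypothetical Availability Zone label within the region
    healthy: bool  # assumed to come from some external health check


def pick_replica(replicas, primary_region):
    """Return a healthy replica, preferring the primary region.

    Returns None when no replica is healthy.
    """
    healthy = [r for r in replicas if r.healthy]
    # Stable sort: replicas in the primary region (key False) come first.
    healthy.sort(key=lambda r: r.region != primary_region)
    return healthy[0] if healthy else None


# Multi-zone only: every copy lives in one region, so a region-wide
# problem leaves nothing to fail over to.
multi_az_only = [
    Replica("us-east", "a", healthy=False),
    Replica("us-east", "b", healthy=False),
]

# Multi-region: the same deployment plus one copy in a second region.
multi_region = multi_az_only + [Replica("us-west", "a", healthy=True)]

print(pick_replica(multi_az_only, "us-east"))
print(pick_replica(multi_region, "us-east"))
```

With the multi-zone-only deployment the helper finds nothing to route to, while the multi-region deployment falls back to the copy in the second region.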

How Cloud Users Might Change Their Ways

  • They’ll be a lot smarter about what AWS services they use, and how they use them. Interestingly, a good deal of coverage around this outage all but absolved AWS of responsibility and placed much of it on users that either didn’t know what they were getting into or didn’t want to spend the money necessary to ensure high availability. Going forward, those customers will have to seriously reconsider and likely re-architect their AWS application infrastructures, because they certainly won’t be getting much pity if a similar event takes them down in the future. This will be even more important for cloud users that process customer transactions or that otherwise use cloud computing to provide services to third parties. Knowing what we know now, companies need to put safeguards in place if they want to operate in the cloud and insulate themselves from liability.
  • Private clouds will look a lot better. AWS is actually fairly reliable, but it doesn’t offer the sense of control that private, on-premise clouds provide. AWS might actually be more reliable than on-premise clouds, but that doesn’t matter. Companies are more comfortable if they have some say over how an issue will be resolved. It’s the same type of cognitive bias that explains why people generally prefer driving to flying despite a lesser likelihood of dying in a plane crash than in a car accident. There are many very good reasons to choose private clouds — compliance, security and flexibility chief among them — and availability might now join that list, warranted or not.
  • OpenStack and vCloud will look a lot better. I’d say other cloud providers will look a lot better, but outages can happen, and have happened, to pretty much every cloud provider on the market. In fact, AWS has been among the best in terms of availability and probably will continue to be just that. But projects like OpenStack and VMware’s vCloud initiative offer the promise of many cloud providers all standardized on the same core infrastructure software. That can also be the same software customers are running in their private clouds. The opportunities for hybrid clouds are rich, as is just the promise of being able to move applications elsewhere if something goes down.
  • Amazon’s cloud business won’t be hurt a bit. There are two reasons for this: 1) anyone still using AWS likely will spend more money with the company to make their application architectures more resilient; and 2) AWS, as mentioned above, is still the best cloud around. The outage is a black eye, but it’s a black eye on what’s otherwise an Adonis of cloud computing. Nowhere else can developers have access to a suite of tools that spans from Simple Queue Service to Elastic MapReduce to GPU Instances on a 10 GbE network. For developers who want a robust portfolio of features and aren’t tied to the notion of “five nines” availability, AWS is still a great choice. Even if AWS bleeds a few customers from this outage, there’s plenty of new blood out there to replace them.
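
For a sense of scale on the “five nines” figure mentioned above, the arithmetic can be sketched in a few lines. This is a back-of-the-envelope illustration only: it assumes a 365-day year and, purely for the sake of the example, a hypothetical application that was unavailable for the full four days, which was not the case for every AWS customer. The function and variable names are made up for this sketch.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_budget_minutes(availability):
    """Minutes of downtime per year allowed at a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_after_outage(outage_minutes):
    """Yearly availability if the given outage were the only downtime."""
    return 1.0 - outage_minutes / MINUTES_PER_YEAR


# "Five nines" is 99.999 percent availability.
five_nines_budget = downtime_budget_minutes(0.99999)

# Hypothetical worst case: down for the entire four-day outage.
four_day_outage = 4 * 24 * 60
yearly_availability = availability_after_outage(four_day_outage)

print(f"Five-nines budget: {five_nines_budget:.2f} minutes per year")
print(f"Four-day outage: {four_day_outage} minutes")
print(f"Resulting yearly availability: {yearly_availability:.3%}")
```

Five nines leaves a budget of a little over five minutes a year, so an application that was down for anything close to four days is in a different league entirely, closer to 99 percent for the year under these assumptions.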

At the most, everyone involved in this outage learned harsh lessons about what it means to operate in the cloud. It doesn’t mean cloud computing or AWS is unreliable, or that customers that didn’t plan for disaster deserve ridicule. It means everyone involved needs to assess their options and act accordingly, though. We all knew this was coming: the big cloud event that would put an end to the age of cloud innocence. There are no excuses next time.

Image courtesy of Albert Bridge.

  1. Great explanation and assessment but I still would like to know what hardware or software failed, or is this a network design problem. Who will be assigned the blame? Thanks

    1. Derrick Harris Tuesday, April 26, 2011

      As I noted, the blame for the failure goes to AWS while the blame for unpreparedness goes to users. As for the cause, without a full postmortem from AWS, it sounds like a network design issue more so than any specific hardware or software failure.
