
Summary:

Reddit suffered some major downtime yesterday, a situation it blames largely on a failure on the part of Amazon Web Services. But the real takeaway might be the importance of cloud users making sure all is well with their cloud deployments.


Updated: Content-sharing site Reddit suffered some major downtime yesterday, a situation it says was largely the result of a failure on the part of Amazon Web Services, and one that will likely compel the site to roll back its usage of the Elastic Block Storage service. The specific problem was severe performance degradation for “a small subset” of EBS volumes in AWS’s US-EAST-1 data center, which just happened to affect the majority of the disks for Reddit’s Postgres and Cassandra database servers. Rather than looking at this as an example of why cloud computing is a bad idea, though, the real takeaway might be the importance of cloud users making sure all is well with their cloud deployments.

As Reddit systems administrator Jason Harvey notes in his detailed blog post on the outage, this isn’t the first time Reddit has experienced performance issues with EBS. In fact, performance woes had already spurred the site to start moving its Cassandra servers to local storage on its EC2 instances, and now it is considering doing the same for its Postgres servers. But aside from issues with EBS, Harvey also acknowledges a couple of things Reddit could have done better:

  • “One mistake we made was using a single EBS disk to back some of our older master databases (the ones that hold links, accounts and comments). Fixing this issue has been on our todo list for quite a while, but will take time and require a scheduled site outage. This task has just moved to the top of our list.”
  • “When we started with Amazon, our code was written with the assumption that there would be one data center. We have been working towards fixing this since we moved two years ago. Unfortunately, progress has been slow in this area.”

As Harvey explains, there are approaches users can take to avoid suffering this type of consequence, including using multiple Availability Zones or replicating storage across a greater number of disks. I’m not trying to absolve AWS of fault — it certainly deserves a fair amount in this instance, and in general if Reddit’s claims of spotty EBS performance are true — but the reality is that only users really suffer from this type of outage. I’ve written before about one-sided cloud computing terms of service (I actually don’t think they’re that unfair), the result of which is that providers probably won’t owe customers anything more than service credits (a small token after a major issue) even for the most severe outages, and they most certainly won’t owe more if customers didn’t design their cloud infrastructure as optimally as possible to avoid such a situation. It’s not that AWS isn’t to blame but, rather, that save for a small reputation hit, Reddit’s pain is no skin off AWS’s back. (It might be worth noting that, according to a recent study by CloudHarmony, AWS actually exceeded its SLA for Amazon EC2 over the course of late 2009 through late 2010, operating at 100 percent availability in its U.S. Availability Zones.)
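To make those approaches a bit more concrete, here is a minimal sketch, using the boto3 SDK, of provisioning EBS volumes across several Availability Zones and several disks per zone, so that no single degraded volume or zone holds the only copy of the data. This is an illustration of the general idea, not Reddit’s actual setup; the region, zone names, sizes and counts are assumptions.

```python
# Illustrative sketch: spread EBS volumes across Availability Zones and disks
# so a problem in one zone (or on one volume) doesn't hit every replica.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]  # one database replica per zone
VOLUMES_PER_REPLICA = 2                             # stripe each replica across several disks

volume_ids = []
for zone in ZONES:
    for _ in range(VOLUMES_PER_REPLICA):
        vol = ec2.create_volume(AvailabilityZone=zone, Size=200)
        volume_ids.append(vol["VolumeId"])

# Wait for the volumes to become usable before attaching them to the database
# hosts in their respective zones (instance IDs and device names would go here).
ec2.get_waiter("volume_available").wait(VolumeIds=volume_ids)
print("Created volumes:", volume_ids)
```

The point isn’t the specific API calls so much as the fact that spreading data across zones and disks is something the customer has to design in; the provider won’t do it automatically.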

In another recent situation, online ticketing startup TicketLeap suffered a database failure in its attempt to carry out opening-day sales for Comic-Con International atop an AWS-hosted infrastructure. In that situation, however, TicketLeap bore the brunt of the blame, shared perhaps a little with some questionable MySQL code. But, as was pointed out by at least one commenter afterward, there are known ways to resolve the specific issue that TicketLeap experienced. TicketLeap, it appears, just might not have done its homework, and was saved from complete disaster by leveraging the cloud’s flexibility and scaling down its server count to a manageable level.

Both Reddit and TicketLeap would certainly acknowledge the benefits of using AWS or any other cloud provider, though. If not for the ability to rent resources, TicketLeap might have spent a small fortune scaling up for Comic-Con only to crash. Even Reddit’s Harvey wrote:

Amazon’s Elastic Block Service is an extremely handy technology. It allows us to spin up volumes and attach them to any of our systems very quickly. It allows us to migrate data from one cluster to another very quickly. It is also considerably cheaper than getting a similar level of technology out of a SAN.
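For readers unfamiliar with that workflow, here is a hedged sketch of what spinning up an EBS volume, attaching it to one system and later moving it to another looks like through the EC2 API (again using boto3; the instance IDs and device name are placeholders, not values from Reddit’s environment):

```python
# Illustrative sketch of the "spin up a volume and attach it anywhere" workflow
# Harvey describes. IDs and the device name are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a fresh volume in the same Availability Zone as the target instance.
vol = ec2.create_volume(AvailabilityZone="us-east-1a", Size=100)
vol_id = vol["VolumeId"]
ec2.get_waiter("volume_available").wait(VolumeIds=[vol_id])

# Attach it to one database host...
ec2.attach_volume(VolumeId=vol_id, InstanceId="i-0123456789abcdef0", Device="/dev/sdf")
ec2.get_waiter("volume_in_use").wait(VolumeIds=[vol_id])

# ...then later detach it and attach it to a host in another cluster, which is
# what makes migrating data between clusters quick.
ec2.detach_volume(VolumeId=vol_id)
ec2.get_waiter("volume_available").wait(VolumeIds=[vol_id])
ec2.attach_volume(VolumeId=vol_id, InstanceId="i-0fedcba9876543210", Device="/dev/sdf")
```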

But the cloud isn’t perfect, even if it is actually far more reliable than many presume, which is why customers need to make sure their deployments are in order and architected as reliably as possible. Because when failures do happen, it’s only customers who end up paying the price.

Image courtesy of Flickr user Carolyn Coles.

  1. This article is inaccurate. We have no plans to leave AWS or EC2, nor cut back its use any time soon.

    1. Derrick Harris Friday, March 18, 2011

      Updated. I misconstrued the move to “local storage” to be in-house, given the situation. I see that the original post has been updated to clarify that it’s a move to EC2 local storage, though.

  2. I wonder if it’s true that only the customer suffers in cases like this? Reputations are fragile things, and mud sticks.

    This – and other – stories tend to be quick to mention the cloud provider when services such as Reddit go down. Here, the cloud provider would appear to bear some of the responsibility, so it’s probably justified. All too often, though, the outage is all down to the *customer*… yet their cloud provider gets named in the headline when the Dell server and Cisco switch in their local data centre never would…

    1. Derrick Harris Saturday, March 19, 2011

      That’s a fair point, Paul, but I think the cloud-growth numbers tell the story. AWS, Rackspace and Google have all been in the headlines for outages of some sort, and business is growing across the board. I think people tempted to put apps in the cloud realize how many apps are already running there, and how relatively few are affected by mistakes or issues wholly on the providers’ end. But site users get mad when their favorite site is down, and they blame the site.

      Too many of these stories, though — or one really big one — and it might be a different story.

  3. This is Reddit’s fault. They put all their eggs in one basket and didn’t QA the site. For them to blame Amazon for their own failures is laughable.

