S3 Outage Highlights Fragility of Web Services

Updated with Statement from Amazon: Amazon’s S3 cloud storage service went offline this morning for an extended period of time — the second big outage at the service this year. In February, Amazon suffered a major outage that knocked many of its customers offline.

It was no different this time around. I first learned about today’s outage when avatars and photos (stored on S3) used by Twinkle, a Twitter client for the iPhone, vanished.

My big hope was that it would come back soon, but popular S3 clients such as SmugMug were offline for more than eight hours — an awfully long time for Amazon’s Web Services division to bring back the service. As our sister blog, WebWorkerDaily, points out:

With two relatively serious outages in the space of 6 months, some will be asking the question of why depend on S3? The answer is simple: the rates are hard to beat, especially for service that doesn’t require any sysadmin budget.

That said, the outage shows that cloud computing still has a long road ahead when it comes to reliability. NASDAQ, Activision, Business Objects and Hasbro are some of the large companies using Amazon’s S3 Web Services. But even as cloud computing starts to gain traction with companies like these and most of our business and communication activities are shifting online, web services are still fragile, in part because we are still using technologies built for a much less strenuous web.

Update: Antonio Rodriguez, founder of Tabblo (now part of HP), asks the pertinent $64,000 question on his blog:

…if AWS is using Amazon.com’s excess capacity, why has S3 been down for most of the day, rendering most of the profile images and other assets of Web 2.0 tapestry completely inaccessible while at the same time I can’t manage to find even a single 404 on Amazon.com? Wouldn’t they be using the same infrastructure for their store that they sell to the rest of us?

Update #2: Building an offline redundancy for Amazon S3 could be a big opportunity, Dave Winer says.

Update #3: A reader sent me an email and asked these two questions:

  • Is the system designed to be fault tolerant? If yes, then how did it go down? After all, they must have massive arrays and mirrors of their storage infrastructure.
  • Is this a hardware failure or a software/design problem?

Random Thought: The S3 outage points to a bigger issue: the cloud has many points of failure, including routers crashing, cables getting accidentally cut, load balancers getting misconfigured, or simply bad code.

Update/Statement from Amazon in response to our questions:

As a distributed system, the different components of S3 need to be aware of the state of each other.  For example, this awareness makes it possible for the system to decide which redundant physical storage server to route a request to.

We experienced a problem with those internal system communications, leaving the components unable to interact properly, and customers unable to successfully process requests.  After exploring several alternatives, the team determined it had to take the service offline to restore proper communication and then bring service online again.

These are sophisticated systems and it generally takes a while to get to root cause in such a situation—we will be providing our customers with more information when we’ve fully investigated the incident.  We’re proud of our operational performance in operating S3 for almost 2.5 years, and our customers have generally been pleased with the reliability and performance of the service. But any downtime is unacceptable and we won’t be satisfied until it is perfect.

Amazon S3 is used heavily by a number of services behind Amazon’s retail websites.  Those services were impacted, but the retail website did not show noticeable problems because it mostly uses cached data.
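Amazon's last point, that the retail site rode out the outage on cached data, is worth illustrating. What follows is only a minimal Python sketch of the general idea and not Amazon's actual implementation, which the statement does not describe. The names `StoreUnavailable`, `CachedReader` and the `fetch` callable are hypothetical stand-ins for a storage client: every successful read is remembered, and when the backing store cannot be reached the last known copy is served instead.

```python
class StoreUnavailable(Exception):
    """Hypothetical error raised when the backing store (a stand-in for S3) is unreachable."""


class CachedReader:
    """Reads through to a backing store and falls back to cached copies during an outage."""

    def __init__(self, fetch):
        # fetch is an assumed callable: key -> bytes, raising StoreUnavailable on failure.
        self._fetch = fetch
        self._cache = {}

    def get(self, key):
        try:
            value = self._fetch(key)
        except StoreUnavailable:
            if key in self._cache:
                # The store is down, so serve the last copy we saw (possibly stale).
                return self._cache[key]
            # Nothing cached for this key, so the outage is visible to the caller.
            raise
        # The store answered: refresh the cached copy and return the fresh value.
        self._cache[key] = value
        return value
```

A site built this way keeps showing anything it has already fetched, which is one way a storefront could look healthy while the store behind it is down. Assets that were never cached, like the avatars that vanished from Twinkle, would still fail.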

