Updated with Statement from Amazon: Amazon’s S3 cloud storage service went offline this morning for an extended period of time — the second big outage at the service this year. In February, Amazon suffered a major outage that knocked many of its customers offline.

It was no different this time around. I first learned about today’s outage when avatars and photos (stored on S3) used by Twinkle, a Twitter client for iPhone, vanished.

My big hope was that it would come back soon, but popular S3 clients such as SmugMug were offline for more than eight hours — an awfully long time for Amazon’s Web Services division to bring back the service. As our sister blog, WebWorkerDaily, points out:

With two relatively serious outages in the space of 6 months, some will be asking the question of why depend on S3? The answer is simple: the rates are hard to beat, especially for service that doesn’t require any sysadmin budget.

That said, the outage shows that cloud computing still has a long road ahead when it comes to reliability. NASDAQ, Activision, Business Objects and Hasbro are some of the large companies using Amazon’s S3 Web Services. But even as cloud computing starts to gain traction with companies like these and most of our business and communication activities are shifting online, web services are still fragile, in part because we are still using technologies built for a much less strenuous web.

Update: Antonio Rodriguez, founder of Tabblo (now part of HP), asks the pertinent $64,000 question on his blog:

…if AWS is using Amazon.com’s excess capacity, why has S3 been down for most of the day, rendering most of the profile images and other assets of Web 2.0 tapestry completely inaccessible while at the same time I can’t manage to find even a single 404 on Amazon.com? Wouldn’t they be using the same infrastructure for their store that they sell to the rest of us?

Update #2: Building an offline redundancy for Amazon S3 could be a big opportunity, Dave Winer says.

Update #3: A reader emailed me these two questions:

  • Is the system designed to be fault tolerant? If yes, then how did it go down? After all they must have massive arrays and mirrors of their storage infrastructure.
  • Is this a hardware failure or a software/design problem?

Random Thought: The S3 outage points to a larger issue. The cloud has many points of failure: routers crash, cables get accidentally cut, load balancers get misconfigured, and sometimes the code is simply bad.

Update/Statement from Amazon in response to our questions:

As a distributed system, the different components of S3 need to be aware of the state of each other.  For example, this awareness makes it possible for the system to decide which redundant physical storage server to route a request to.

We experienced a problem with those internal system communications, leaving the components unable to interact properly, and customers unable to successfully process requests.  After exploring several alternatives, the team determined it had to take the service offline to restore proper communication and then bring service online again.

These are sophisticated systems and it generally takes a while to get to root cause in such a situation—we will be providing our customers with more information when we’ve fully investigated the incident.  We’re proud of our operational performance in operating S3 for almost 2.5 years, and our customers have generally been pleased with the reliability and performance of the service. But any downtime is unacceptable and we won’t be satisfied until it is perfect.

Amazon S3 is used heavily by a number of services behind Amazon’s retail websites.  Those services were impacted, but the retail website did not show noticeable problems because it mostly uses cached data.
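
Amazon’s point about cached data is easier to see in code. The sketch below is not Amazon’s implementation. It is a minimal, hypothetical read-through cache in Python that keeps serving the last good copy of an object when the backing store stops answering. The class name `StaleOnErrorCache`, the `fetch` callback, the TTL and the injectable clock are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Read-through cache that serves the last good copy when the origin fails.

    Hypothetical illustration only; not how Amazon's retail site is built.
    """

    def __init__(
        self,
        fetch: Callable[[str], bytes],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # assumed origin reader, e.g. an object-store GET
        self._ttl = ttl_seconds  # how long an entry counts as fresh
        self._clock = clock      # injectable so the behavior can be tested
        self._entries: Dict[str, Tuple[float, bytes]] = {}  # key -> (stored_at, value)

    def get(self, key: str) -> bytes:
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # fresh hit: the origin is not contacted
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                return entry[1]  # origin down: fall back to the stale copy
            raise                # nothing cached: the outage becomes visible
        self._entries[key] = (now, value)
        return value
```

A site built this way hides an origin outage only for objects it has already fetched at least once. Anything requested for the first time during the outage still fails, which is consistent with a store that "mostly" uses cached data showing no noticeable problems while clients reading directly from S3 went dark.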

  1. Good report Om.

    Maybe the problem isn’t only with the web services. The problem may be with those who hire an extremely cheap service expecting to have the same uptime as a more expensive service.

    I saw Scoble’s interview with a guy from HP research, and he pointed out that to make cheaper and greener data centers we should review the way we work on SLAs. Why have an extremely reliable (and expensive) platform when you are not storing and processing extremely vital data? Twitter profile pictures are not so vital, are they?

    I work at a small startup that is moving part of its databases and backups to Amazon. For the price, we don’t expect a 100% SLA at all. For the part of our applications where we need all kinds of redundancies in place, we know we will keep paying more.

    Best,
    Flavio
    Sao Paulo – Brazil

  2. Amazon never said that they shared the AWS infrastructure with Amazon e-commerce; the man said that they used the same design (on his high availability blog).

    The industry needs not only off-line mirroring of S3 type services, but third party certification and a pool of insurers for business continuity ratings and policies.

  3. @Alan

    The industry needs not only off-line mirroring of S3 type services, but third party certification and a pool of insurers for business continuity ratings and policies.

    Did you just give away a business idea :-)

  4. Also, I think the point Antonio is making is the “legend” part of the story. It would be worth asking them the tough question. :-)

  5. So, sometimes, your “real” hosting platform goes down. It just happens…

    You complain, and you get a (usually tiny) refund. I think the real $64,000 question is:

    Do you get better availability with AWS or your hosted platform?

    (I don’t use AWS, but am interested in the responses)

  6. I like these types of writeups. Just from the title I knew it’s one by Om and hence worth reading.

    To answer Antonio Rodriguez’s question: At CloudCamp a few weeks/months ago I asked the Amazon WS guy this exact question. His natural answer, which is what we all know and fear, is that even though S3/EC2 are the result of work on Amazon.com’s own infrastructure, they are in fact separate services which Amazon doesn’t yet entrust with its own web assets.

    Regardless of this, I believe these services have relatively great uptime and that they will improve. The community is reacting to AWS’s current position by using them only for background computational tasks and hosting stuff like avatars/images. SmugMug took a risk and got burned. When they get better we’ll see entire deployments moving over.

  7. Interesting times. I am sure that every IT department has experienced serious outages lasting 4 to 8 hours or longer, and at what cost? SLAs will become imperative for future SaaS-based technologies, and how we measure SLAs will also change. I believe that Amazon is just touching the tip of the iceberg with S3, and that they will continue to build redundancy, improve the architecture and moderate SLAs to support the future of web computing.

    In working with CIOs around the world, we have been talking about the power of smart technologies like S3 and of companies like Salesforce.com and OpSource, which are creating real value for their customers. Given the history of Amazon, I would expect them to minimize downtime and provide an overarching architecture that supports not only future growth but also the reliability that Amazon has proven over the last decade.

    Lonnie Wills
    Blog: http://saasevolution.blogspot.com/

  8. [...] GigaOM : S3 Outage Highlights Fragility of Web Services Tags: Amazon, Smugmug, [...]

  9. [...] Service (S3) suffered its second big service outage this year on Sunday, July 20, notes GigaOm. Om Malik reports that some S3 services were down for eight [...]

  10. [...] cloud all that great? However, this is indeed horrible for the whole “cloud” concept. As Om Malik points out, the last outage in February and now this certainly underlines the fact that web applications are [...]
