
Summary:

Amazon.com’s U.S. retail site became unavailable late Friday morning, West Coast time. The outage was more than just annoying, though; it was instructive. Here’s why it matters.

Amazon.com’s U.S. retail site became unavailable around 10:25 AM PST, and now appears to be back up. Amazon isn’t naming names; all that director of strategic communications Craig Berman would say was: “Amazon’s systems are very complex and on rare occasions, despite our best efforts, they may experience problems.”

Berman did confirm, however, that neither Amazon Web Services nor international sites were affected.

So what happened? Let’s look at the facts.

  • Traffic to http://www.amazon.com was reaching Amazon, so DNS was configured properly and sending visitors to Amazon’s data centers. Global server load balancing (GSLB) is the first line of defense when a data center goes off the air. Either GSLB didn’t detect that the main data center was down, or there was no spare to which it could send visitors.
  • When traffic hit the data center, the load balancer wasn’t redirecting it. This is the second line of defense, designed to catch visitors who weren’t sent elsewhere by GSLB.
  • If some of the servers died, the load balancer should have taken them out of rotation. Either it didn’t detect the error, or all the servers were out. This is the third line of defense.
  • Most companies have an “apology page” that the load balancer serves when all servers are down. This is the fourth line of defense, and it didn’t work either; a rough sketch of how these last two layers fit together follows this list.
  • The HTTP 1.1 message users saw shows that something that “speaks” HTTP was on the other end, so this probably wasn’t a router or firewall.
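
To make those last two layers concrete, here’s a minimal sketch of what a load balancer’s health checks and apology-page fallback amount to. The hostnames and the /healthz path are invented for illustration, and a real load balancer (let alone Amazon’s front end) does far more than this:

```python
# Poll each backend, drop the ones that fail, and fall back to an "apology
# page" when nothing is left in rotation. Hostnames and the /healthz path
# are placeholders, not anything Amazon actually uses.
import http.client

BACKENDS = ["app1.example.com", "app2.example.com", "app3.example.com"]
APOLOGY_PAGE = "<html><body>We're sorry, we'll be right back.</body></html>"

def is_healthy(host, timeout=2):
    """Return True if the backend answers a health-check request with 200."""
    try:
        conn = http.client.HTTPConnection(host, 80, timeout=timeout)
        conn.request("GET", "/healthz")   # hypothetical health-check URL
        ok = conn.getresponse().status == 200
        conn.close()
        return ok
    except (OSError, http.client.HTTPException):
        return False                      # timeouts, resets, garbage replies

def pick_backend():
    """Return a live backend, or None so the caller serves the apology page."""
    in_rotation = [host for host in BACKENDS if is_healthy(host)]
    return in_rotation[0] if in_rotation else None

backend = pick_backend()
if backend is None:
    print("All backends are out of rotation -- serving the apology page")
    print(APOLOGY_PAGE)
else:
    print("Routing traffic to", backend)
```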

This sort of thing is usually caused by a misconfigured HTTP service on the load balancer. But that kind of change typically happens late at night, gets detected, and is rolled back quickly. It could also result from a content delivery network (CDN) failing to retrieve the home page properly.

So my money’s on an application front end (AFE) or CDN problem. But as Berman notes, Amazon’s store is a complex application, and much of its infrastructure doesn’t follow “normal” data center design. So only time (and hopefully Amazon) will tell.

Site operators can learn from this: Look into GSLB, and make sure you have geographically distributed data centers (possibly through AWS Availability Zones); a rough sketch of the GSLB idea follows. It’s another sign we can’t take operations for granted, even in the cloud.
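
The sketch below shows the GSLB idea at its crudest: health-check each data center and only hand out DNS answers for the ones that respond. The region names and URLs are placeholders, and real GSLB products also weigh geography, latency and load; Amazon’s setup is certainly more elaborate:

```python
# Decide which data centers a GSLB-style resolver should still advertise.
# Region names and URLs are made up for illustration.
import urllib.error
import urllib.request

DATA_CENTERS = {
    "us-east": "http://us-east.store.example.com/",
    "us-west": "http://us-west.store.example.com/",
}

def region_is_up(url, timeout=3):
    """Count any HTTP answer below 500 as 'up'; anything else pulls the region."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        return err.code < 500   # the site answered, even if it was unhappy
    except Exception:
        return False            # timeout, connection refused, DNS failure...

def regions_to_advertise():
    return [name for name, url in DATA_CENTERS.items() if region_is_up(url)]

live = regions_to_advertise()
print("Advertising regions:", live if live else "none -- serve the apology page")
```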


  1. Satya Narayan Dash Friday, June 6, 2008

    Excellent and timely piece, Om.

    Not relevant to your blog, but I wanted to mention it: I hate your search. It is really bad. When I search for “GigaOM Show”, it does not even come up in the first 10 links.

    One more thing I would like to point out: I cannot go back using the back button (I use Opera), and I do not think you are using AJAX.

    Please do something on it, at least on search.

    Satya

  2. Sorry, it was written by Alistair. Thanks, Alistair, for the timely piece.

    Other comments: I really want the search to be better.

    Satya

  3. I humbly suggest the more accurate title “I Have No Idea Why Amazon Went Down, Either.”

  4. Oh, my, I hit my limit. Translation, please?

  5. Alistair Croll Friday, June 6, 2008

    Frankly, I wish I’d thought to grab the HTTP headers at the time, which would have told me a lot more.
    As a friend of mine points out, the thing that’s answering this “speaks” HTTP. Which means it’s likely a proxy of some sort — it’s conversant, but the thing behind it isn’t. And there were multiple DNS entries in the DNS response, so whatever happened blanketed several sites.
    This could be the caching layer, the security layer, or anything else designed to sit between the Internet and the application servers themselves.
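
    For anyone curious the next time something like this happens, even a bare request that prints the status line and response headers tells you a lot about what kind of device is answering; curl -I gets you the same information, and the Python below is just one way to do it:

    ```python
    # Fetch the front page and print the status line and headers; Server:,
    # Via: and X-Cache: style headers are the usual tells for proxies and CDNs.
    import http.client

    conn = http.client.HTTPConnection("www.amazon.com", 80, timeout=5)
    conn.request("GET", "/")
    resp = conn.getresponse()

    print("HTTP/%.1f %d %s" % (resp.version / 10, resp.status, resp.reason))
    for name, value in resp.getheaders():
        print("%s: %s" % (name, value))
    conn.close()
    ```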

  6. More than likely human error, an internal misconfiguration… as you can balance traffic before it ever hits the network…

    http://www.ultradns.com/technology/dnsshield.html

  7. Alistair Croll Friday, June 6, 2008

    Oh, and @Liz: Some kind of proxy service, one that likely lives in the data center in front of the app servers, died. That could be an HTTP-aware firewall or something else, likely a custom Amazon thing.

    The bigger point here is that there’s always a point of failure, and web systems are a complex mixture of technologies.

  8. Thanks, Alistair. I got lost somewhere there in the middle of Om’s analysis.

  9. Jason M. Lemkin Friday, June 6, 2008

    Time for all web services to be 100% transparent about uptime issues. Trust.salesforce.com led the way, and interestingly, the new Acrobat.com also has a health.acrobat.com service …

  10. Thanks for this uninformed piece. I forwarded it to my friends for a laugh.

    To anyone the least bit knowledgeable, it’s pretty obvious that whatever happened was a huge deal. It wasn’t as simple as “some front ends went down”. It was truly an epic fail.
