Why Amazon Went Down, and Why It Matters

By Alistair Croll | Friday, June 6, 2008 | 2:32 PM PT

Amazon.com’s U.S. retail site became unavailable around 10:25 AM PT, and now appears to be back up. Amazon isn’t naming names: all that Craig Berman, director of strategic communications, would say was, “Amazon’s systems are very complex and on rare occasions, despite our best efforts, they may experience problems.”

Berman did confirm, however, that neither Amazon Web Services nor international sites were affected.

So what happened? Let’s look at the facts.

  • Traffic to https://www.amazon.com was getting there. So DNS was configured properly to send traffic to Amazon’s data centers. Global server load balancing (GSLB) is the first line of defense when a data center goes off the air. Either GSLB didn’t detect that the main data center was down, or there was no spare to which it could send visitors.
  • When traffic hit the data center, the load balancer wasn’t redirecting it. This is the second line of defense, designed to catch visitors who weren’t sent elsewhere by GSLB.
  • If some of the servers died, the load balancer should have taken them out of rotation. Either it didn’t detect the error, or all the servers were out. This is the third line of defense.
  • Most companies have an “apology page” that the load balancer serves when all servers are down. This is the fourth line of defense, and it didn’t work either.
  • The HTTP 1.1 message users saw shows something that “speaks” HTTP was on the other end. So this probably wasn’t a router or firewall.
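
The lines of defense listed above can be sketched as a small routing model. This is an illustration only: the `DataCenter` and `Server` classes and the `route` function are hypothetical names, not anything Amazon is known to run, and real GSLB and load-balancer products make these decisions through DNS answers and health probes rather than one function.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Server:
    name: str
    healthy: bool = True


@dataclass
class DataCenter:
    name: str
    servers: List[Server] = field(default_factory=list)
    apology_page: Optional[str] = None  # static page for when every server is out
    reachable: bool = True

    def live_servers(self) -> List[Server]:
        # Unhealthy servers are taken out of rotation.
        return [s for s in self.servers if s.healthy]


def route(data_centers: List[DataCenter]) -> str:
    """Describe what a visitor would be served, layer by layer."""
    # GSLB (or, failing that, a redirecting load balancer) skips
    # data centers that are off the air.
    candidates = [dc for dc in data_centers if dc.reachable]
    if not candidates:
        return "no response (no data center reachable)"

    # Within a reachable data center, only healthy servers get traffic.
    for dc in candidates:
        live = dc.live_servers()
        if live:
            return f"page from {live[0].name} in {dc.name}"

    # All servers are out: serve an apology page if one is configured.
    for dc in candidates:
        if dc.apology_page is not None:
            return dc.apology_page

    # Nothing left to try; the HTTP-speaking layer can only report an error.
    return "503 Service Unavailable"
```

In this model, a single reachable data center whose servers are all unhealthy and which has no apology page falls through every layer and ends at the bare error, which is roughly the situation the bullets describe.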

This sort of thing is usually caused by a misconfigured HTTP service on the load balancer. But a change like that would normally be made late at night, detected, and rolled back. It could also be caused by a content delivery network (CDN) failing to retrieve the home page properly.

So my money’s on an application front end (AFE) or CDN problem. But as Berman notes, Amazon’s store is a complex application and much of their infrastructure doesn’t follow “normal” data center design. So only time (and hopefully Amazon) will tell.

Site operators can learn from this: Look into GSLB, and make sure you have geographically distributed data centers (possibly through AWS Availability Zones). It’s another sign we can’t take operations for granted, even in the cloud.


Comments (38)

  • Excellent and timely piece Om.

    Not relevant to your blog, but wanted to put. I hate your search. It is really bad. When I search for “GigaOM Show”, it does not even come in the first 10 links.

    One more which I would like to point out – can not go back using the back button (I use Opera) and I do not think you are using AJAX.

    Please do something on it, at least on search.

    Satya

    Satya Narayan Dash — 2:46 PM on June 6, 2008
  • Sorry. It is written by Alistair and thanks Alistair for the timely piece.

    Other comments: I really want the search to be better.

    Satya

  • I humbly suggest the more accurate title “I Have No Idea Why Amazon Went Down, Either.”

  • Oh, my, I hit my limit. Translation, please?

  • Frankly, I wish I’d thought to grab the HTTP headers at the time, which would have told me a lot more.
    As a friend of mine points out, the thing that’s answering this “speaks” HTTP. Which means it’s likely a proxy of some sort: it’s conversant, but the thing behind it isn’t. And there were multiple DNS entries in the DNS response, so whatever happened blanketed several sites.
    This could be the caching layer, the security layer, or anything else designed to sit between the Internet and the application servers themselves.

  • More than likely a human error, internal misconfig… as you can balance traffic prior to it hitting the network…

    http://www.ultradns.com/technology/dnsshield.html

  • Oh, and @Liz: Some kind of proxy service died that likely lives in the data center, in front of the app server. That could be an HTTP-aware firewall or something else, likely a custom Amazon thing.

    The bigger point here is that there’s always a point of failure, and web systems are a complex mixture of technologies.

  • Thanks, Alistair. I got lost somewhere there in the middle of Om’s analysis.

  • Time for all web services to be 100% transparent on uptime issues. Trust.salesforce.com led the way, and interestingly, the new Acrobat.com also has a health.acrobat.com service …

  • Thanks for this uninformed piece. I forwarded it to my friends for a laugh.

    To anyone the least bit knowledgeable, it’s pretty obvious that whatever happened was a huge deal. It wasn’t as simple as “some front ends went down”. It was truly an epic fail.

  • I do not agree with the “highly complex” claim Amazon is making. Replication or load balancing is not rocket science or something which deserves a Nobel prize. Tomcat gives it to you for free, though the volume of transactions will be low without code modification.

    Security layer is not the case, you would have got HTTPS headers. Neither is caching layer.

    What is most surprising is that the apology page did not work, too! It seems not to be a software failure, rather a complete blackout of hardware, which is stunning.

  • I think people expect too much from web services. It makes sense to report it as news, but Google News has some 200+ redundant stories about the site going down and still no one has an answer. I’m not going to worry too much about someone not being able to buy a Kindle for a whopping 3 hours!

  • This speculation is ridiculous; admitting Amazon’s system is complex and then continuing to guess at causes of the outage is a fool’s game. Without information, which this ‘article’ has none of, you have no idea what did or did not happen.

  • Wow. Are you sure you want to stick to being a journalist? You seem to have a calling (and perhaps aspire?) to be an operations guy. It would’ve been much more interesting if you had taken the business impact angle.

    By the way, have you ever run an operation of such size and complexity or do you just enjoy being a pundit? A very shallow and disappointing piece, I must say.

  • Yea, dude, Amazon is a mess. There is not a simple reason, it is literally 5-8 year old custom written perl and some other programming language I forgot.

    Every developer who works there has to carry a pager around during a rotation…what does that tell you?

    It is actually more surprising that Amazon only goes down as infrequently as it does.

    Ask anyone who has worked there, under the hood, it is hardly a pillar of robust well architected software. They got it to work and they pretty much just tack shit on at this point.

    Seattle-ite — 11:59 PM on June 6, 2008
  • Shoot! I meant Alistair’s analysis, not Om’s. I must remember to check the byline.

  • It’s not overly interesting that Amazon’s store was down from an online retailer perspective. It *is* interesting that the people selling EC2, S3, etc., can’t keep their core retail site up. More and more online startups are just thin wrappers around Amazon services. If they can’t keep one of their own sites up, why should I have confidence they can keep mine up? Yeah, they’re probably different internal groups, shoemaker’s children, blah blah; that’s their problem, not mine.

    fehwalker — 5:20 AM on June 7, 2008
  • @ satya,

    Since we use the wordpress.com platform, we are kind of limited in our search abilities by them. However, we are working with them to improve and streamline our search feature, and soon we will give you the ability to search across the network with proper headlines, synopsis, date, etc. Hang in there, buddy.

    @Liz, just to clarify, Alistair wrote the story, I didn’t.

  • word on the street: DOS Attack

  • Thanks Om. Appreciate your response.

    You know what – for one content search wrt to GigaOM Show, I had to google and point to your website.

  • Missed an important one.

    I am with Alistair’s article and initial analysis. Your website informed me first on this. And it will help me to influence my circle on decisions regarding the importance of robust software infrastructure. Thanks again.

    @Liz, none of my comments were aimed at you or anyone in person.

    It is this kind of news, comments and analysis, which makes software so lovable and exciting. And may be that is the reason your website is loved so much also.

    Satya

  • @ryan: Sorry you felt that way. Glad you got a laugh, then.

    @mel: I doubt it’s an outage due to a single product, given that they handle peak loads at Christmas and so on, although it’s possible. Remember Amazon got into the AWS business to use up excess capacity it had from such peaks.
    Further investigation into the custom headers on the site suggests there’s a proxy layer that wasn’t working; what’s strange is that the layer went down across all IPs (which, admittedly, seem to go to the same router).

    @peterb: From a business standpoint it looks like lots of people have already estimated the cost of downtime based on annual revenues. This analysis ignores two things: The peaks and valleys of purchasing patterns, and the fact that many of those buyers will simply return later. And yeah, I’ve helped run some big sites, FWIW, but most were standard three-tiered environments.

    @Geo: Websense and Narus are saying it’s probably not DDOS. The fact that traffic was working (albeit with the wrong HTTP contents) supports this to a degree.

  • No load balancer that I know of has a configurable “Apology page” when all real servers in the pool are unreachable. To the load balancer, the apology page would be seen as another active server and added to the rotation along with the normal httpd servers. Most of the time maintenance pages are manually brought up and down by the administrators.

    You’ve also left out an important, but interesting failure mode in your analysis. There’s a good chance that users were directed (through GSLB and/or standard load balancing) to a set of nodes which had functioning HTTP but a non-functional back end. This is a very difficult situation for the load balancers to detect, as a server is answering the phone but then doing nothing once the “call” has been completed.

    Too bad for Amazon, really. People need to take Internet operations far more seriously and understand that Amazon probably shares many of the same problems that Twitter is having right now: Rapid growth from increased demand, scaling, and redundancy.

  • Wasn’t amazon hit by the massive global botnet attack yesterday? My boss was talking about it. Youtube went down yesterday as well. Ditto imdb.

    zulubanshee — 2:18 PM on June 7, 2008
  • @john: You’re spot on about the detection, though: Administrators who do a simple HTTP up/down check on their load balancers, rather than looking for known strings in the page, wind up having “valid” but broken pages served to the outside world.
    BTW I know lots of sites that have a policy-based apology rule (using something like F5’s iControl) but as you point out, that only works when the load balancer knows something’s awry.
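
The difference between a simple up/down check and a check for known strings can be sketched as follows. The function names, the example URL, and the "Add to Cart" marker are all hypothetical; the fetcher is injected so the sketch runs without a network, whereas a real monitor would wrap an HTTP client and handle timeouts.

```python
from typing import Callable, Tuple

# A fetcher returns (status_code, body) for a URL. It is a stand-in for a
# real HTTP client so this example has no network dependency.
Fetcher = Callable[[str], Tuple[int, str]]


def shallow_check(fetch: Fetcher, url: str) -> bool:
    """Up/down check: any 200 response counts as healthy."""
    status, _ = fetch(url)
    return status == 200


def deep_check(fetch: Fetcher, url: str, expected: str) -> bool:
    """Content check: the page must also contain a known string."""
    status, body = fetch(url)
    return status == 200 and expected in body


def fake_fetch(url: str) -> Tuple[int, str]:
    # Simulates a front end that answers HTTP while its back end is dead:
    # the status is 200, but the page is an empty shell.
    return 200, "<html><body></body></html>"
```

With this fake back end, `shallow_check(fake_fetch, "http://backend.example/")` returns `True` while `deep_check(fake_fetch, "http://backend.example/", "Add to Cart")` returns `False`, so only the content check would pull the broken server out of rotation.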

  • we use http://www.winkstreaming.com they provide us with both global dns and dns load balancing, as well as content cache and site redirection. there are other companies like internap and even ultra dns that offer similar products… strange amazon should invest

  • @john adams – check out haproxy, keepalived, ldirector, mod_throttle (?)
    they all have automated sorry server capabilities. http://haproxy.1wt.eu/ is very sweet in my opinion, lots of features and flexibility and speed.

    The web servers or some other monitoring system (monit maybe) should have been able to detect the dead backend servers and remove those from the mix.

    I am off to get more salty snacks for the boss…

  • Alistair and Geo,

    It most certainly was a DOS attack, I assure you.

    And Seattle-ite is right about everything he says in his post, and that was in part why the DOS was possible.

    robotthink — 2:28 PM on June 8, 2008
  • Seems like the site is down again…

    I was thinking of Amazon as the Google of the shopping domain (item search, reviews, etc.).
    Time to re-think???

    Avneesh Balyan — 10:21 AM on June 9, 2008
  • @John Adams

    I think you haven’t been looking around very much. Modern load balancers (application delivery controllers) a la F5 BIG-IP, have configurable “apology” pages when all nodes are down. This technology has been around for quite a while, it’s not something new or unknown in the industry at all.

    Lori

  • Check out David Scott’s interview at the Business Forum: http://www.businessforum.com/DScott_02.html. It seems Amazon (and a lot of other organizations suffering data thefts, outages, bad projects, etc.) needs his book!

  • This may be a dumb question, but did the Amazon cloud go down too?

  • @john adams: Netscaler load balancers (as Amazon is rumored to use) can re-direct to a ‘Sorry Page’ when all backend services are down. It’s a simple config, and we require it on all our load balanced services.

    That, and the fact that the HTTP error page that was presented looks just like the one generated by Netscalers when operating in proxy mode, indicates to me that the load balancer layer was up & functional, but there was nothing behind it to which to send traffic, and that there was no redirect enabled.

    Having said that, re-directing a high volume site to a sorry page is a challenge itself. We maintain a load balanced pool of servers on a separate pair of load balancers, just to handle the sorry page from the load balanced applications.

    –Mike

  • @Mel: I think that rumor is hilarious too. It wouldn’t surprise me that a bunch of gamers writing scripts to auto-buy available items could crash Amazon. That would be awesome.

  • I still have the headers in my scroll back buffer:

    $ wget -S http://www.amazon.com -O /dev/null
    --14:22:58-- http://www.amazon.com/
    => `/dev/null'
    Resolving www.amazon.com... 72.21.210.11
    Connecting to www.amazon.com|72.21.210.11|:80... connected.
    HTTP request sent, awaiting response...
    HTTP/1.1 503 Service Unavailable
    Server: NS_6.1
    Content-Length:62
    Connection: close
    14:22:58 ERROR 503: Service Unavailable.

    It appears that they are indeed running Citrix Netscalers (Server: NS_6.1) which is what returned the 503 error you see above.

    “Sorry” pages only work if configured; they are not a default. Maybe Amazon hasn’t gotten around to that. ;)
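
For anyone who wants to automate this kind of reading of the headers, here is a minimal sketch that parses a captured response head and flags a proxy-generated error. The function names are hypothetical, and treating a `Server` value that starts with `NS_` as a NetScaler is the inference made in this comment, not a documented guarantee.

```python
from typing import Dict, Tuple


def parse_response_head(raw: str) -> Tuple[int, Dict[str, str]]:
    """Split a raw HTTP response head into a status code and a header dict."""
    lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
    # The status line looks like: HTTP/1.1 503 Service Unavailable
    status = int(lines[0].split()[1])
    headers: Dict[str, str] = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, headers


def looks_like_proxy_error(status: int, headers: Dict[str, str]) -> bool:
    # Assumption: a 5xx answer whose Server header starts with "NS_" came
    # from the load balancer itself rather than from an application server.
    return 500 <= status < 600 and headers.get("server", "").startswith("NS_")


# The response head as captured in the wget output above.
CAPTURED = """\
HTTP/1.1 503 Service Unavailable
Server: NS_6.1
Content-Length:62
Connection: close
"""
```

Calling `parse_response_head(CAPTURED)` yields status 503 and a `server` header of `NS_6.1`, and `looks_like_proxy_error` returns `True` for that pair.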

  • So how do you manage and diagnose such complex systems? You are probably talking about thousands of devices, any of which could be the root cause of the issue. How do you isolate the root cause? Classically, you’d have some type of monitor with rules to detect when certain issues occurred. This MIGHT point you to the right location. However, with IT being the main differentiator to the end customer, new changes are constantly being rolled out. The system is always growing and changing, and customer usage patterns keep shifting over time. So the rules originally written tend to get loosened so that you don’t have alert storms. Now, with the rules loosened up, you might not detect the failure and, even if you do, the events will not be as helpful in detecting the root cause.

    You could throw more and more smart people at the problem and constantly update and maintain your set of rules. However, as the system gets more and more complex, this human cost will grow at an enormous rate. You need something different: a tool that automatically detects issues and adjusts to the changing system and usage patterns. You need a tool that uses statistical analytics to weed through the noise of the system and determine the relationship between the business information and the IT information, so that you can quickly get to the root cause of issues like this.

    If you got a single alert that told you that the load balancers were getting a higher than average reconnect rate, your number of sales was dropping below normal, your average load on all web servers was way below normal, the number of search transactions was way below normal, the number of connected users was way below normal, etc., you would have a quick idea of where to start. No need to experience all your lines of defense being down again.
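
One simple way to read "statistical analytics" here is a z-score against each metric's own recent history, with all deviating metrics reported together as one alert. The sketch below is that swapped-in technique, not any particular product's method; the function names, the metric names, and the threshold of 3 standard deviations are all assumptions.

```python
from statistics import mean, stdev
from typing import Dict, List


def z_score(history: List[float], current: float) -> float:
    """How many standard deviations `current` sits from the historical mean.

    `history` needs at least two samples, because stdev() requires them.
    """
    spread = stdev(history)
    if spread == 0:
        return 0.0
    return (current - mean(history)) / spread


def correlated_alert(
    baselines: Dict[str, List[float]],
    current: Dict[str, float],
    threshold: float = 3.0,
) -> List[str]:
    """Return the sorted names of all metrics that deviate beyond the threshold."""
    return sorted(
        name
        for name, history in baselines.items()
        if abs(z_score(history, current[name])) >= threshold
    )
```

Because the baseline comes from recent history rather than a hand-written rule, it shifts as usage patterns shift. A flat-lined metric (zero spread) is treated as never deviating in this sketch, which a real tool would want to handle more carefully.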

  • It’s too bad about Amazon going down for a spell but maybe it inspired someone or several someones to check out their local library. I hope so.
