Why Amazon Went Down, and Why It Matters


Amazon.com’s U.S. retail site became unavailable around 10:25 AM PST, and now appears to be back up. Amazon’s not naming names — all that director of strategic communications Craig Berman would say was: “Amazon’s systems are very complex and on rare occasions, despite our best efforts, they may experience problems.”

Berman did confirm, however, that neither Amazon Web Services nor international sites were affected.

So what happened? Let’s look at the facts.

  • Traffic to http://www.amazon.com was reaching Amazon’s data centers, so DNS was configured properly. Global server load balancing (GSLB) is the first line of defense when a data center goes off the air. Either GSLB didn’t detect that the main data center was down, or there was no spare to which it could send visitors.
  • When traffic hit the data center, the load balancer wasn’t redirecting it. This is the second line of defense, designed to catch visitors who weren’t sent elsewhere by GSLB.
  • If some of the servers died, the load balancer should have taken them out of rotation. Either it didn’t detect the error, or all the servers were out. This is the third line of defense.
  • Most companies have an “apology page” that the load balancer serves when all servers are down. This is the fourth line of defense, and it didn’t work either (both of these last two checks are sketched in code after this list).
  • The HTTP 1.1 message users saw shows that something which “speaks” HTTP was on the other end. So this probably wasn’t a router or firewall.
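
To make the third and fourth lines of defense concrete, here’s a rough sketch of the logic in Python. The backend addresses and apology page are invented for illustration; a real load balancer does this in firmware or optimized C, but the idea is the same: probe each server, pull dead ones out of rotation, and fall back to a static apology page only when the whole pool is empty.

    # Sketch of a load balancer's health-check and apology fallback.
    # Backend addresses and the apology page are made up for illustration.
    import urllib.request

    BACKENDS = [
        "http://10.0.1.11/",   # hypothetical web servers behind the load balancer
        "http://10.0.1.12/",
        "http://10.0.1.13/",
    ]
    APOLOGY_PAGE = "<html><body>We're sorry, the store is temporarily unavailable.</body></html>"

    def is_healthy(url, timeout=2):
        """Third line of defense: a simple up/down probe of one backend."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    def pick_backend():
        """Return a live backend to proxy to, or None if every probe failed."""
        live = [b for b in BACKENDS if is_healthy(b)]
        return live[0] if live else None

    def handle_request():
        backend = pick_backend()
        if backend is None:
            # Fourth line of defense: serve the apology page instead of an error.
            return APOLOGY_PAGE
        return "proxy the request to " + backend

    if __name__ == "__main__":
        print(handle_request())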

This sort of thing is usually caused by a misconfigured HTTP service on the load balancer. But that kind of change typically happens late at night, gets detected, and gets rolled back. It could also come from a content delivery network (CDN) failing to retrieve the home page properly.

So my money’s on an application front end (AFE) or CDN problem. But as Berman notes, Amazon’s store is a complex application, and much of its infrastructure doesn’t follow “normal” data center design. So only time (and hopefully Amazon) will tell.

Site operators can learn from this: look into GSLB, and make sure you have geographically distributed data centers (possibly through AWS Availability Zones). It’s another sign that we can’t take operations for granted, even in the cloud.
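
If you haven’t looked at GSLB before, the decision it makes is conceptually simple: hand out the address of a data center that is actually healthy. Here’s a rough sketch of that decision in Python; the data center names, probe URLs, and addresses are invented, and a real GSLB product also weighs geography, load, and persistence before pushing its answer out through DNS.

    # Sketch of the core GSLB decision: advertise the address of a healthy
    # data center, preferring the primary. Names, probe URLs, and addresses
    # below are invented for illustration.
    import urllib.request

    DATA_CENTERS = [
        {"name": "primary", "probe": "http://198.51.100.10/health", "address": "198.51.100.10"},
        {"name": "standby", "probe": "http://203.0.113.10/health",  "address": "203.0.113.10"},
    ]

    def data_center_is_up(probe_url, timeout=2):
        """First line of defense: is this data center answering at all?"""
        try:
            with urllib.request.urlopen(probe_url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    def address_to_advertise():
        """The address DNS should hand out right now, or None if nothing is healthy."""
        for dc in DATA_CENTERS:
            if data_center_is_up(dc["probe"]):
                return dc["address"]
        return None

    if __name__ == "__main__":
        print(address_to_advertise())

Note that if the probe is too shallow (a bare up/down check against a front end that still “speaks” HTTP), this first line of defense never kicks in, which is consistent with what visitors saw during the outage.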

47 Comments

Avneesh Balyan

Seems like the site is down again….

I was thinking of Amazon as the Google of the shopping domain (item search, reviews, etc.).
Time to re-think???

robotthink

Alastair and Geo,

It most certainly was a DoS attack, I assure you.

And Seattle-ite is right about everything he says in his post, and that was in part why the DoS was possible.

Rooby Roo

@john adams – check out haproxy, keepalived, ldirectord, mod_throttle (?); they all have automated sorry-server capabilities. http://haproxy.1wt.eu/ is very sweet in my opinion: lots of features, flexibility, and speed.

The web servers or some other monitoring system (monit, maybe) should have been able to detect the dead backend servers and remove them from the mix.

I am off to get more salty snacks for the boss…

Games Lord

We use http://www.winkstreaming.com; they provide us with both global DNS and DNS load balancing, as well as content caching and site redirection. There are other companies like Internap and even UltraDNS that offer similar products… strange that Amazon hasn’t invested in something similar.

Alistair Croll

@john: You’re spot on about the detection, though: Administrators who do a simple HTTP up/down check on their load balancers, rather than looking for known strings in the page, wind up having “valid” but broken pages served to the outside world.
BTW I know lots of sites that have a policy-based apology rule (using something like F5’s iControl) but as you point out, that only works when the load balancer knows something’s awry.
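
If it helps, here’s the difference as a toy Python sketch; the URL and marker string are invented. The bare check passes as soon as anything answers with a 200, while the content check only passes if the page contains what it’s supposed to.

    # Toy contrast between a bare up/down probe and a content-aware probe.
    # The URL and marker string are invented for illustration.
    import urllib.request

    URL = "http://www.example.com/"
    MARKER = "Recommended for you"   # a string the real home page is known to contain

    def bare_check(url):
        """Passes as soon as anything answers with a 200, even a broken proxy."""
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                return resp.status == 200
        except OSError:
            return False

    def content_check(url, marker):
        """Passes only if the page actually contains what it is supposed to."""
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                return resp.status == 200 and marker in resp.read().decode("utf-8", "replace")
        except OSError:
            return False

    print("bare:", bare_check(URL), "content:", content_check(URL, MARKER))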

zulubanshee

Wasn’t Amazon hit by the massive global botnet attack yesterday? My boss was talking about it. YouTube went down yesterday as well. Ditto IMDb.

john adams

No load balancer that I know of has a configurable “Apology page” when all real servers in the pool are unreachable. To the load balancer, the apology page would be seen as another active server and added to the rotation along with the normal httpd servers. Most of the time maintenance pages are manually brought up and down by the administrators.

You’ve also left out an important, but interesting failure mode in your analysis. There’s a good chance that users were directed (through GSLB and/or standard load balancing) to a set of nodes which had functioning HTTP but a non-functional back end. This is a very difficult situation for the load balancers to detect, as a server is answering the phone but then doing nothing once the “call” has been completed.
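
Roughly, these are the cases a probe has to tell apart; the host below is a placeholder, and this is just a sketch of the idea, not how a real load balancer implements it.

    # Toy probe separating three failure modes: unreachable, answering the
    # phone but then saying nothing, and actually replying. Host is a placeholder.
    import socket

    def probe(host, port=80, timeout=3):
        try:
            sock = socket.create_connection((host, port), timeout=timeout)
        except OSError:
            return "unreachable or refused"
        try:
            sock.settimeout(timeout)
            sock.sendall(b"GET / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
            data = sock.recv(1024)
            return "replied" if data else "connection closed without a reply"
        except socket.timeout:
            return "answered the phone, then said nothing"   # the hard case to detect
        except OSError:
            return "dropped the connection mid-request"
        finally:
            sock.close()

    if __name__ == "__main__":
        print(probe("www.example.com"))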

Too bad for Amazon, really. People need to take Internet operations far more seriously and understand that Amazon probably shares many of the same problems that Twitter is having right now: Rapid growth from increased demand, scaling, and redundancy.

Alistair Croll

@ryan: Sorry you felt that way. Glad you got a laugh, then.

@mel: I doubt it’s an outage due to a single product, given that they handle peak loads at Christmas and so on, although it’s possible. Remember Amazon got into the AWS business to use up excess capacity it had from such peaks.
Further investigation into the custom headers on the site suggests there’s a proxy layer that wasn’t working; what’s strange is that the layer went down across all IPs (which, admittedly, seem to go to the same router).

@peterb: From a business standpoint it looks like lots of people have already estimated the cost of downtime based on annual revenues. This analysis ignores two things: The peaks and valleys of purchasing patterns, and the fact that many of those buyers will simply return later. And yeah, I’ve helped run some big sites, FWIW, but most were standard three-tiered environments.

@Geo: Websense and Narus are saying it’s probably not a DDoS. The fact that traffic was working (albeit with the wrong HTTP contents) supports this to a degree.

Satya

Missed an important one.

I agree with Alistair’s article and initial analysis. Your website informed me of this first, and it will help me influence my circle on decisions regarding the importance of robust software infrastructure. Thanks again.

@Liz, none of my comments were aimed at you or anyone in person.

It is this kind of news, comments, and analysis that makes software so lovable and exciting. And maybe that is the reason your website is loved so much as well.

Satya

Satya

Thanks Om. Appreciate your response.

You know what – for one content search regarding the GigaOM Show, I had to Google it and point to your website.

Om Malik

@ satya,

Since we use the WordPress.com platform, we are kind of limited in search abilities by them. However, we are working with them to improve and streamline our search feature, and soon we will give you the ability to search across the network with proper headlines, synopses, dates, etc. Hang in there, buddy.

@Liz, just to clarify, Alistair wrote the story; I didn’t.

fehwalker

From an online-retailer perspective, it’s not overly interesting that Amazon’s store was down. It *is* interesting that the people selling EC2, S3, etc., can’t keep their core retail site up. More and more online startups are just thin wrappers around Amazon services; if they can’t keep one of their own sites up, why should I have confidence they can keep mine up? Yeah, they’re probably different internal groups, shoemaker’s children, blah blah — that’s their problem, not mine.

Liz

Shoot! I meant Alistair’s analysis, not Om’s. I must remember to check the byline.

Seattle-ite

Yeah, dude, Amazon is a mess. There is not a simple reason; it is literally 5-8-year-old custom-written Perl and some other programming language I forgot.

Every developer who works there has to carry a pager around during a rotation…what does that tell you?

It is actually more surprising that Amazon only goes down as infrequently as it does.

Ask anyone who has worked there, under the hood, it is hardly a pillar of robust well architected software. They got it to work and they pretty much just tack shit on at this point.

peter b.

Wow. Are you sure you want to stick to being a journalist? You seem to have a calling (and perhaps aspire?) to be an operations guy. It would’ve been much more interesting if you had taken the business-impact angle.

By the way, have you ever run an operation of such size and complexity or do you just enjoy being a pundit? A very shallow and disappointing piece, I must say.

lodestar

This speculation is ridiculous; admitting Amazon’s system is complex and then continuing to guess at causes of the outage is a fool’s game. Without information, which this ‘article’ has none of, you have no idea what did or did not happen.

john

I think people expect too much from web services. It makes sense to report it as news, but Google News has some 200+ redundant stories about the site going down, and still no one has an answer. I’m not going to worry too much about someone not being able to buy a Kindle for a whopping 3 hours!

Satya

I do not agree with the “highly complex” claim that Amazon is making. Replication or load balancing is not rocket science or something that deserves a Nobel prize. Tomcat gives it to you for free, though transaction volume will be low without code modification.

The security layer is not the issue; you would have got HTTPS headers. Neither is the caching layer.

What is most surprising is that the apology page did not work either! It seems not to be a software failure but rather a complete hardware blackout, which is stunning.

ryan

Thanks for this uninformed piece. I forwarded it to my friends for a laugh.

To anyone the least bit knowledgeable, it’s pretty obvious that whatever happened was a huge deal. It wasn’t as simple as “some front ends went down”. It was truly an epic fail.

Jason M. Lemkin

Time for all web services to be 100% transparent on uptime issues. Trust.salesforce.com led the way, and interestingly, the new Acrobat.com also has a health.acrobat.com service …

Liz

Thanks, Alistair. I got lost somewhere there in the middle of Om’s analysis.

Alistair Croll

Oh, and @Liz: Some kind of proxy service, one that likely lives in the data center in front of the app server, died. That could be an HTTP-aware firewall or something else, likely a custom Amazon thing.

The bigger point here is that there’s always a point of failure, and web systems are a complex mixture of technologies.

Alistair Croll

Frankly, I wish I’d thought to grab the HTTP headers at the time, which would have told me a lot more.
As a friend of mine points out, the thing that’s answering “speaks” HTTP, which means it’s likely a proxy of some sort: it’s conversant, but the thing behind it isn’t. And there were multiple DNS entries in the DNS response, so whatever happened blanketed several sites.
This could be the caching layer, the security layer, or anything else designed to sit between the Internet and the application servers themselves.
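
For next time, capturing that evidence takes only a few lines; the URL below is just a placeholder, and curl -I from a shell does much the same thing.

    # Grab and print the response status and headers, which is the kind
    # of evidence worth saving during an outage. The URL is a placeholder.
    import urllib.error
    import urllib.request

    req = urllib.request.Request("http://www.example.com/", method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            print(resp.status, resp.reason)
            for name, value in resp.getheaders():
                print(name + ": " + value)
    except urllib.error.HTTPError as err:
        # Even an error response carries useful headers (Server:, Via:, and so on).
        print(err.code, err.reason)
        for name, value in err.headers.items():
            print(name + ": " + value)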

Dan

I humbly suggest the more accurate title “I Have No Idea Why Amazon Went Down, Either.”

Satya

Sorry, it was written by Alistair. Thanks, Alistair, for the timely piece.

Other comments – I really want the search to be better.

Satya

Satya Narayan Dash

Excellent and timely piece, Om.

Not relevant to your post, but I wanted to mention it: I hate your search. It is really bad. When I search for “GigaOM Show”, it does not even come up in the first 10 links.

One more thing I would like to point out – I cannot go back using the back button (I use Opera), and I do not think you are using AJAX.

Please do something about it, at least the search.

Satya
