Five Nines on the Net is a Pipe Dream

30 Comments

The New York Times today finally got around to noticing that when web sites go down, people are increasingly likely to get mad and generally react the way I might if I drove to my favorite bar and found it closed for a private party. I might be miffed and share a few choice words with members of my party before deciding on a new locale. When we write blogs or tweets (if Twitter is up), however, the inconvenience and our subsequent vitriol are archived forever and transmitted around the world rather than just to our friends. And because millions of other people want to go to that same bar, the chorus of curses grows quickly.

We’ve written about how hard it is to achieve the 99.999 percent uptime championed by the telecommunications industry, but suffice it to say there are a ton of moving parts involved in keeping a site visible to end users; the list begins with the network architecture and ends with the internet connection of a consumer in Austin. Along the way there are software upgrades, server shortages, DNS issues, cut cables, corporate firewalls, carriers throttling traffic and infected machines.

The Times notes that downtime is more than just inconvenient: As more data is stored online and cloud computing becomes more prevalent for businesses, it’s less like a bar closing for a night than a bank closing for a day. But it will never be possible to keep all sites across the entire web up 99.999 percent of the time. Knowing that, architecting for failure, more services such as downforeveryoneorjustme.com (I would really love a more memorable name for this site), and helpful 404 pages would be appreciated.


Mike

Quote: “Knowing that, architecting for failure, more services such as downforeveryoneorjustme.com (I would really love a more memorable name for this site), and helpful 404 pages would be appreciated.”

Your wish is granted: http://isitfucked.com

Enjoy!

Dave Burstein

Stacey, Om

When I saw your 5 9’s headline, I first tried to think of any DSL or fiber networks that were built for that. Even after I saw you were discussing the other end, the web sites, I pondered that question. Our Verizon DSL line has come close to that over the last year, while my Time Warner cable drops several times a day for a few seconds and they refuse to fix it. FiOS and the other fiber builds are designed for very good reliability, including no active elements in the field.

I’m very aware that broadband networks absolutely have not been engineered for anything close to the 5 9’s of the traditional telcos, with single points of failure common in DSL networks. The new fiber nets may be superior, but that’s not yet proven.

But I did realize at least one Internet service that is extraordinarily reliable. The NY ISP Panix has not allowed my email to fail in over five years, and the founder, Alexis Rosen, told me a while back it hadn’t gone down for more than a few hours since Panix was founded in the early 1990s as one of the world’s first ISPs.

Reliability is not impossible in the Internet world.

Dave Burstein

Chris Parente

Interesting post. There are bigger brains than mine in this comment thread, but some thoughts I had while reading:

As noted, there’s a big difference between five nines for a site and five nines for the end user

I believe five nines equates to roughly five minutes of downtime a year, though I can’t remember the formula for that number. It’s quite possible there are some businesses that will never need that level of performance
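The half-remembered formula is simply the complement of the availability target multiplied by the minutes in a year. A quick sketch, assuming a 365-day year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget(availability: float) -> float:
    """Minutes of downtime per year allowed at a given availability level."""
    return MINUTES_PER_YEAR * (1 - availability)

for label, avail in [("three nines", 0.999),
                     ("four nines", 0.9999),
                     ("five nines", 0.99999)]:
    print(f"{label}: {downtime_budget(avail):7.2f} minutes/year")
```

Five nines leaves about 5.26 minutes of downtime for the whole year, which matches the commenter’s recollection; four nines allows roughly 53 minutes, and three nines almost nine hours.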

DNS is often not given enough attention from a reliability standpoint — a point no doubt made in David’s commentary with Om. Hopefully DNS’s visibility will rise along with cloud computing. It’s also critical for enterprise VoIP deployments

Speaking of DNS security, it will be very interesting to see where ICANN sets the technology bar in its RFPs for the brand-new TLDs — scuttlebutt suggests the standards will be on the low side

More informative 404 pages open the door to redirects for advertising purposes. I’m not saying that’s necessarily bad, but it adds complexity and can cause security problems (Earthlink), depending on how it’s implemented

Edmund Elkin

Is 5 9’s necessary? It depends upon the specific service, or the type of service, from that website. For example, if it is a road-traffic-conditions site, then no worries if Site A is down; I just go to Site B, which is especially easy if I use a search engine to initially identify those sites (that keeps the bookmarks under control). On the other hand, if it is where I buy tickets for the big baseball game, or my 401(k) site, then I really want access now, and unavailability is not acceptable.
So, it would be useful to have a variable mechanism for ensuring QoS. That would save on the total investment / price, and ensure quality is there when I need it.
Also, remember… 5 9’s refers to “availability” not reliability. Or as the original article correctly referred to it: “up time.”
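The availability-versus-reliability distinction the commenter draws can be made concrete: steady-state availability is conventionally computed from mean time between failures (MTBF, a reliability measure) and mean time to repair (MTTR). A minimal sketch with illustrative numbers, not measurements:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the long-run fraction of time the service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A service that fails only once a year (MTBF ~ 8,760 hours) but takes
# 30 minutes to restore is very reliable, yet still short of five nines:
print(f"{availability(8760, 0.5):.5f}")  # ~0.99994, i.e. about four nines
```

The same failure rate with a 30-second automated failover instead of a 30-minute manual repair would clear five nines, which is why the distinction matters in practice.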

austinandrew

@ Ross – did you read the article? You write “Go start monitoring Merrill Lynch or UBS and see how often their site goes down.” The point is that even if these guys were up 100 percent of the time, the infrastructure from their server to your home won’t be. That’s why VoIP sucks. Even if the VoIP provider is good, the service is only as good as your cable modem.
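The weakest-link point can be quantified: independent components in series multiply their availabilities, so the end-to-end figure is dragged down by the worst hop. The numbers below are illustrative guesses for each hop, not measurements of any real network:

```python
from math import prod

# Hypothetical availabilities along the path from a well-run site to a living room
path = {
    "origin servers":   0.99999,  # the site itself hits five nines
    "hosting network":  0.9999,
    "backbone transit": 0.9999,
    "last-mile ISP":    0.999,
    "home cable modem": 0.995,
}

end_to_end = prod(path.values())
print(f"end-to-end availability: {end_to_end:.4f}")  # ~0.9938
```

Even with a five-nines origin, the user experiences roughly two nines, because the product is dominated by the flakiest links at the edge.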

Thomas

Five nines is very doable — getting a VC or dot-commer who runs their business off of crappy boxes stuffed in their dorm rooms to pay for the required technology is a pipe dream. Hire infrastructure vendors that play in the communications world and they’ll engineer a 5-9’s system, no problem. Continue to get your technology from open source and Fry’s and you’ll be lucky to get 3.

Ross

Om,

I guess we have to agree to disagree. People notice when they can’t get to their financials. Let’s pick a better site like Etrade or Ameritrade. These companies need uptime, and even a little downtime can cost them millions.

An even better example is ebay and you can check out their track record here:
http://uptime.pingdom.com/site/month_summary/site_name/www.ebay.com (so far in 2008)

or

http://uptime.pingdom.com/site/month_summary/site_name/www.yahoo.com (look at 2007)

Just because Twitter goes down often doesn’t mean five nines isn’t achievable.

Mark

I love that Ross includes a site that only pings every 5 minutes to try to measure the 9’s of a site. Five nines is about 5.26 minutes of downtime a year, so pingdom.com would be unlikely to detect whether a 5-9’s site was ever down.

To even attempt to determine whether a site might be 5 9’s, you’re going to have to probe a lot more frequently than that. And that’s just getting a ping response. Most sites would consider availability to include actually being able to deliver a web page, not just answering a network ping.
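Mark’s objection can be put in numbers. For a fixed probe interval and a randomly timed outage, the chance that any probe lands inside the outage is roughly the outage length divided by the interval, capped at 1. Splitting the five-nines yearly budget into shorter outages makes each one increasingly invisible; a sketch, assuming the 5-minute interval described above:

```python
POLL_SECONDS = 5 * 60  # one probe every 5 minutes, as described above

def detection_prob(outage_seconds: float) -> float:
    """Chance a fixed-interval probe with random phase lands inside one outage."""
    return min(1.0, outage_seconds / POLL_SECONDS)

budget = 5.26 * 60  # the entire five-nines yearly budget, ~316 seconds
for n in (1, 5, 20):
    d = budget / n  # budget split into n equal outages
    print(f"{n:>2} outages of {d:5.1f}s each: {detection_prob(d):.0%} chance of catching one")
```

A single outage that burns the whole budget is just longer than the interval and will always be caught, but twenty 16-second blips are each seen only about 5 percent of the time, so a coarse monitor systematically underestimates downtime.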

Travis

Even Google rarely maintains five 9’s: http://royal.pingdom.com/?p=192
Granted, the article is almost a year old, but only one country’s Google site was at 99.999 percent. And Google has an enormous infrastructure investment and can afford the type of redundancy required to maintain that kind of availability. Can it be done? Absolutely. Will most smaller organizations have the resources to do it? Probably not, at least until the cost of a massively redundant infrastructure comes down.

Stacey Higginbotham

Guys, there may be sites that experience five nines (do we count planned outages?), but the point of the post is that that’s hard to do because there are so many moving parts, and when a site does go down, people notice and news spreads quickly. It’s impossible to believe that right now a site will be both up and available to ALL users 99.999 percent of the time.

And on the web it only takes a few users having problems to damage the brand.

Om Malik

Ross,

Two things:

1. There is a huge difference in the size/audience/usage/money spent by Wall Street and other enterprises versus regular consumer-facing web services, which is what Stacey was trying to point out.

2. When ML and UBS go down, no one notices, and they certainly don’t share that information.

Last point: please stop including your URL in the message, as it comes across as too self-promotional. Not that there is anything wrong with that.

Ross

Om,

What are you talking about? Many companies pull off five nines in a year. Maybe not your latest-and-greatest Web 2.0 startup, but you’ll be hard-pressed to find traditional companies’ websites down. Go start monitoring Merrill Lynch or UBS and see how often their site goes down.

Ross
http://www.hostdisciple.com

Om Malik

@ pwb

Thank you for your short and sweet comment. Is it doable? Of course it is. Has it been done? Of course not. It is a pipe dream because we are not even close to being done in terms of technology, both hardware and software.

Om Malik

David,

You are being presumptuous in assuming someone doesn’t know how the Internet works. Care to expand? Don’t leave a hanging comment without giving a reason.

David Ulevitch

“But it will never be possible to keep all sites across the entire web up 99.999 percent of the time.”

This represents a fundamental misunderstanding of how the Internet works.

Aman Sehgal

Hi Stacey,
I agree that achieving five 9s for the Internet is almost impossible. Servers also need some amount of time to rest :) and we cannot get away from that. There is always a time for every site to be down, and no one knows when that D-Time will come. You mentioned that 404 pages should be more informative; I feel the same. If the actual reason the site isn’t working is displayed, then in one sense the site is NOT out of order, i.e., it is working in one aspect and not in another. That information would in fact help the netizen roughly estimate when the site will be up, running and available for browsing.

pwb

99.999 is a pipe dream for one simple reason: it’s unnecessary. There are very few web-based services that truly need that kind of uptime. Precious company resources are better spent elsewhere.

Roland Dobbins

Achieving five nines in one’s own public-facing infrastructure is most certainly doable, and it is in fact done all the time. The problem is that you’re dependent upon the infrastructure of others (SPs, enterprise networks, mobile networks, the users’ computers/OSes/applications, et al.), over which you’ve no control, to deliver packets to you; and once packets leave your infrastructure, they’re once again dependent upon the infrastructure of others, over which you’ve no control, to reach their destinations unmolested.

So, the assertion that ‘Five Nines on the Net is a Pipe Dream’ is really missing the point; it’s more like ‘The Definition of Availability on the Internet is Elusive Due to its Very Nature’, or something along those lines.

Now, a more cogent and useful essay would be one on why so many ‘vital’ Web 2.0 companies simply don’t build their applications and infrastructures so that they can scale, and why, even when they’re wildly successful, they don’t implement the well-known best current practices (BCPs) that would maximize availability and resiliency within their own spans of control.

Petabro

It took the PSTN almost 100 years to get to 99.999 percent reliability. The Net will take much longer, probably not until Cisco becomes as bureaucratic as the old Western Electric (AT&T Network Systems/Lucent).

Comments are closed.