99.999… The Quest for Reliability on the Internet


When it comes to the operations of Internet businesses, 99.999 percent uptime, or five nines, is one of the critical metrics of reliability. Yet that metric — essentially the ability to say that users will reliably be able to reach a business’ web site 99.999 percent of the time — still eludes nearly all of them.

99.999 percent uptime for a web site equates to just 5.26 minutes of downtime per year. That is the total amount of downtime — planned or unplanned — as seen by users. According to a report last month by Pingdom, only three of the top 20 most popular web sites achieved this metric in 2007: Yahoo, AOL and Comcast’s site for high-speed Internet customers (eBay’s site was close, with only six minutes of downtime in 2007). Another report by Pingdom shows that most of the popular social networks did not achieve even three nines (no more than 525.6 minutes of downtime per year) in the first four months of 2008. Moreover, none achieved anywhere close to 99.999 percent uptime.
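For those keeping score, the arithmetic behind those downtime budgets looks like this (a quick sketch in Python):

```python
# Downtime budget per year implied by a given uptime percentage.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(uptime_pct: float) -> float:
    """Minutes of allowable downtime per year at the given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% uptime -> {downtime_minutes_per_year(pct):7.2f} minutes of downtime per year")
# 99.0%   -> 5256.00
# 99.9%   ->  525.60
# 99.99%  ->   52.56
# 99.999% ->    5.26
```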

When I ask people about web sites that are down more than they would like, the most common response I hear is that the web site is a nice-to-have feature of their lives and not a critical element that they absolutely rely upon. If one search engine is down, you can use another. If a social network is down, then there are other ways of reaching your friends.

In other words, web sites are seen as unreliable, a perception that drives down user adoption, increases churn, reduces page views, limits ad impressions and increases abandoned shopping carts. I believe the converse is also true — highly reliable web sites enjoy high user loyalty and return rates, lower churn, more page views, more advertising revenue and more sales.

While some may dispute the accuracy of Pingdom’s measurements, clearly being unreliable is not a trait that any person or web site strives to achieve. An unreliable person in life is not someone you invite home to meet your parents, and not someone you want working on your critical business project. When interviewing candidates for jobs at my portfolio companies, I have never heard anyone cite being unreliable and absent more than expected as a redeeming quality — so why is that same quality so visible on web sites across the Internet?

It’s always better to be known as the reliable person in any situation, whether for personal or business reasons. I wish that the Internet web sites I use frequently in both aspects of my life had the same redeeming quality. Is 99.999 percent uptime too much to ask?

27 Comments

Jonathan Heiliger

This has been said by other commenters… there is generally an equivalent (or greater) order-of-magnitude cost increase with every added 9. Furthermore, reliability typically comes at the expense of innovation; Gartner et al. have cited that >50% of downtime is caused by human error. That’s not to say that intelligent systems can’t be designed to compensate, but that drives my earlier point about incremental cost.

In this age of web 2.0isms, developers prefer the flexibility to innovate over reliability. Building reliable systems becomes more important as an application begins to mature.

Allan Leinwand

@Ken Godskind – Since S3 only promises 99.9% uptime, if I were running a top Internet site I doubt I would use that service, given that it immediately lowers my reliability threshold.

@JonK – yep, you’re right. Thanks for pointing that out.

@Mike Nolet – good point. For most high traffic sites every hour is peak hour given the global nature of Internet traffic.

Mike Nolet

One thing nobody seems to have mentioned is that there are huge differences between one 99.9% site and another. If a site goes down for an hour at peak time once a month, that will piss off users quite a bit more than going down for 2 minutes every night.
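To put some made-up numbers on that (the traffic figures are purely illustrative):

```python
# Hypothetical traffic figures, purely for illustration: the same 60 minutes
# of monthly downtime hurts very differently depending on when it happens.
peak_concurrent_users = 1_000_000
off_peak_concurrent_users = 50_000

one_peak_outage = 60 * peak_concurrent_users        # one 60-minute outage at peak
nightly_blips = 30 * 2 * off_peak_concurrent_users  # 2 minutes a night for a month

print(one_peak_outage / nightly_blips)  # ~20x more user-minutes lost to the peak outage
```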

JonK

You might want to check that source: the reality is even worse than you think if the reference is correct.

See, the post in question is from April *2007*: in other words, those numbers are for the first three months of last year, in which case 5 9’s would look something more like 80-odd seconds and everyone bar Yahoo gets a resounding helping of FAIL.

Ken Godskind

I have been talking to service providers about this topic for years. When taking the public network and access into account, it simply is not possible to reliably achieve five 9s from the end-user’s perspective.

There are just too many factors outside the control of any service provider or enterprise, like peering or routing issues between the data center and the user/customer.

Of course, with the appropriate software, equipment and procedures it should be possible to achieve five 9s from the point of demarc where the computing resources connect to the public network, assuming you are choosing the appropriate two of the good, fast and cheap set mentioned above.

Probably the best place to start is understanding the end-to-end experience you are currently delivering, using the appropriate website monitoring technology.

It is interesting to note that even Amazon’s S3 online storage only promises 99.9% uptime: http://tinyurl.com/4cbmfo.

Allan Leinwand

@Scott Rafer – AOL, Comcast and Y! would more than likely have lost users faster if they really did have 525 minutes (8.76 hours) of downtime in a year. Those Web 1.0 companies have built a reputation for being reliable, and that is a plus for many users. I’ll be in violent agreement with you about usefulness, but that is for a different post :)

@Rodrigo – yes, fast page load (preferably faster than the mythical 8-second human patience limit) is necessary. Websites that are down load in an infinite amount of time.

@Paul – I don’t recall my Internet access being down almost 9 hours last year. The FB, Google and S3 outages are a major issue – how long do we have to endure their suckiness?

Daniel Udsen

99.0% is more than anyone would ever demand of any real-world infrastructure, as long as downtime doesn’t mean dead people. Not even the power grid has 99.999% uptime.

99.999% is really reserved for airplane backup systems, safety equipment and medical equipment like pacemakers.

It’s not that the web is seen as less reliable; it’s just that the world has always been like that, and 99% uptime is better than most railways. The fact that we’re even talking about 99.99% uptime on web pages tells the exact opposite story: people see the web as more reliable than the real world.

Paul

First: Any of these sites is probably more reliable than any one user’s internet access.

Second: As the six sigma folks figured out, a complex six sigma final product requires its components to be orders of magnitude (more 9’s) more reliable, redundant, or both.

When Facebook wants to be the web’s user directory, and Google wants to be its directory, and Amazon is pitching to be its storage system, outages (and upgrade incompatibilities) will stack. It will suck for a while at least, as anyone who has tried to retrieve a United Airlines itinerary at the last minute can probably understand.
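A rough sketch of how that stacking plays out, with the 99.9% figures as assumptions rather than measurements of any of these services:

```python
# If a page depends on several services in series, its availability is roughly
# the product of the individual availabilities, so each component needs more 9s
# than the overall product they add up to.
def composite_availability(*availabilities: float) -> float:
    result = 1.0
    for a in availabilities:
        result *= a
    return result

# Three hypothetical dependencies, each at 99.9% uptime:
overall = composite_availability(0.999, 0.999, 0.999)
print(f"{overall * 100:.3f}%")  # ~99.700%, i.e. roughly 1,575 minutes of downtime a year
```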

Stephen Pierzchala

As someone who works for a Web performance measurement company (not Pingdom, and Google will provide the answer), I can say that Pingdom’s numbers are likely HIGH.

It’s not just a matter of being up/available. The site has to be usable. It not only has to be up and fast, it has to be consistent. It has to be efficient.

As we have seen with a number of fast growing social networking businesses / projects, there is a struggle to cope with exponential user growth and system scalability. I am not an expert in these areas; I am an analyst who reports on the times where this does not work.

Allan wrote earlier about the madness that occurs when you ignore the network. Now he is pointing out that systems are human creations.

99.999% is nearly impossible. The objective is to aim for a site that is consistent, predictable, and treats its customers as rational human beings, not “users”.

@spierzchala

Scott Rafer

Investing in anything beyond three nines is a waste of time and money. Invest in meeting your users’ needs quickly, and they’ll be thrilled to forgive you the other 500 minutes a year. It’s exactly what AOL, Y!, and Comcast don’t do and why fewer users love them every day.

Allan Leinwand

@Robert Accettura – it’s been a long time since a website was connected via an ISDN link, true enough. Today’s sites are complex and do use CDNs (which should improve uptime), but the increased complexity should lead to an increase in both functionality and reliability. More features and functions that are unreachable and unusable do not really help with pageviews, transaction rates or sell-through rates.

@James Byers – Points well taken. I guess I’ll look for alternative sites to the ones I typically use until one proves more reliable than the other.

@Tony Li – I’m just surprised that one of the criteria that you chose was “cheap” :) :)

@NickH – I did not have RFC1925 in mind, but I see your reference to (7a) for Tony… I tend to think that (4) applies here as well ;)
http://www.faqs.org/rfcs/rfc1925.html

NickH

Ahh RFC 1925, words to live by.

As an ops person intimately familiar with uptime, I can tell you that you get what you pay for. Development speed always comes at the expense of reliability. Nine times out of ten, an app/site is not built with scalability or reliability in mind. As it grows, you are forced to re-architect and scale (read: replace) each piece of infrastructure within the context of a startup budget and with as little downtime as possible. The faster the growth, the more difficult it becomes. Think of it like trying to change the engine in a car while moving 60 miles per hour: it is a hard, complex task, and sometimes you are going to crash. While eBay, Comcast, and AOL are high on the list now, that is only because of the massive amounts of capex they can dump on the problem. Imagine the uptime stats if we had been measuring back in ’99.

Tony Li

As with any bit of engineering, the old saw: “Good, Fast, Cheap: choose two” applies.

Since folks are almost always going to select based on price and performance rather than reliability, the bulk of the market is driven away from reliability.

James Byers

“Is 99.999 percent uptime too much to ask?”

In the context of this discussion, I’d say yes. Everything is a tradeoff, and the sacrifices made for those nines — especially after the third — are tremendous.

Let’s take Facebook as an example. This is a site that has undergone staggering traffic growth over the last few years. At the same time, they’ve radically changed the features and scope of the site, not the least of which was building in a third-party application platform(!). I don’t believe it’s possible for them to innovate at the pace they have while maintaining a perfect reliability record. The cost in people, servers, process, and culture to go from four to five nines — saving just 47 minutes a year — seems to me not the best use of the company’s resources.

Somewhere down the road, when a company achieves stability in its market and is sitting on huge cash reserves — Yahoo, eBay, AOL, Comcast — then this investment may make sense. Certainly it makes sense when lives are on the line, say for aviation and phone systems. But I don’t think we should measure Internet companies against the nines without considering what they’ll trade for them.

Robert Accettura

It’s a hard metric to even measure.

The sites mentioned are large and on CDNs (self-made or services they pay for), which means they are served out of multiple data centers on multiple networks.

Pingdom is run out of multiple data centers as well.

For Pingdom’s service to have no margin of error, it would need 100% uptime (and 100% uptime from its upstream network providers). Otherwise, a problem on either end counts as “downtime”.

The big question is where the downtime is taking place, and how widespread it is. Often it’s on the network level and results in certain locations being unable to access a site for a period of time… is that up or down? How slow is “down”?

Uptime will slowly improve, but hopefully the metric will become a little more black and white. It was designed for a time when a site ran on one server off an ISDN line, not for today’s complex hosting.
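For illustration only, one way a multi-location monitor could call “up” vs. “down” is a quorum across probe locations (not what Pingdom actually does, just a sketch of the idea):

```python
# A minimal sketch: poll a site from several probe locations and only count an
# interval as downtime when a majority of probes fail, so a single probe's own
# network trouble is not charged against the site being measured.
from urllib.error import URLError
from urllib.request import urlopen

def probe(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a non-error status within the timeout."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (URLError, OSError):
        return False

def site_is_up(probe_results: list[bool]) -> bool:
    """Majority quorum across probe locations."""
    return sum(probe_results) > len(probe_results) / 2
```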

Allan Leinwand

@Chris – true enough about page load times, but they have gotten much better with the proliferation of broadband. Still, you’re right that some applications, social networks in particular, seem to be having some page load time issues. And, of course, uptime issues as well…

@PJ – if you are building a Web 1.0 or Web 2.0 service and don’t think it is worth the trouble and cost of delivering high reliability, then I suspect that will ultimately show in your website performance, uptime and user adoption. Who spends time, money and sweat on building something that is not worth the trouble of making reliable?

PJ

Why is 5 the right number of nines?

When building a system, how do you decide between 4, 5, or 6 nines? It seems like a cost/benefit tradeoff. Very few services are actually worth the trouble and cost of “high” nines solutions.

Chris Albinson

Allan – good perspective on the space. The focus has been on site load time and growing page views vs. uptime. Load times are a big problem for social networks when they open up their platforms. This is the fire drill that they must fix or face the Friendster near-death experience!
