
When it comes to the operations of Internet businesses, 99.999 percent uptime, or five nines, is one of the critical metrics of reliability. Yet that metric — essentially the ability to say that users will reliably be able to reach a business’ web site 99.999 percent of the time — still eludes nearly all of them.

99.999 percent uptime for a web site equates to just 5.26 minutes of downtime per year. That is the total amount of downtime — planned or unplanned — as seen by users. According to a report last month by Pingdom, only three of the top 20 most popular web sites achieved this metric in 2007: Yahoo, AOL and Comcast’s site for high-speed Internet customers (eBay’s site was close, with only six minutes of downtime in 2007). Another report by Pingdom shows that most of the popular social networks did not achieve even three nines (no more than 525.6 minutes of downtime) in the first four months of 2008. Moreover, none came anywhere close to 99.999 percent uptime.
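The arithmetic behind these figures is simple: each additional nine cuts the allowed downtime budget by a factor of ten. A quick back-of-the-envelope check, sketched in Python (using a 365-day year, which is what the figures above imply):

```python
# Downtime budget implied by an availability target ("nines").
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes(nines: int) -> float:
    """Allowed downtime per year for an uptime target of N nines,
    e.g. 5 nines = 99.999 percent uptime."""
    return MINUTES_PER_YEAR * 10 ** -nines

for n in range(2, 6):
    print(f"{n} nines: {downtime_minutes(n):8.2f} minutes/year")

# 2 nines:  5256.00 minutes/year (~3.7 days)
# 3 nines:   525.60 minutes/year (~8.8 hours)
# 4 nines:    52.56 minutes/year
# 5 nines:     5.26 minutes/year
```

Note that the step from four to five nines buys only about 47 minutes a year, a point that comes up in the comments below.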

When I ask people about web sites that are down more than they would like, the most common response I hear is that the web site is a nice-to-have feature of their lives and not a critical element that they absolutely rely upon. If one search engine is down, you can use another. If a social network is down, then there are other ways of reaching your friends.

In other words, web sites are seen as unreliable, a perception that drives down user adoption, increases churn, reduces page views, limits ad impressions and increases abandoned shopping carts. I believe the converse is also true — highly reliable web sites enjoy high user loyalty and return rates, lower churn, more page views and advertising revenue, and more sales.

While some may dispute the accuracy of Pingdom’s measurements, unreliability is clearly not a trait that any person or web site strives for. An unreliable person is not someone you invite home to meet your parents, nor someone you want working on your critical business project. When interviewing candidates for jobs at my portfolio companies, I have never heard anyone cite unreliability or frequent absence as a redeeming quality — so why is that same quality so visible on web sites across the Internet?

It’s always better to be known as the reliable person in any situation, whether for personal or business reasons. I wish that the web sites I use frequently in both aspects of my life had the same redeeming quality. Is 99.999 percent uptime too much to ask?

By Allan Leinwand


  1. Allan – good perspective on the space. The focus has been on site load time and growing page views vs. uptime. Load times are a big problem for social networks when they open up their platforms. This is the fire drill that they must fix or face the Friendster near-death experience!

  2. Why is 5 the right number of nines?

    When building a system, how do you decide between 4, 5, or 6 nines? It seems like a cost/benefit tradeoff. Very few services are actually worth the trouble and cost of “high” nines solutions.

  3. @Chris – true enough about page load times, but it has gotten much better with the proliferation of broadband. Still, you’re right that some applications, social networks in particular, seem to be having some page load time issues. And, of course, uptime issues as well….

    @PJ – if you are building a Web1.0 or Web2.0 service and don’t think it is worth the trouble and cost of delivering high reliability then I suspect that will ultimately show in your website performance, uptime and user adoption. Who spends time, money and sweat on building something that is not worth the trouble of making it reliable?

  4. It’s a hard metric to even measure.

    The sites mentioned are large, and on CDNs (self-made or services they pay for), which means they are served out of multiple data centers on multiple networks.

    Pingdom is run out of multiple data centers as well.

    For Pingdom’s measurements to have no margin of error, Pingdom itself would need 100% uptime (and so would its upstream network providers). Otherwise a problem on either end registers as “downtime”.

    The big question is where the downtime is taking place, and how widespread it is. Often it’s at the network level and results in certain locations being unable to access a site for a period of time… is that up or down? How slow is “down”? (A sketch of this multi-location measurement problem follows the comments.)

    Uptime will slowly improve, but hopefully the metric will become a little more black and white. It was designed for a time when a site ran on one server off an ISDN line, not for today’s complex hosting.

  5. “Is 99.999 percent uptime too much to ask?”

    In the context of this discussion, I’d say yes. Everything is a tradeoff, and the sacrifices made for those nines — especially after the third — are tremendous.

    Let’s take Facebook as an example. This is a site that has undergone staggering growth in traffic over the last few years. At the same time, they’ve radically changed the features and scope of the site, not the least of which was building in a third-party application platform(!). I don’t believe it’s possible for them to innovate at the pace they have while maintaining a perfect reliability record. The cost in people, servers, process, and culture to go from four to five nines — saving just 47 minutes a year — seems to me not the best use of the company’s resources.

    Somewhere down the road, when a company achieves stability in its market and is sitting on huge cash reserves — Yahoo, eBay, AOL, Comcast — then this investment may make sense. Certainly it makes sense when lives are on the line, say for aviation and phone systems. But I don’t think we should measure Internet companies against the nines without considering what they’ll trade for them.

  6. [...] in getting stuck into organisations who provide services over the web for their lack of up-time. Allan Leinwand at GigaOM is too, only he’s found some empirical evidence.  (This is the actual service [...]

  7. As with any bit of engineering, the old saw: “Good, Fast, Cheap: choose two” applies.

    Since folks are almost always going to select based on price and performance rather than reliability, the bulk of the market is driven away from reliability.

  8. Ahh RFC 1925, words to live by.

    As an ops person intimately familiar with uptime, I can tell you: you get what you pay for. Development speed always comes at the expense of reliability. Nine times out of ten, an app or site is not built with scalability or reliability in mind. As it grows, you are forced to re-architect and scale each piece of infrastructure (read: replace it) within the confines of a startup budget, with as little downtime as possible. The faster the growth, the more difficult it becomes. Think of it like trying to change the engine in a car while moving 60 miles per hour: it is a hard, complex task, and sometimes you are going to crash. While eBay, Comcast, and AOL are high on the list now, that is only because of the massive amounts of capex they can dump on the problem. Imagine the uptime stats if they were measuring in ’99.

    @Robert Accettura – it’s been a long time since a web site was connected via an ISDN link, true enough. Today’s sites are complex and do use CDNs (which should improve uptime), but the increased complexity should deliver an increase in both functionality and reliability. More features and functions that are unreachable and unusable do not really help with page views, transaction rates or sell-through rates.

    @James Byers – Points well taken. I guess I’ll look for alternative sites to the ones I typically use until one proves more reliable than the other.

    @Tony Li – I’m just surprised that one of the criteria that you chose was “cheap” :) :)

    @NickH – I did not have RFC1925 in mind, but I see your reference to (7a) for Tony… I tend to think that (4) applies here as well ;)
    http://www.faqs.org/rfcs/rfc1925.html

  10. Investing in anything beyond three nines is a waste of time and money. Invest in meeting your users’ needs quickly, and they’ll be thrilled to forgive you the other 500 minutes a year. It’s exactly what AOL, Y!, and Comcast don’t do and why fewer users love them every day.

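To make the measurement problem in comment 4 concrete: a monitor probing from a single location cannot tell a site outage from a failure in its own network, so a service like Pingdom has to probe from several independent locations and only count downtime when enough of them agree. A minimal sketch of that quorum idea in Python (the function names and thresholds here are illustrative assumptions, not Pingdom’s actual method):

```python
# Illustrative sketch: count a site as "down" only when a quorum of
# independent probes fail, so one probe's own network trouble does not
# register as site downtime. In a real monitor each probe would run
# from a different data center; here they all run locally.
from urllib.request import urlopen
from urllib.error import URLError

def probe(url: str, timeout: float = 10.0) -> bool:
    """One check from one location. Any HTTP error, timeout or
    other network failure counts as a failed probe."""
    try:
        urlopen(url, timeout=timeout)
        return True
    except (URLError, OSError):
        return False

def site_is_down(url: str, locations: int = 5, quorum: int = 3) -> bool:
    """Declare downtime only if at least `quorum` of `locations` probes fail."""
    failures = sum(1 for _ in range(locations) if not probe(url))
    return failures >= quorum

if __name__ == "__main__":
    print(site_is_down("http://www.example.com/"))
```

Even with a quorum, the comment’s other questions remain open: a site that answers in 30 seconds, or only from some regions, is neither cleanly “up” nor cleanly “down.”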
