Five Nines is Still Not Enough

Danny Kolke, Founder and CTO of Etelos, maker of the Etelos Marketplace and the Etelos Platform, is a thought leader in the areas of software as a service and Web-based applications. Danny works with developers and businesses alike building and distributing Web-based software as a service. Known for his honest assessments and sense of humor, Danny is a regular speaker on SaaS, especially its challenges and opportunities.

For the past nine years, I have spent my energy delivering services to users of web-based applications. In that time, I have heard many different marketing messages targeted toward business users, some of which I react to more negatively than others. One of the most deceptive promises I have heard is delivering “five nines,” or 99.999 percent uptime.

In a calendar year of 525,600 minutes, 99.999 percent uptime means that your services would be interrupted for no more than five minutes and 15.36 seconds. Does this mean that for the other 525,594 minutes and 45 seconds your service is available? I guess it depends on how you define available.
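For readers who want to check the arithmetic, here is a minimal sketch in Python. The function name `downtime_budget_seconds` and the constant `MINUTES_PER_YEAR` are my own labels for illustration, and the calculation assumes the same 365-day, 525,600-minute year used above.

```python
from fractions import Fraction

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, a non-leap year


def downtime_budget_seconds(uptime_percent: str,
                            period_minutes: int = MINUTES_PER_YEAR) -> Fraction:
    """Return the downtime, in seconds, that a given uptime percentage allows."""
    # Fraction parses the decimal string exactly, so no float rounding creeps in.
    allowed_fraction = 1 - Fraction(uptime_percent) / 100
    return allowed_fraction * period_minutes * 60


budget = downtime_budget_seconds("99.999")
minutes, seconds = divmod(budget, 60)
print(f"{minutes} min {float(seconds):.2f} s")  # 5 min 15.36 s
```

Passing "99.9" or "99.99" instead shows how quickly the allowance shrinks with each added nine.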

Pick your favorite Web site or Web application. If that service has been interrupted for more than 5 minutes, 15 seconds in the last 12 months, then it doesn’t have true 99.999 percent uptime. In my opinion, when you can’t get to or use a Web application, it’s not up. Regardless of the reason.

We have recently seen Amazon.com, Amazon’s EC2, Google’s Google Apps, Salesforce, Twitter and others struggle with outages of various sizes and causes. Amazon’s recent outage lasted for more than an hour, and Twitter seemed to make news whenever it was actually available.

Amazon’s outage was enough to kill 99.999 percent as an average uptime for the next 10 years.
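The 10-year figure can be checked with a short sketch. It assumes a 60-minute outage as a lower bound, since the outage lasted "more than an hour," and the helper name `years_of_budget_consumed` is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, a non-leap year


def years_of_budget_consumed(outage_minutes: float,
                             uptime_percent: float = 99.999) -> float:
    """Return how many years of downtime allowance a single outage uses up."""
    yearly_budget_minutes = MINUTES_PER_YEAR * (100 - uptime_percent) / 100
    return outage_minutes / yearly_budget_minutes


# Assumed lower bound of 60 minutes for the outage described above.
print(round(years_of_budget_consumed(60), 1))  # 11.4
```

At roughly 5.26 minutes of allowance per year, a single hour of downtime uses up more than 11 years of it.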

A true “five nines” where services are always available (only five minutes and 15 seconds of downtime over 365 days) is an enormous expense to pursue. And with the most well funded sites on the planet failing to deliver “five nines,” is this possible or even worth pursuing?

For those die-hards who look at this and laugh because you “…have machines that have gone years without a reboot…” I have to ask, how do you calculate uptime?

A server in our infrastructure has gone more than three years without a reboot. But, while this server has remained up, it has not always been available. Recently, planned maintenance interrupted access to this machine for one hour. And even though the machine was up, it was not accessible and it might as well have been off.

Thousands of existing accounts experienced no loss of service because their applications were not dependent on the corporate Web site. Technically, a disruption of service occurred, but because it did not affect our existing customers, it would not generally be counted in an uptime calculation. In addition, planned outages that disrupt service are usually not considered outages. However, for a business user, “off-hours” planned outages eat away at the hours in which the application can be used: 1 a.m. Monday in Cupertino is 9 a.m. in London. When an application is not available, that’s downtime.
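To show how much the answer depends on the accounting, here is a small sketch that scores the same one-hour planned maintenance window both ways. The `Interruption` record and the `count_planned` switch are hypothetical names for illustration, not part of any real monitoring tool.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, a non-leap year


@dataclass
class Interruption:
    minutes: float
    planned: bool


def availability_percent(interruptions, count_planned: bool,
                         period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return availability as a percentage under one accounting convention."""
    # Planned interruptions only count as downtime when count_planned is True.
    down = sum(i.minutes for i in interruptions
               if count_planned or not i.planned)
    return 100 * (1 - down / period_minutes)


# One hour of planned maintenance, as in the example above.
events = [Interruption(minutes=60, planned=True)]
print(f"{availability_percent(events, count_planned=False):.3f}")  # 100.000
print(f"{availability_percent(events, count_planned=True):.3f}")   # 99.989
```

Same year, same server, same hour offline: one convention reports a perfect record, and the other falls short of "five nines."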

Another area that may not be considered an outage is slow performance. If a user experiences 15-second page loads 21 times, the waiting adds up to 5 minutes and 15 seconds, which uses up virtually the entire annual allowance for 99.999 percent uptime before any real outage occurs. Okay, maybe I am being unfair with the 15-second calculation, but what if you went to a Web site that took one minute for a page to load? Six of those and 99.999 percent uptime is toast. Do you call it up or down?
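If you do count waiting time as downtime, a rough sketch like the one below shows how few slow page loads it takes. It assumes the 315.36-second annual allowance for 99.999 percent uptime, and `loads_to_exceed_budget` is a name I made up for the example.

```python
import math

BUDGET_SECONDS = 315.36  # five-nines allowance for a 525,600-minute year


def loads_to_exceed_budget(load_seconds: float,
                           budget_seconds: float = BUDGET_SECONDS) -> int:
    """Smallest number of slow page loads whose total wait exceeds the budget."""
    return math.floor(budget_seconds / load_seconds) + 1


print(loads_to_exceed_budget(15))  # 22
print(loads_to_exceed_budget(60))  # 6
```

Twenty-one 15-second loads total 315 seconds, which is 0.36 seconds short of the allowance, so the 22nd load is the one that tips it over. With one-minute loads, the sixth does.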

How about this example: My MacBook Pro can take three minutes and 30 seconds to reboot. Two reboots a year and I have consumed more allotted time than 99.999 percent uptime allows. How about an update that takes 5 to 10 minutes to install? On a machine-by-machine basis, maintenance code updates and services need to be performed. This means that either the machines are running old code, or 99.999 percent uptime is simply unrealistic. I suggest that promising “five nines” is a marketing tactic that is virtually impossible to ensure.

I think you can see that, depending on how you calculate it, providing service to your users 99.999 percent of the time is virtually impossible, and even the biggest (and some may argue the best) infrastructures are struggling with it.

A true “five nines” promise is an enormous expense to pursue. And, with some of the largest sites on the planet failing to deliver, is it even possible or worth pursuing?
