
Danny Kolke, Founder and CTO of Etelos, maker of the Etelos Marketplace and the Etelos Platform, is a thought leader in the areas of software as a service and Web-based applications. Danny works with developers and businesses alike building and distributing Web-based software as a service. Known for his honest assessments and sense of humor, Danny is a regular speaker on SaaS, especially its challenges and opportunities.

For the past nine years, I have spent my energy delivering services to users of web-based applications. In that time, I have heard many different marketing messages targeted toward business users, some of which I react to more negatively than others. One of the most deceptive promises I have heard is delivering “five nines,” or 99.999 percent uptime.

In a calendar year of 525,600 minutes, 99.999 percent uptime means that your services would be interrupted for no more than five minutes and 15.36 seconds. Does this mean that for the other 525,594 minutes and 45 seconds your service is available? I guess it depends on how you define available.
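To make the arithmetic concrete, here is a minimal Python sketch of the downtime allowance for a few uptime targets. The function names are my own, chosen for illustration, and the only assumption is the 525,600-minute calendar year used above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, the calendar year used in the text


def downtime_budget_seconds(uptime_percent, minutes=MINUTES_PER_YEAR):
    """Seconds of downtime allowed in the period for a given uptime percentage."""
    return minutes * 60 * (100 - uptime_percent) / 100


def format_budget(seconds):
    """Render a number of seconds as whole minutes plus remaining seconds."""
    whole_minutes, remainder = divmod(seconds, 60)
    return f"{int(whole_minutes)} min {remainder:.2f} s"


if __name__ == "__main__":
    # Compare three, four and five nines over one calendar year.
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_seconds(target)
        print(f"{target}% uptime allows {format_budget(budget)} of downtime per year")
```

Each extra nine cuts the allowance by a factor of ten, which is why the jump from four nines to five is so much harder than it sounds.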

Pick your favorite Web site or Web application. If that service has been interrupted for more than 5 minutes, 15 seconds in the last 12 months, then it doesn’t have true 99.999 percent uptime. In my opinion, when you can’t get to or use a Web application, it’s not up. Regardless of the reason.

We have recently seen Amazon.com, Amazon’s EC2, Google’s Google Apps, Salesforce, Twitter and others struggle with outages of various sizes and causes. Amazon’s recent outage lasted for more than an hour, and Twitter seemed to make news whenever it was actually available.

Amazon’s outage was enough to kill 99.999 percent as an average uptime for the next 10 years.
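A rough check of that claim, as a sketch: the outage lasted "more than an hour," so I assume 60 minutes as a lower bound. The function name is hypothetical.

```python
MINUTES_PER_YEAR = 525_600


def years_of_budget_consumed(outage_minutes, uptime_percent=99.999):
    """How many years' worth of downtime allowance a single outage uses up."""
    yearly_budget_minutes = MINUTES_PER_YEAR * (100 - uptime_percent) / 100
    return outage_minutes / yearly_budget_minutes


if __name__ == "__main__":
    # Assumption: 60 minutes, the lower bound of "more than an hour".
    years = years_of_budget_consumed(60)
    print(f"A 60-minute outage uses about {years:.1f} years of five-nines budget")
```

At roughly 5.26 minutes of allowance per year, even the 60-minute lower bound works out to more than eleven years of budget.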

A true “five nines” where services are always available (only five minutes and 15 seconds of downtime over 365 days) is an enormous expense to pursue. And with the most well funded sites on the planet failing to deliver “five nines,” is this possible or even worth pursuing?

For those die-hards who look at this and laugh because you “…have machines that have gone years without a reboot…” I have to ask, how do you calculate uptime?

A server in our infrastructure has gone more than three years without a reboot. But, while this server has remained up, it has not always been available. Recently, planned maintenance interrupted access to this machine for one hour. And even though the machine was up, it was not accessible and it might as well have been off.

Thousands of existing accounts experienced no loss of service because their applications were not dependent on the corporate Web site. Technically, a disruption of service occurred, but it did not affect our existing customers, so it would not generally be counted in an uptime calculation. In addition, planned outages that disrupt service are usually not considered outages. However, for a business user, “off-hours” planned outages eat away at available working hours in your application. 1 a.m. Monday in Cupertino is 9 a.m. in London. When an application is not available, that’s downtime.
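The gap between the two ways of counting can be sketched in a few lines of Python. This is an illustration only: the `Outage` record and the `count_planned` flag are names I made up, and the single one-hour planned maintenance window is the example from above.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 525_600


@dataclass
class Outage:
    minutes: float
    planned: bool


def availability_percent(outages, count_planned):
    """Yearly availability, optionally ignoring planned maintenance windows."""
    down = sum(o.minutes for o in outages if count_planned or not o.planned)
    return 100 * (1 - down / MINUTES_PER_YEAR)


if __name__ == "__main__":
    # One hour of planned maintenance and nothing else all year.
    outages = [Outage(minutes=60, planned=True)]
    reported = availability_percent(outages, count_planned=False)
    experienced = availability_percent(outages, count_planned=True)
    print(f"Excluding planned outages: {reported:.4f}%")
    print(f"Counting every interruption: {experienced:.4f}%")
```

The same year looks perfect under the first definition and falls short of even four nines under the second.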

Another area that may not be considered an outage is slow performance. If a user experiences 15-second page loads 21 times, this adds up to 5 minutes and 15 seconds, which uses up essentially the entire annual allowance for 99.999 percent uptime. Okay, maybe I am being unfair with the 15-second calculation, but what if you went to a Web site that took one minute to load a page? Six of those and 99.999 percent uptime is toast. Do you call it up or down?

How about this example: My MacBook Pro can take three minutes and 30 seconds to reboot. Two reboots a year and I have consumed more allotted time than 99.999 percent uptime allows. How about an update that takes 5 to 10 minutes to install? On a machine-by-machine basis, maintenance code updates and services need to be performed. This means that either the machines are running old code, or 99.999 percent uptime is simply unrealistic. I suggest that promising “five nines” is a marketing tactic that is virtually impossible to ensure.
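Here is a small sketch that checks the scenarios from the last two paragraphs against the five-nines allowance. It assumes every second of waiting counts as downtime, which is exactly the definition under debate, and the names are mine.

```python
# Five nines allows 1 part in 100,000 of the year to be downtime.
BUDGET_SECONDS = 525_600 * 60 / 100_000  # 315.36 seconds per year

# Scenarios from the text, expressed in seconds of lost time.
SCENARIOS = {
    "21 page loads of 15 seconds": 21 * 15,
    "6 page loads of one minute": 6 * 60,
    "2 reboots of 3 minutes 30 seconds": 2 * 210,
}


def remaining_budget(used_seconds):
    """Seconds of allowance left; negative means the target is already missed."""
    return BUDGET_SECONDS - used_seconds


if __name__ == "__main__":
    for label, used in SCENARIOS.items():
        left = remaining_budget(used)
        verdict = "within budget" if left >= 0 else "over budget"
        print(f"{label}: {used} s used, {left:+.2f} s left ({verdict})")
```

The slow page loads leave less than half a second of slack for the rest of the year, and the other two scenarios blow through the allowance on their own.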

I think you can see that, however you calculate it, if the goal is to provide service for your users 99.999 percent of the time, it’s virtually impossible, and even the biggest (and some may argue the best) infrastructures are struggling with it.

A true “five nines” promise is an enormous expense to pursue. And, with some of the largest sites on the planet failing to deliver, is it even possible or worth pursuing?

  1. There are many ways to shrink outage durations, no matter whether the outage was scheduled (planned) or unscheduled (whoops…!). It takes a study of the time consumed in each phase of the outage. Shrink each phase, or any phase, and outage duration is reduced, uptime is increased, and we are … closer … to all those nines…

    When it comes to unscheduled outages, one must debug and resolve the problem that caused that outage, rapidly. It is a lost art, perhaps, but it can be regained (yes, I can help). Problems DO OCCUR! Expect them! Use pre-existing system features to capture root-cause information, and obtain products that supplement that. But gee whiz, problems do occur, you know. EXPECT THEM!

  2. Totally agree. Amazon’s recent cloud services outages proved that even for an organisation of their scale, achieving five nines is simply impossible. Add to that the fact that the typical end user on a web browser experiences perhaps three nines, and you can see the inanity of the many companies that actually do claim to achieve five nines. Strangely, none offer audited proof of their achievements!
    What is more important is application design that attempts to address outages. If your system effectively empties the user’s shopping cart all over the floor when there is a stoppage, then you have turned the duration of the outage as perceived by the user from a two-minute one into a ten-minute one, and infuriated them to boot as they re-key all of their data.

  3. I agree and disagree. I agree that 99.999 percent uptime is not possible since it is not measurable. I disagree that simply because Amazon is not perfect, a given service system cannot be perfect. I also believe that perfection is cheaper than imperfection. One need only look at the total cost of a system failure in engineering time, customer time and loss of reputation.

    It is far too easy to say zero failures is not possible. When was the last time you rebooted your HP calculator? Or the last time the GPS network was down?

  4. Of course five nines is possible for a web service! It may not be possible for a single provider, however. I think the future of Cloud Computing will be in multi-vendor deployments, where for example a web application can be distributed between Amazon’s Cloud, Google’s App Engine and other Cloud Platforms (plug Hosting365’s Cloud in Dublin, Ireland here!) and therefore not reliant on the uptime of any single supplier.

    The issue then becomes the application itself: if it is patched and a problem arises across the multiple sites, then the site is down. I think there are simple ways to mitigate this, though: always have a second site running a stable version of the application that you can fail over to in the event of an update crashing it.

  5. Build the website with Erlang. With Erlang it is possible to build services with five nines, including planned maintenance.

  6. Five nines is not only very hard to implement but also very expensive. I’ve also written an article about five nines on my blog.
