Putting outages into the proper perspective

Which is worse: experiencing a cloud outage or waiting to experience a cloud outage?

Last month, Azure storage services went down and caused customers in the U.S., Europe, and parts of Asia to suffer. As Gigaom’s Barb Darrow reported, several Azure users were not happy. Of course, a status page should have let users determine exactly what was going on, but Azure’s reported that everything was hunky-dory. Clearly, that was not the case.

Notice that the other public-cloud providers, namely AWS and Google, did not toss stones at Microsoft. Why? They could be next. Indeed, a power outage at an AWS data center in California brought down services for customers on Memorial Day. And don’t forget about Christmas Eve two years ago, when Netflix customers tried to watch their favorite holiday movie only to have their hopes crushed when the streaming service was brought down by an AWS employee error that affected the company’s U.S.-East region. I was a victim of that outage, unable to get my “Santa’s Buddies” fix on Christmas Eve.

More recently, according to Gigaom’s Barb Darrow:

Amazon Web Services’ Content Delivery Network (CDN) experienced some glitches on Thanksgiving Eve, according to various reports all citing the AWS status page. According to that page, users experienced “elevated error rates when making DNS queries for CloudFront distributions” between 4:12 p.m. and 6:02 p.m. PST on Wednesday, November 26.

So, what does this all mean? Pretty much nothing at this point.

In looking at the one-year status page at Cloud Harmony/Cloud Square, things seem pretty good. While there are some bad outage records, most compute clouds report more than 99.90 percent uptime. The larger cloud providers, including AWS, Google, and Microsoft, have experienced some outages in some regions but stay right up near 100 percent uptime, for the most part. There are just a few exceptions: the outages experienced last month, a few bad apples listed near the end of the report, and providers that don’t get much traction in the public-cloud market.

There will always be outages. Cloud providers can improve their best practices, but they can’t change the laws of physics. Networks will fail, backup power units won’t kick on, and hundreds of other things can go wrong in a cloud data center. However, in the larger picture, that does not seem to translate into much of an impact on service to the enterprise.

When it comes to the reliability of cloud services, the real metric to consider is how internal enterprise systems compare to cloud services. Measured against the number of outages internal systems normally experience, most cloud providers do better, even though the core reason many enterprises push back on public cloud-based systems is a fear of outages and downtime.

Statistics bring some reality to the “cloud versus traditional systems” comparison. Indeed, the average enterprise experiences three “business-interruption events” per year. The cost is around $110,000 per hour while the outage is occurring, with an average of five hours per interruption. That’s about $1.65 million per year in enterprise outage costs, and that’s in IT shops that are well managed.
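If you want to check that math or plug in your own numbers, here’s the calculation as a quick, illustrative Python sketch. The figures are the averages cited above; the variable names are just for illustration.

```python
# Back-of-the-envelope annual outage cost for internal enterprise systems,
# using the average figures cited above (illustrative, not measured data).
EVENTS_PER_YEAR = 3      # average business-interruption events per year
HOURS_PER_EVENT = 5      # average duration of each interruption, in hours
COST_PER_HOUR = 110_000  # business impact in dollars per outage hour

annual_cost = EVENTS_PER_YEAR * HOURS_PER_EVENT * COST_PER_HOUR
print(f"Internal systems: ${annual_cost:,} per year in outage costs")
# Internal systems: $1,650,000 per year in outage costs
```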

Now, consider what you would have experienced in the cloud over the past year. According to Cloud Harmony and my own experiences, the likelihood is that you won’t experience a public cloud outage at all. If you do, it’s likely to be very short in duration, and hopefully your provider won’t do what Microsoft did, but will instead make sure you get a heads-up.

Internal enterprise systems are another story. The uptime records for many enterprises are much worse than those of the average public-cloud provider, and the outages end up costing the enterprise $1 million-plus per year. Of course, the tendency is not to flog your enterprise IT guys for losing a network for a few hours, but it’s certainly okay to complain if an outside service, such as a public-cloud provider, takes an occasional dirt nap. That’s just human nature. If you’re looking at the impact to the business, you need to consider the true behavior of both traditional systems and those that run in public clouds.

Based upon the data, it would be reasonable to assume that most public-cloud providers will have at least one outage per year, and that the outage will last an average of two hours. Apply our business-impact metrics, including the $110,000-per-hour cost, and we’ll suffer $220,000 per year in outage costs (see Figure 1 and the sketch below). Again, that’s assuming there is an outage at all. Thus, the public cloud becomes a much better deal, beyond the operational cost savings and the ability to avoid capital expenses.

Figure 1: Even with many outages, public clouds are still a better bet when it comes to uptime, and avoiding the cost of outages.

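Here’s the cloud side of that back-of-the-envelope math, in the same illustrative style as the earlier sketch and assuming a single two-hour outage per year at the same hourly business impact:

```python
# Same back-of-the-envelope math for the public-cloud scenario described above:
# one two-hour outage per year at the same $110,000-per-hour business impact.
COST_PER_HOUR = 110_000

on_prem_cost = 3 * 5 * COST_PER_HOUR  # 3 events per year x 5 hours each
cloud_cost = 1 * 2 * COST_PER_HOUR    # 1 event per year x 2 hours

print(f"Public cloud: ${cloud_cost:,} per year in outage costs")  # $220,000
print(f"Difference:   ${on_prem_cost - cloud_cost:,} per year")   # $1,430,000
```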

Public-cloud providers seem to be getting better at managing their services, which includes avoiding outages. At the same time, enterprises continue to struggle with aging equipment, smaller budgets, and more challenging requests from the business. It’s no wonder internal systems have outages; indeed, I would expect there to be more.

The cloud is not a magic bullet for system uptime, and the last thing you want to do is leverage public clouds without a good reason. The value of public clouds for your enterprise should be understood in the context of the business, including capital-cost savings, operational costs, and the cost of outages. You can’t make these kinds of decisions around only one variable, such as outages. The models for whether to move specific applications, databases, or even entire enterprises to the cloud are always complex, with dozens and dozens of variables to consider. Moreover, they vary a great deal from enterprise to enterprise; some common patterns exist, but they are far from universal.

So do outages matter? Of course they do, and you should consider outage data when looking at public-cloud providers. However, for the most part public clouds provide you with much better uptime than internal systems, and that’s a fact that most will admit to these days.

As we move into 2015 and 2016, I suspect we’ll see about the same number of outages as public-cloud providers, including AWS, Microsoft, and Google, continue to expand. However, given that growing capacity, the service metrics will actually improve if providers keep outages at about the same number of occurrences or better.

There will be providers that struggle to maintain an uptime record inside a failing business. I suspect those providers will account for the majority of outages next year and in 2016, as the public-cloud market continues to normalize. The best path is to avoid them, even though doing so actually makes their situation worse. You don’t want to go down with those ships, trust me.

Outages make interesting news articles; I’ve written them myself. However, for the most part, they don’t represent a significant downside to leveraging public clouds.