Blog Post

Why we should care about — but not overreact to — cloud glitches

The public nature of public cloud means every hiccup — to the extent it gets disclosed — gets scrutinized. And rightly so. Last week’s issue with Microsoft Windows Azure was no exception. While the service degradation affected customers’ ability to move new code from staging to production, it did not affect end users’ experience. But the fact that the issue affected so many regions raised eyebrows. And again, it should have.

At 7:35 p.m. PDT, Microsoft’s status page acknowledged that services out of “North Central US, South Central US, North Europe, Southeast Asia, West Europe, East Asia” were impacted; Microsoft sounded the all-clear at 10:45 a.m. PDT.

Outages, partial disruptions, and glitches are nothing new to the major cloud providers. While Microsoft, Amazon Web Services, and Google work overtime to convince the world that their respective cloud services are solid, none of them is immune to glitches, including Amazon’s US East woes and this Google App Engine snafu.

None of these events will reassure corporate IT pros considering a cloud move. But then again, most of those people should already know the pitfalls of running their own server rooms and data centers.

The difference between an Azure/AWS/Google outage and a glitch in a company’s own data center is that with public cloud, there’s a status page to check. (How forthcoming those pages are is subject to debate, but that’s another story.)

When my area was hit by two Comcast On Demand snafus over the past week, all we got was a cryptic error code. Trying to glean details beyond that was a fool’s errand.

Structure Show Podcast

Former Vertica CEO Chris Lynch sounds off on the East Coast vs. West Coast big data faceoff and the Hadoopalypse

More cloud news from around the web

From InfoWorld: Google Apps, once a leader, faces growing cloud app rivals

From TechWorld: Oracle moves aggressively to poach Business ByDesign customers

From NetworkWorld: Gartner: Cloud-based security as a service to take off

4 Responses to “Why we should care about — but not overreact to — cloud glitches”

  1. This is where effective cloud policy can play a pivotal role – indicating, for example, which workloads should be directed to which clouds given their business requirements for security, availability, etc. A central policy engine that controls workload placement based on performance parameters helps assure quality and service levels. Cloud management platforms with a central policy engine also offer operational metrics such as fine-grained visibility into costs, resource consumption, workload performance, and more.
    – Shawn Douglass, CTO, ServiceMesh

  2. It’s getting to the point where outages on the big cloud platforms are usually major events caused by a “perfect storm” of things. Given the level of investment, technology, and redundancy, even before you start to architect your system across multiple regions, it’s usually a huge cascade of problems that results in an outage. It’s much like the airline industry, where it’s never just one thing that goes wrong.

    This is one of the reasons to use those providers: they’re likely to be much more reliable than anything you could build in a single facility. But the key is to architect around failure, i.e., using different data centers, multiple zones, and so on.

    • And how much do we really know about failures in company data centers that we never hear about? I think the notion that a company’s own data center is more robust, secure, and bulletproof than a public cloud is based on incomplete data.

      • Agreed. That could largely be attributed to the warm, fuzzy feeling of owning and controlling it, but perhaps also because cloud providers rarely go into much detail about how everything works under the hood. Indeed, outage post-mortems are one of the rare times you get intimate details about how things actually work.