
Summary:

Azure, Microsoft’s platform-as-a-service cloud, went down Monday night and stayed down for at least 10 hours. The news comes as Microsoft is trying to pitch two-year-old Azure as a safe and reliable platform for consumer and business applications.


Updated: Microsoft Azure went down Monday night and stayed down for hours. That news comes as Microsoft is trying to pitch two-year-old Azure as a safe and reliable platform for consumer and business applications.

As first reported by The Register, which cited Microsoft’s service page, Azure’s service management system started acting up just before 6 p.m. Pacific time, and the problem persisted for at least 12 hours. Microsoft’s Service Dashboard showed that two Access Control areas, in Northern Europe and South Central U.S., were disrupted as of 10 a.m. Pacific time Tuesday.

In a statement, Microsoft said it became aware of an issue impacting Windows Azure service management in some regions at 5:45 p.m. The company said it deployed a fix that worked for most customers, but that three regions (North Central U.S., South Central U.S. and North Europe) remain affected while Microsoft continues to work on the issue. A spokeswoman referred users back to Microsoft’s service dashboard for updates.

Cloud outages happen, but they are never good. The fact that there was not a greater hue-and-cry about this snafu earlier than Tuesday morning indicates to some observers that there are not that many Azure users out there. Amazon Web Services was disrupted for several days last April, and that gap generated considerable buzz.

Big disruptions like these could spook companies that are still wary of entrusting their work to off-site clouds. But, given the economics of cloud computing — which is generally less expensive than deploying servers and software in house — there will still be pressure to go to the cloud.

Update: The Azure meltdown was apparently sparked by a Leap Day error in software, according to a Wednesday afternoon blog post by Bill Laing, corporate VP of Server and Cloud for Microsoft. Laing wrote:

Yesterday, February 28th, 2012 at 5:45 PM PST Windows Azure operations became aware of an issue impacting the compute service in a number of regions.  The issue was quickly triaged and it was determined to be caused by a software bug.  While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year.   Once we discovered the issue we immediately took steps to protect customer services that were already up and running, and began creating a fix for the issue.  The fix was successfully deployed to most of the Windows Azure sub-regions and we restored Windows Azure service availability to the majority of our customers and services by 2:57AM PST, Feb 29th.
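Laing's post does not say what the faulty time calculation was. As a purely illustrative sketch, and not a description of Microsoft's code, the Python below shows one common way a leap-year calculation goes wrong: adding one to the year of a Feb. 29 date yields a day that does not exist. The function names are hypothetical.

```python
import calendar
from datetime import date


def naive_one_year_later(d: date) -> date:
    """Add one to the year and keep month/day unchanged.

    Raises ValueError when d is Feb. 29, because the following
    year has no Feb. 29.
    """
    return date(d.year + 1, d.month, d.day)


def safe_one_year_later(d: date) -> date:
    """Add one year, clamping the day to the last valid day of the month."""
    target_year = d.year + 1
    # monthrange returns (weekday of first day, number of days in month)
    last_day = calendar.monthrange(target_year, d.month)[1]
    return date(target_year, d.month, min(d.day, last_day))


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        naive_one_year_later(leap_day)
    except ValueError as err:
        print("naive calculation failed:", err)
    print("clamped result:", safe_one_year_later(leap_day))
```

Clamping to Feb. 28 is only one possible policy; rolling forward to March 1 is another, and the right choice depends on what the date is used for.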

Photo courtesy of ToddABishop.


  1. Who is using Azure as the core of their online presence? Anyone big?

    Their website shows some large companies using it for odds and ends, but I’ve heard that even Microsoft isn’t using it for their web properties. Where is the evidence that Azure is more than a multi-billion-dollar press release?

    1. I used to work for Microsoft, and certain components of MSDN and TechNet (where I worked) absolutely run in Azure. And keep in mind that internally, it’s like any other company. You don’t start just rebuilding stuff and incurring expense just because.

      Generally speaking, it’s a fairly robust platform, and I’d trust my own business on it. In the two years I’ve been familiar with it, this is the first problem that I can recall. As was the case with Amazon, I’m sure there’s a lot that they’ll learn from it, and be able to prevent future problems.

    2. We have dozens of Fortune 500, Government and Enterprise customers who are using our solution with Azure Storage. Many of these customers are some of the best known brands in the world. Most of these companies do not publicly disclose the technologies they use. You can see some of our public case studies at http://www.storsimple.com/cloudsuccess

      Microsoft Azure BLOB Storage Service was not impacted by this set of events and so none of our customers using Azure were impacted. Even last April during the Amazon outage the Amazon Simple Storage Service (S3) was not impacted either.

  2. economy of scale is one of the premises of cloud, but scale also brings a certain kind of systemic vulnerability.

  3. Paddy Srinivasan Wednesday, February 29, 2012

    As with most public cloud systems, Azure also experienced its day in the dark. Having spent the last 3+ years in this ecosystem, I can say that most Azure users are in the long tail: startups, upstart efforts from large enterprises, etc. This outage went under the radar for the most part because of the long-tail nature of the user base.
    As with most apps deploying to public clouds, a good monitoring and auto-healing system is critical to ensure timely notification and recovery.
    We have outlined that in our blog here http://www.opstera.com/blog/

  4. Who is using Azure as the core of their online presence? Anyone big?

    1. Yep – crickets. Spin replies, but nothing that answers the question directly.

  5. What sort of geek-incompetents don’t make allowance for Leap Day? Idiots.

  6. What sort of geek-incompetents don’t make allowance for Leap Day? Idiots.

    1. thank you for reminding me to update this story!
