Updated: Microsoft Azure went down Monday night and stayed down for hours. That news comes as Microsoft is trying to pitch two-year-old Azure as a safe and reliable platform for consumer and business applications.
As first reported by The Register, which cited Microsoft’s service page, Azure’s service management system started acting up just before 6 p.m. Pacific time, and the problem persisted for at least 12 hours. Microsoft’s Service Dashboard showed that two Access Control areas — in Northern Europe and South Central U.S. — were disrupted as of 10 a.m. Pacific time Tuesday.
In a statement, Microsoft said it became aware of an issue impacting Windows Azure service management in some regions at 5:45 p.m. It said it had deployed a fix that worked for most customers, but that three regions (North Central U.S., South Central U.S. and North Europe) remained affected while it continued to work on the issue. A spokeswoman referred users back to Microsoft’s service dashboard for updates.
Cloud outages happen, but they are never good. The fact that there was not a greater hue and cry about this snafu before Tuesday morning suggests to some observers that there are not that many Azure users out there. Amazon Web Services was disrupted for several days last April, and that outage generated considerable buzz.
Big disruptions like these could spook companies that are still wary of entrusting their work to off-site clouds. But, given the economics of cloud computing — which is generally less expensive than deploying servers and software in house — there will still be pressure to go to the cloud.
Update: The Azure meltdown was apparently sparked by a Leap Day error in software, according to a Wednesday afternoon blog post by Bill Laing, corporate VP of Server and Cloud for Microsoft. Laing wrote:
Yesterday, February 28th, 2012 at 5:45 PM PST Windows Azure operations became aware of an issue impacting the compute service in a number of regions. The issue was quickly triaged and it was determined to be caused by a software bug. While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year. Once we discovered the issue we immediately took steps to protect customer services that were already up and running, and began creating a fix for the issue. The fix was successfully deployed to most of the Windows Azure sub-regions and we restored Windows Azure service availability to the majority of our customers and services by 2:57AM PST, Feb 29th.
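Laing's post does not say which time calculation failed, so the following is only a generic illustration of how leap-day date arithmetic can go wrong, not a reconstruction of Microsoft's code. The function names are hypothetical. The sketch shows one common mistake, computing "the same date one year later" by simply incrementing the year, alongside one possible guard.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Naive approach: bump the year and keep the month and day.
    # This raises ValueError for Feb 29, because that date does not
    # exist in the following (non-leap) year.
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    # Hypothetical guard: if the target date does not exist,
    # fall back to Feb 28 of the following year.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    print(safe_one_year_later(leap_day))
    try:
        naive_one_year_later(leap_day)
    except ValueError as err:
        print("naive calculation failed:", err)
```

Whether to fall back to Feb. 28 or roll forward to March 1 is a policy choice. The point is that code which works on 365 days of a typical year can fail on the one day it was never tested against.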