
Summary:

Microsoft likes to tout the fact that it runs Windows Azure at data centers worldwide. Yesterday compute instances across most of those regions were disrupted.


Microsoft Windows Azure had a bad day Wednesday, with compute capability impaired worldwide, as first reported by The Register. According to the Azure status page, compute service was partially disrupted across nearly all regions.

At 2:35 a.m. UTC (7:35 p.m. PDT) the company said:

We are experiencing an issue with Compute in North Central US, South Central US, North Europe, Southeast Asia, West Europe, East Asia, East US and West US. We are actively investigating this issue and assessing its impact to our customers. Further updates will be published to keep you apprised of the impact. We apologize for any inconvenience this causes our customers.

A flurry of updates followed throughout the day, and as of 10:45 a.m. UTC (3:45 a.m. PDT) on Thursday the status report declared that the problem had been addressed and that the company was “remediating” affected services.

Microsoft said the issue impacted the swap deployment feature of service management. As explained in PC World, Azure offers both a staging environment, which lets users test their systems, and a production environment, the two separated by the virtual IP addresses used to access them. Swap deployment operations are used to turn the staging environment into the production environment.

Update: What this meant was that while existing applications continued to work, new code could not be deployed to production until the swap deployment issue could be resolved.
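For readers unfamiliar with the mechanics, here is a rough sketch of the kind of request that was failing. It assumes the classic Service Management REST API’s swap deployment operation; the subscription ID, cloud service name, deployment names, API version string and certificate files are hypothetical placeholders, and the endpoint details are written from memory. Treat it as an illustration, not a reproduction of Microsoft’s tooling.

```python
# Illustrative sketch only: issuing a "swap deployment" (VIP swap) request
# against the classic Windows Azure Service Management API. The subscription
# ID, cloud service name, deployment names, API version and certificate files
# below are hypothetical placeholders.
import requests

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"   # placeholder
SERVICE_NAME = "example-cloud-service"                      # hypothetical
MGMT_CERT = ("management-cert.pem", "management-cert.key")  # hypothetical files

# The request body names the deployment currently in the production slot and
# the staging deployment that should take its place.
SWAP_BODY = """<?xml version="1.0" encoding="utf-8"?>
<Swap xmlns="http://schemas.microsoft.com/windowsazure">
  <Production>current-production-deployment</Production>
  <SourceDeployment>staging-deployment</SourceDeployment>
</Swap>"""


def request_vip_swap():
    """POST a swap deployment operation to the classic management endpoint."""
    url = (
        "https://management.core.windows.net/"
        f"{SUBSCRIPTION_ID}/services/hostedservices/{SERVICE_NAME}"
    )
    return requests.post(
        url,
        data=SWAP_BODY,
        cert=MGMT_CERT,  # the classic API authenticates with a management certificate
        headers={"x-ms-version": "2013-03-01", "Content-Type": "application/xml"},
    )


if __name__ == "__main__":
    # With real credentials, a healthy service acknowledges the swap request;
    # during the outage, requests like this one were the operations that failed.
    response = request_vip_swap()
    print(response.status_code)
```

Because a swap only redirects the virtual IP between slots, existing instances keep serving traffic while it happens, which is consistent with Microsoft’s statement that running applications were unaffected.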

Still, there’s no way to paint this in a good light. If compute services fail across regions, even companies that design workloads to withstand snafus in one region will be impacted. Folks point fingers at Amazon when its U.S. East facility has glitches, but AWS has not fallen down like this across multiple regions to my knowledge (correct me in the comments if I’m mistaken).


Other than its status page updates, Microsoft has not commented on the glitch — typically it takes some time for the company to issue explanations on its Windows Azure Blog.

This is the second major Microsoft cloud faux pas in the last year: Windows Azure storage was laid low by an expired security certificate in February. (An earlier outage, the so-called “leap year bug,” hit Azure in February 2012.)

Update: A Microsoft spokeswoman just emailed this update:

“As of 10:45AM PST, the partial interruption affecting Windows Azure Compute has been resolved. Running applications and compute functionality was unaffected throughout the interruption. Only the Swap Deployment operations were impacted for a small number of customers.  As a precaution, we advised customers to delay Swap Deployment operations until the issue was resolved.  All services to impacted accounts have been restored. For more information, please visit the Windows Azure Service Dashboard.”

Note: This story was updated at 7:51 a.m. PDT to reflect that existing applications continued to run but the issue prevented new code from being put into production; at 9:16 a.m. PDT to characterize the reference to an earlier failure; and again at 10:36 a.m. PDT with a statement from Microsoft.


  1. Compute Services have not failed. Only the Dashboard to swap staging and production. What was running continued to run; new code could not be moved to production.
    Not a good light, but a fairer view of what happened.

    Md

    1. thanks for your note. I’ve updated the post.

    2. I’m not even sure that’s true. From the sound of it, I don’t think there’s any reason code couldn’t be moved to production. People would have just modified their process slightly: either do a staging deploy to a new instance and then deploy to production, deploy straight to production, or just wait… In any case, this is hardly the huge problem the author paints it to be. This is kind of like the copy & paste functionality temporarily not working on your computer. You may have to type the text twice, but it’s not the end of the world.

  2. >>This is the second major Microsoft cloud faux pas in the last year: Windows Azure was laid low by a so-called “leap year bug” in February.

    That was in 2012, last year. The February 2013 outage was related to expired certificates in the storage service.

  3. How is a February 2012 issue actually used as “last year”? That is more than 20 months ago.

    1. you are quite right. I mixed up the leap day issue with the failed security certificate issue, which happened in February 2013. Updated the story. thanks for the notes.

  4. – What this meant was that while existing applications continued to work, new code could not be deployed to production until the swap deployment issue could be resolved.

    VIP Swap is only one way to do an upgrade to production. The other method, in-place upgrade, was unaffected.

  5. Just a jolly good article pointing out how MS failed miserably in multiple regions. I myself am learning to be more fair towards the tech giants like MS, but this article just falls on its face in even trying to be honest: it still says “This is the second major Microsoft cloud faux pas in the last year: Windows Azure storage was laid low by a so-called “leap year bug” in an expired security certificate February.” Which vaguely sounds like two failures within a year. A reputable tech site like GO should definitely be more honest!

    1. story updated to reflect that the earlier (february) problem was due to a lapsed security certificate; the leap year issue was feb. 2012. I would argue that two major issues in CY 2012 are worth reporting.

  6. This is link-baiting nonsense and hyperbole, and unfortunately seems to be par for your “reporting” and a disservice to GigaOM. Nothing was down in a way that affected end users of the systems running on Azure. The real net effect was that deploying and swapping the bits was broken at the administration level, which at worst caused a temporary inconvenience for customers who wanted to do the swap at that time.

    If you’re going to shout fire in a theater, at least understand what’s really going on.

    1. Nobody shouting fire here.

    2. Jeff,

      I think the primary concern is that Azure has had its second multi-region failure. Regardless of the severity of the issue, cross-region failures are a sign that something is very wrong with Microsoft’s operational practices. Besides global layer-three routing failures, it should be essentially impossible for a misstep in one region to affect another.

      I also disagree that the inability to deploy code is a minor issue.

  7. LOL at all the Microsoft shills quibbling with the very valid points raised in the article. But then again, Microsoft isn’t exactly synonymous with transparency, so why should we expect a full account of what happened?

    1. I guess you think that AWS, Google, or others are more transparent with their failures. They will only reveal whatever is absolutely necessary to contain the fire and nothing more. MS has not been the most open tech company out there, for sure, but even those playing the “open” card to lure fanboys are only doing it in a half-decent way.

      Look at how open Android really is, and why Amazon had to move mountains to get their Fire tablets manufactured by some no-name device manufacturer, ended up replicating major parts of the Android (as released by Google) code base, and played catch-up to get developers on board. Short conclusion: everyone is working for their own interest.

      I am just saying people should be more honest when blaming MS and quit the traditional “throw shit at MS, they will take it!” approach.

  8. “AWS has not fallen down like this across multiple regions to my knowledge (correct me in comments if I’m mistaken.)”

    Technically correct, since you say “regions”, but Amazon has failed across *Availability Zones*, which are supposed to be the firewall boundaries that contain failures.
    April 2011: …the EBS control plane had no ability to service API requests and began to fail API requests for other Availability Zones in that Region as well.

    http://aws.amazon.com/message/65648/

    1. Yep, exactly the comment I wanted to type!

      Amazon promised that their failure domain is Availability Zones, and most companies used to rely on this, deploying to multiple Availability Zones in a single region to avoid slow cross-region traffic.

      So technically correct, but actually very misleading.
