Microsoft just updated its explanation of what brought down Windows Azure in Europe for nearly two and a half hours last week.
In a blog post, Windows Azure General Manager Mike Neil said that when Microsoft added compute capacity to meet increased demand in its West Europe sub-region, it did not add enough network devices to handle the additional connections. Because of that imbalance between compute capacity and network devices, the “connection threshold was exceeded and that increased management traffic, [which] in turn, triggered bugs in some of the cluster’s hardware devices, causing them to reach 100 percent CPU utilization impacting data traffic,” Neil wrote.
Microsoft posted its first, limited explanation of the outage the day after it happened and promised another update in the coming week. Six days later, this post filled in more details.
Microsoft, Amazon, HP and other companies trying to win the trust and workloads of more business customers need to be transparent and forthcoming about operational issues as they arise. Microsoft, which is trying to push Azure as an enterprise-class public cloud that can compete with the Amazon EC2 behemoth, is under a microscope, as is Amazon itself after a couple of widely publicized outages this summer.