Microsoft Azure Cloud Failure Actually Not a Leap Day Issue

In this day and age, there are varying views on open source adoption, especially in large-scale enterprise IT operations. While people still trust their large vendors for a variety of reasons, that line of thinking is very much outdated – quite like those vendors’ historical operations strategy: closed and proprietary. Although most big vendors are attempting to start anew with more open strategies, their narrow thinking has, until recent years, made them blind to potential disaster. Case in point: Microsoft’s total Azure cloud failure last week.

Last Wednesday, I started to see traffic indicating that Microsoft’s Azure cloud service was having issues. Indeed, the Microsoft Windows Azure Service dashboard was lit up with yellow dots showing that several regions had limited service availability. While talking this over with my colleague Chet, we tried to come up with a reasonable explanation for how a global outage could possibly happen. Could it be a rogue admin pushing an update to all the systems? Not likely, since that would require global access and I’m almost certain Microsoft wouldn’t allow that to happen. Could it be a DNS issue? It could be, but it would have to be a very serious problem to impact the global architecture so quickly. The only thing we could come up with was a software bug dealing with the leap year. So around 11am I tweeted my thought that it could be a leap year issue.

It wasn’t very hard to come to that conclusion, but committing to it was a little harder. Could Microsoft really have overlooked such a simple problem? All the code reviews and processes they follow should have caught something so simple. But in the end, it was a software bug dealing with the leap year calculation for their compute service. Within 9 hours they had a solution and started to deploy it across the network.
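For readers wondering how a date calculation can go wrong on February 29, here is a minimal sketch of one common class of leap-day bug. It is not Microsoft’s code, and nothing here says what their faulty calculation actually looked like; the helper names `naive_one_year_later` and `safe_one_year_later` are hypothetical and exist only for illustration. The idea is that “one year from today” computed by bumping the year number works on 365 days out of 366 and fails on the one day nobody tested.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Hypothetical buggy helper: bumps the year and keeps month/day as-is.
    # On February 29 the target date does not exist, so this raises ValueError.
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    # Hypothetical fix: fall back to February 28 when the target year
    # has no February 29. Only a leap day can trigger the ValueError here.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    print(safe_one_year_later(leap_day))
    try:
        naive_one_year_later(leap_day)
    except ValueError as err:
        print("naive version failed:", err)
```

Whether to land on February 28 or March 1 is a policy choice; the point is only that the leap-day case has to be decided on purpose rather than discovered in production.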

Nine hours to find, develop, and test a fix for a bug that should have been caught during the hundreds, if not thousands, of code reviews. In my opinion, this outage says much more about Microsoft’s inability to effectively manage their software than it does about their cloud offering.

Customers who thought they were doing the right thing by load balancing their applications across availability zones can no longer feel comfortable staying with one vendor. In the cloud world, customers put a mountain of faith in their cloud provider, and when that provider deploys closed source software, customers should be very worried.

Had Microsoft open sourced the Azure product, they would have benefited from the crowd-sourced development that open source provides. Instead they stuck with what they know best – internal development efforts.

While I can’t predict the actual outcome of this same event had Microsoft open sourced Azure, I can make a reasonable assumption that, had this been an OpenStack deployment or any number of other open source cloud software products, customers would have had a fighting chance of finding the problem earlier.

Azure is back online now and, much like Amazon’s “lightning strike” last year, we’ll all analyze this for months to come. But in the end, it’s all part of the growing pains for cloud providers. When you operate large-scale cloud deployments for a massive customer base, any kind of hiccup will undoubtedly impact hundreds of thousands (if not millions) of users. The value these providers need to focus on is a resilient infrastructure, not the software itself. Let the masses help find flaws in the software, and the infrastructure service becomes more resilient as a result.
