Leap Year Bug Causes Hours-Long Windows Azure Outages (UPDATE)
Microsoft has posted a detailed explanation of what caused the Feb. 29 outage, and says it will give Windows Azure Compute, Service Bus, Access Control and Caching customers a 33 percent credit on their monthly bills "regardless of whether their service was impacted."
At 5:25 p.m. PST on Wednesday, Microsoft said it had "restored full service management functionality for all Windows Azure hosted services in the North Europe sub-region. We have restored full service management functionality for most customers in the South Central US sub-region."
Bill Laing, corporate vice president of Server and Cloud at Microsoft, summarized the outages in a blog post late Wednesday. "Yesterday, February 28th, 2012 at 5:45 PM PST Windows Azure operations became aware of an issue impacting the compute service in a number of regions," he wrote. "The issue was quickly triaged and it was determined to be caused by a software bug. While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year."
Additionally, the Azure Marketplace, which some users were prevented from accessing due to a problem with Azure Compute, is now "fully functional," Microsoft reports.
The original article follows.
Portions of Microsoft's Windows Azure cloud platform are still crippled from a series of outages that has lasted several hours and affected users in multiple countries.
Azure Service Management has been down worldwide since Tuesday evening, according to Microsoft's Windows Azure Service Dashboard. "We are experiencing an issue with Windows Azure service management. Customers will not be able to carry out service management operations," Microsoft said in a Dashboard update.
In subsequent updates, Microsoft said the cause of the outage was "a cert issue triggered on 2/29/2012 GMT," which the Web site Data Center Knowledge speculated could be a "date-related glitch with a security certificate triggered by the onset of the Feb. 29th 'Leap Day' which occurs once every four years."
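Microsoft's Dashboard updates do not show the code involved, so the following is only an illustrative sketch of how a leap-day glitch of this general kind can arise. It assumes a hypothetical routine that computes a one-year validity date by bumping the year and keeping the same month and day. The function names are invented for this example and are not taken from Azure.

```python
from datetime import date


def naive_one_year_later(start: date) -> date:
    # Hypothetical buggy approach: keep month and day, add one to the year.
    # On Feb. 29 this raises ValueError, because the following year
    # has no Feb. 29.
    return start.replace(year=start.year + 1)


def safe_one_year_later(start: date) -> date:
    # One possible fix: fall back to Feb. 28 when the target year
    # has no Feb. 29. (Rolling forward to March 1 is another valid choice.)
    try:
        return start.replace(year=start.year + 1)
    except ValueError:
        return start.replace(year=start.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        naive_one_year_later(leap_day)
    except ValueError as err:
        print("naive calculation failed:", err)
    print("safe calculation:", safe_one_year_later(leap_day))
```

On any other day of the year the two functions return the same result, which is why a bug like this can sit unnoticed until Feb. 29 arrives.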
Microsoft said the issue affected fewer than 3.8 percent of hosted services and that "there is no impact on storage accounts."
Microsoft began rolling out a hotfix early Wednesday morning. At 5:30 a.m. PST, Microsoft reported, "The issue is mitigated and service management is restored for the majority of customers. We still need to work through some issues before we can completely restore service management."
Some customers in Europe and the United States continued to have Service Management problems throughout the morning. As of 11:30 a.m. PST, the time of the last Dashboard update, Microsoft said it is still "actively recovering Windows Azure hosted services in the North Central US, South Central US and North Europe sub-regions. More and more customers applications should be back up-and-running even if service management functionality is not yet restored."
The company also on Wednesday reported ongoing performance degradation problems with the Azure Compute service in parts of the United States and in Northern Europe. According to the Dashboard, this problem began around 2:55 a.m. PST and resulted in some hosted services in those areas not receiving incoming traffic.
"This incident impacts Access Control 2.0, Marketplace, Service Bus and the Access Control & Caching Portal in the same regions where Windows Azure Compute is impacted," Microsoft reported. "As a result affected customers may experience a loss of application functionality."
Additionally, Microsoft said the Azure Compute problem prevented some U.S. customers from logging into the Azure Marketplace or signing on to OAuth.
Microsoft said that by mid-Wednesday it had recovered more than half of the hosted services affected by the Azure Compute issue, but as of 2:30 p.m. PST, "recovery efforts are still underway."
Besides the services already mentioned, other Azure services remain "unavailable" for some customers as of this writing:
- SQL Azure Data Sync for parts of Asia, the United States and Europe has been down since Tuesday, 12:00 a.m. PST.
- SQL Azure Reporting for parts of Europe has been down since Tuesday, 12:00 a.m. PST.
Microsoft said it is troubleshooting both problems. "Further updates will be published to keep you apprised of the situation. We apologize for any inconvenience this causes our customers."
Gladys Rama is the senior site producer for Redmondmag.com, RCPmag.com and MCPmag.com.