Behind Microsoft's Latest Windows Azure Outage
Microsoft's Windows Azure cloud storage service went down worldwide late Friday afternoon, just as I was getting ready to call it a week. An expired SSL certificate was the cause of the outage, Microsoft eventually confirmed.
The Windows Azure outage -- which lasted into Saturday -- is ironic, given last week's study that indicated Windows Azure storage offered the fastest response times out of five large cloud networks, beating those operated by Amazon Web Services, Google, HP and Rackspace. Good thing for Microsoft that Nasuni, the vendor that ran the study, wasn't testing Windows Azure this weekend.
Once the service was back up Saturday, I posted an update noting that Microsoft had fixed the problem and users could once again access their data. The company said the service was 99 percent available early Saturday and completely restored by 8 p.m. PST. But the damage was already done -- and many customers and partners were furious.
In comments posted on a Windows Azure forum, Sepia Labs' Brian Reischl, who first pointed to the SSL certificate as the likely culprit, seemed to feel users should cut Microsoft some slack. Reischl said letting an SSL certificate fall through the cracks is a mistake anyone could make. "I know I have. It's easy to forget, right?" he posted. "It's an amateur mistake, but it happens. You end up with some egg on your face, add a calendar reminder for next year, and move on."
But one has to wonder how Microsoft, which has staked its future on the cloud and has spent billions to build Windows Azure into one of the largest global cloud services, could not have put in safeguards to prevent the domino effect that occurred when that cert expired -- or, at the very least, had a mechanism in place to warn when certificates are about to expire. Putting it in admins' Outlook calendars would be a good start.
Of course, there are more sophisticated tools to make sure SSL certificates don't expire. Among them is SolarWinds' certificate monitoring and expiration management component of its Server & Application Monitor, a favorite among readers of our sister publication, Redmond. Another option not so coincidentally hit my inbox this week: Matt Watson, founder of Stackify, spent a few hours over the weekend developing a free tool called CertAlert.me, which allows site owners to scan the Web sites they own and track SSL and domain name expirations.
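For readers who want to see what the simplest version of such a check looks like, here is a minimal Python sketch using only the standard library. It is not how SolarWinds or CertAlert.me work internally (I have no visibility into either); the function names, the 30-day warning window and the sample dates are all my own assumptions for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Feb 22 20:00:00 2013 GMT". A negative result means the
    certificate has already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True once the certificate is inside the (assumed) warning window."""
    return days_until_expiry(not_after, now) <= warn_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to `host` over TLS and return its certificate's notAfter string.

    The default context verifies the certificate, so a certificate that has
    already expired raises ssl.SSLCertVerificationError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Example use against a site you own (hypothetical host name):
#   not_after = fetch_not_after("www.example.com")
#   if needs_renewal(not_after, datetime.now(timezone.utc)):
#       print("Renew soon:", not_after)
```

Run from a daily scheduled job against each host name you own, something like this would at least beat a calendar reminder, though a real monitoring product also has to cover internal endpoints, certificate chains and alert routing.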
"It happens a lot," Watson told me in a brief telephone conversation regarding outages like the one that struck Friday, which affected Stackify. "All you can do is sit on your hands and pray," he said, adding that years ago he had to deal with an expired SSL certificate. "You buy them and you forget about them and the next thing you know, your site's gone. It's one of those things that get overlooked."
Asked what business opportunity he sees in offering this free service, Watson said he saw it as a chance to bring exposure to his startup's namesake offering, a Windows Azure-based server monitoring platform targeted at easing access for developers while ensuring they don't have access to production systems.
Indeed, you can bet Microsoft is going to ensure it doesn't happen again. "Our teams are also working hard on a full root cause analysis (RCA), including steps to help prevent any future reoccurrence," said Steven Martin, Microsoft's general manager of Windows Azure business and operations, in a blog post apologizing for the disruption. Given the scope of the outage, Microsoft will offer credits in conformance with its SLAs, Martin said.
This is not the first outage Microsoft has had to explain and probably won't be the last. And we all know the number of well-publicized outages Amazon Web Services has encountered in recent years.
If you're a Windows Azure customer, did last week's slip-up erode your confidence in storing your data in Microsoft's cloud? Drop me a line at [email protected] or leave a comment below.
Posted by Jeffrey Schwartz on 02/26/2013 at 12:48 PM