The Cloud Report


Clouds Fail: Deal with It

When you're high in the sky, you have a long way to fall, and the trip down can be spectacular.

Take airplanes, for example. Study after study has concluded that airline travel is statistically safer than driving your car, but when crashes happen, they get everyone's attention.

So it is with cloud computing. This week's Microsoft Azure outage just blew up the tech newswires, and no doubt it was a serious issue. Heck, these days, I get frustrated when someone doesn't return an IM or text within a couple of minutes. I can't imagine the angst of those running mission-critical business applications in the cloud seeing their systems go south for hours.

Yet those statistics remain. According to an IDC study last fall:

The cloud solution also proved to be more reliable, experiencing 76 percent fewer incidents of unplanned outages. When outages occurred, the response time of the cloud solution was half that of the in-house team, further reducing the amount of time that IT users of the services supported by the information governance solutions were denied access. Overall, the combination of fewer incidents and faster response times reduced downtime by over 13 hours per user per year at a cost of $222 per user, a savings of 95 percent.
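To put that 13-hour figure in perspective, here's a quick back-of-the-envelope sketch. It assumes a 365-day year of round-the-clock service, which the IDC excerpt doesn't specify, and the function name is just something I made up for illustration:

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 24x7 service over a 365-day year


def downtime_share_pct(downtime_hours: float) -> float:
    """Return the given downtime as a percentage of one full year."""
    return 100.0 * downtime_hours / HOURS_PER_YEAR


# The 13 hours per user per year quoted above, as a share of the year.
saved = downtime_share_pct(13)
print(f"13 hours is about {saved:.2f}% of a year")
```

Under those assumptions, 13 hours works out to a small fraction of one percent of the year, which is a reminder that even modest-sounding availability differences add up to hours of lost access.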

A more recent report by Nucleus Research about cloud leader Amazon Web Services (AWS) concluded:

Although cloud services provider outages are often highly publicized, private datacenter outages are not. Our data shows customers can gain significant benefits in availability and reliability simply by moving to a cloud services provider such as AWS.

And it seems to me reliability has improved with the great cloud migration. Years ago my work used to be interrupted so often that the phrase "the system is down" became a cliché. I would hear the same thing all the time at some doctor's office or store: "Sorry, we can't do that right now -- the system is down." It's probably no coincidence that the rock band System of a Down was formed in 1994.

So you have to deal with it. You can bet there's hair on fire at Microsoft these days, and there will be plenty of incentive to diagnose the recent problems, fix them and improve the company's cloud service reliability.

And then something else will happen and it will go down again.

No matter how many failover, "always on" or immediate disaster recovery systems are in place, there will be outages. So you just have to mitigate the risks.

Of course, the cloud providers want to help you with this. As Microsoft itself states about its Windows Azure Web Sites (WAWS), you should "design the architecture to be resilient for failures." It provides tips such as:
  • Design a risk-mitigation strategy before moving to the cloud to mitigate unexpected outages.
  • Replicate your database across multiple datacenters and set up automated data sync across these databases to mitigate during a failover.
  • Have an automated backup-and-restore strategy for your content by building your own tools with Windows Azure SDK or using third-party services such as Cloud Cellar.
  • Create a staged environment and simulate failure scenarios by stopping your sites to evaluate how your Web site performs under failure.
  • Set up redundant copies of your Web site on at least two datacenters and load balance incoming traffic between these datacenters.
  • Set up automatic failover capabilities when a service goes down in a datacenter using a global traffic manager.
  • Set up content delivery network (CDN) service along with your Web site to boost performance by caching content and provide a high availability of your Web site.
  • Remove dependency of any tightly coupled components/services you use with your WAWS, if possible.
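
The redundancy and automatic failover tips above boil down to a simple idea: keep more than one copy of the site, check which copies are healthy, and send traffic to one that is. Here's a minimal sketch of that idea. It is not Microsoft's traffic manager or any Azure API; the URLs, the `pick_endpoint` function and the simulated health check are all hypothetical names invented for illustration.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint, in priority order, that passes its health check."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A probe that raises is treated the same as an unhealthy endpoint.
            continue
    # Every copy is down; the caller has to decide what to do next.
    return None


# Hypothetical copies of the same site in two datacenters, primary first.
SITES = ["https://site-east.example.com", "https://site-west.example.com"]

# Simulate an outage in the primary datacenter.
down = {"https://site-east.example.com"}
chosen = pick_endpoint(SITES, lambda url: url not in down)
print(chosen)
```

In a real deployment the health check would be an actual probe of each site, and the routing would happen in a DNS-level or global load-balancing service rather than in application code. The staged failure test Microsoft recommends is essentially the `down` set here: stop one copy and see what the rest of the system does.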

But all this has been said before, many times, and some people disagree, saying it's time to hold the cloud providers more accountable. Take, for example, Andi Mann, who last year penned the piece, "Time To Stop Forgiving Cloud Providers for Repeated Failures." The headline pretty much says it all. Mann goes into detail about the issue and writes:

We cannot keep giving cloud providers a pass for downtime, slowdowns, identity thefts, data loss, and other failures.

It is time for all of us to stop excusing cloud providers for their repeated failures. It is time we all instead start holding them accountable to their promises, and more importantly, accountable to our expectations.

There has even been an academic research paper published concluding "that clouds be made accountable to their customers."

I personally believe that anything made by humans is going to fail, and all we can do is try to prepare for this inevitability and pick up the pieces and keep going when it's over. Some agree with me, such as David S. Linthicum, who wrote an article on the GigaOM site titled, "Are We Getting Too Outage-Sensitive?" Mann, who was partially responding to that article, obviously disagrees.

And, indeed, it's been a tough week for Microsoft. Hot on the heels of the company's reported cloud market share gains, Visual Studio Online experienced some serious problems, and this week's security update was a disaster.

So what's your take? Are Microsoft and other cloud providers doing enough? Should cloud users take more responsibility? Comment here or drop me a line.

Posted by David Ramel on 08/20/2014 at 11:23 AM

