Dan's Take

Disaster Planning Lessons From the Amazon S3 Outage

Don't blame it all on Amazon.

Yesterday, Amazon's S3 cloud storage service suffered a well-publicized outage. Many companies that use the service discovered whether their Disaster Recovery (DR) plans were up to the challenge or failed in disgrace. One of my insurance companies, for example, had asked customers to update something in their account information the day before the outage; when the outage hit, customers couldn't log in. A later message blamed AWS S3.

Although my insurance company, like many others, was quick to blame Amazon, I know who the real culprit was: the insurance company itself. This was a real-world test, and the company's DR plans clearly failed. While it's human nature to disparage the supplier, companies would be wise to examine their own plans and IT implementation before they publicly point the finger at a supplier such as Amazon.

I'm not acting as an apologist for Amazon. While they have a fairly good track record for uptime, no cloud service provider is perfect.

What Constitutes a Disaster?
While the AWS S3 outage might have been a major contributing factor to the visible failures, the outage in itself wasn't the entire cause. Lack of planning and poor execution are the leading candidates for the real cause of a visible outage.

Planning for disasters involves many levels of business, facilities and IT management as well as experts in systems software, virtualization technology, application frameworks, application development, database management, and storage and networking. If an outage in one component -- S3, for example -- forces a cascade of other failures, resulting in the loss of a critical application, it's clear that someone, somewhere didn't have an adequate plan in place, or that the plan that was in place wasn't executed well. Let's consider some of the ways a customer might have prepared for this type of event.

Processing:

  • Hardware that supports continuous processing -- that is, non-stop, fault-tolerant computers such as the ftServer from Stratus Technologies -- might be one way to keep processing from failing. These systems are designed with multiple layers of redundant hardware and special firmware that detects failures and moves processing to surviving system components. Failover is automatic and takes only microseconds. Having a backup system based on this type of hardware might be a good answer for critical applications.

  • Clusters of systems designed to detect slowdowns or failures and move applications and/or data to maintain continuing operations. Suppliers such as Dell, HPE, IBM, Microsoft, Oracle, Red Hat, SUSE and many others offer this type of DR solution. Cluster software managers monitor the health of systems, applications, and application components and move functions to another system when a failure or slowdown is detected. Some cluster managers make it possible for instances to reside in the cloud to create a hybrid computing environment. Onsite systems could act as warm standbys to pick up workloads if the cloud supplier's network becomes unavailable. Some products in this category make it possible for remote nodes to be part of different cloud computing environments, allowing failover in any direction: cloud to on-premises, cloud to other cloud, or on-premises to cloud.

  • Virtualization and function migration. Suppliers of virtual machine software (VMware, Citrix, Microsoft and open source communities); operating system virtualization and partitioning (i.e., containers); network virtualization (i.e., software-defined networking or SDN); and storage virtualization all offer ways to migrate functions from a failing system to a healthier environment. Many of these suppliers support migration to and from major cloud computing environments.
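
To make the cluster-manager idea concrete, here is a minimal Python sketch of health-check-driven failover. It is not how any of the products named above work; the class name FailoverMonitor, the node names "cloud" and "onprem", and the max_failures threshold are all hypothetical, chosen only for illustration. The health check is passed in as a function so the sketch stays independent of any real monitoring API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class FailoverMonitor:
    """Hypothetical sketch: choose the active node from an ordered preference list."""

    nodes: List[str]                 # ordered by preference, e.g. cloud first
    check: Callable[[str], bool]     # returns True when the node looks healthy
    max_failures: int = 3            # consecutive failed checks before failing over
    failures: Dict[str, int] = field(default_factory=dict)

    def poll(self) -> Optional[str]:
        """Run one round of health checks and return the node that should be active."""
        for node in self.nodes:
            if self.check(node):
                self.failures[node] = 0  # a healthy check resets the counter
            else:
                self.failures[node] = self.failures.get(node, 0) + 1
        # The first node still under the failure threshold wins.
        for node in self.nodes:
            if self.failures.get(node, 0) < self.max_failures:
                return node
        return None  # every node is considered down


if __name__ == "__main__":
    # Simulated health state: the cloud node is down, the on-premises standby is up.
    status = {"cloud": False, "onprem": True}
    monitor = FailoverMonitor(["cloud", "onprem"], check=lambda n: status[n], max_failures=2)
    print(monitor.poll())  # cloud (a single failed check is tolerated)
    print(monitor.poll())  # onprem (second consecutive failure triggers failover)
```

Requiring several consecutive failures before switching is a common way to avoid flapping on a single dropped check; a real cluster manager would also have to move application state, which this sketch ignores.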

Storage:

  • Storage systems that keep multiple copies of each data item, making it possible for applications to continue accessing and updating these data items even though some component has failed. Suppliers such as EMC, HDS, NetApp, and many others include this capability in their storage servers. Replication software that can keep copies of data items in several places is also available; it executes either in the storage server itself or in host systems attached to the storage server. In the case of a failure, operations staff can point applications to data items in another location. Some products will automatically redirect storage requests rather than requiring manual intervention. This would make it possible for local copies of data to be pressed into service when cloud storage becomes unavailable.

  • Storage virtualization software that, like storage hardware, keeps multiple copies of each data item. This approach, offered by suppliers such as DataCore, Veeam, Citrix/Sanbolic and others, also makes it possible for the data to be replicated to other data centers or cloud services. This software can hide failures and allow applications to continue to execute by failing over to other copies of the data.
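
The automatic redirection described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any vendor's implementation: the function read_with_fallback and the replica names are hypothetical, and each replica is modeled as a plain function that returns bytes or raises an exception. The sketch covers reads only; keeping updates consistent across copies is the hard part that replication products exist to solve.

```python
from typing import Callable, Iterable, List, Tuple

# A reader takes a key and returns the stored bytes, or raises on failure.
Reader = Callable[[str], bytes]


def read_with_fallback(key: str, replicas: Iterable[Tuple[str, Reader]]) -> Tuple[str, bytes]:
    """Try each replica in order; return (replica name, data) from the first that answers."""
    errors: List[str] = []
    for name, reader in replicas:
        try:
            return name, reader(key)
        except Exception as exc:  # sketch only: real code would catch narrower errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError(f"all replicas failed for {key!r}: " + "; ".join(errors))


if __name__ == "__main__":
    local_copy = {"policy/42": b"account record"}

    def cloud_reader(key: str) -> bytes:
        # Simulates the cloud storage service being unreachable.
        raise ConnectionError("cloud storage unavailable")

    def local_reader(key: str) -> bytes:
        return local_copy[key]

    source, data = read_with_fallback(
        "policy/42", [("cloud", cloud_reader), ("local", local_reader)]
    )
    print(source, data)  # local b'account record'
```

Listing the replicas in preference order keeps the policy visible in one place, so adding a second cloud or another data center is a one-line change.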

Dan's Take: Don't Plan to Fail
It's clear from the number of companies reporting application or entire site failures due to Amazon's outage that there are a few red faces among the companies' IT staff. Their failure planning didn't achieve its goal, and that failure was obvious to everyone.

Moving to a cloud-computing solution doesn't mean that organizations are relieved of all responsibilities. It's crucial to monitor what's happening with cloud computing services and applications in order to detect potential failures and move applications, data or both to safe havens.

"Failing to plan is planning to fail" is a saying that's especially relevant today. IT designers must plan for system, memory, storage, network and application component failures. We can all see what happens when either there was no planning or the plans didn't work.

About the Author

Daniel Kusnetzky, a reformed software engineer and product manager, founded Kusnetzky Group LLC in 2006. He's literally written the book on virtualization and often comments on cloud computing, mobility and systems software. He has been a business unit manager at a hardware company and head of corporate marketing and strategy at a software company.
