Designing for Cloud Failure - A Q&A with Riverbed's Apurva Dave -- Virtualization Review

Q&A: Just because your data is in the cloud doesn't mean it's always available. Riverbed's Apurva Dave has advice on guarding against cloud outages.

By James E. Powell
12/12/2012

Moving your application to the cloud offers several benefits, but what happens when your cloud provider's service goes down? IT is waking up to the implications of application interruptions. ESJ.com Editorial Director James Powell spoke to Apurva Dave, vice president of products and marketing at the Stingray business unit at Riverbed, a company that helps enterprises implement strategic initiatives such as cloud computing and disaster recovery.

Enterprise Strategies: How has the cloud, specifically recent high-profile outages such as Amazon's, changed executives' view of disaster recovery and business continuity?
Apurva Dave: It's changed executives' views in two ways. First, it has made them more aware of additional processes that may need to be put in place for effective disaster recovery. Second, it has shown them how quickly business continuity can be affected by the cloud.

It's also left some questioning the inherent benefits of the cloud, leaving them to wonder, "Can I do better on my own?" "Should I keep the ownership and operation of my data center infrastructure in-house?"

The short answer is no. Despite major cloud outages, the truth is that major cloud providers offer better uptime and business continuity than organizations can achieve on their own.

What proactive steps can organizations take to prepare themselves for unplanned cloud outages?
Designing for failure is one of the most important steps an organization can take to prepare for an unplanned outage. Within this model, combinations of your software and management tools take responsibility for application availability. The goal is to build an infrastructure that is resilient to failure and ensures each system can stand on its own so that in the event of an outage, you maintain 100 percent uptime and business continuity.

What are the most common mistakes IT professionals make when faced with a cloud outage?
The most common mistakes happen well before the outage occurs, when IT professionals don't plan proactively and don't design for failure. Here are three of the most common mistakes:

Assuming your data is backed up. Never assume someone else is protecting your organization; make sure your provider has a backup or disaster recovery plan in place.
Relying solely on one cloud provider. Many companies entrust their entire infrastructure to one provider's cloud, so when that cloud goes down, so do they. To avoid this, it's best to deploy your applications across multiple availability zones within that cloud or across cloud providers.
Avoiding redundancy. In the event of a cloud outage, when critical data can be lost, it's important to have duplicate copies of various data, equipment, systems, or all of the above, on multiple computers or units in the data center.
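
As a rough illustration of the second and third points, the sketch below flags applications that run in only one location. The inventory format, the application names, and the provider and zone labels are all hypothetical; a real check would pull this data from your providers' own tooling.

```python
from collections import defaultdict

# Hypothetical inventory: one (application, provider, zone) tuple per running instance.
INSTANCES = [
    ("storefront", "provider-a", "zone-1"),
    ("storefront", "provider-a", "zone-2"),
    ("billing", "provider-a", "zone-1"),
    ("billing", "provider-a", "zone-1"),
]


def apps_without_redundancy(instances):
    """Return the sorted names of apps whose instances all sit in one provider zone."""
    locations = defaultdict(set)
    for app, provider, zone in instances:
        locations[app].add((provider, zone))
    return sorted(app for app, places in locations.items() if len(places) < 2)


if __name__ == "__main__":
    for app in apps_without_redundancy(INSTANCES):
        print(f"{app}: runs in a single zone of a single provider")
```

Here "billing" is flagged even though it has two instances, because both live in the same zone and would go down together.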

What benefits can organizations see by exposing themselves to failure early on?
Exposing your organization to failure early on allows you to learn quickly from mistakes and to build a resilient infrastructure that can withstand any public or private cloud failure. By taking preemptive measures early and often, you can address outages head-on with minimal disruption to business continuity or impact on the bottom line.

When designing for failure, is it better to start off with a public or private cloud?
It really depends on your business model and the industry you're in. Organizations that have to adhere to strict compliance regulations or have a large number of mission-critical applications that require enterprise-class infrastructure, such as financial or health-care organizations, will probably opt for a private cloud. On the other hand, organizations that require high availability, scalability, and elasticity (such as an e-commerce or online gaming company) are more likely to start off in a public cloud.

What is an "availability zone" and how does it come into play when designing for failure?
Availability zones (AZs) are distinct locations engineered to be insulated from failures in other zones and are provided with low-latency network connectivity to other AZs in the same region. In the event of a regional disaster, a key way to ensure an application will remain available is to host it in multiple AZs. This concept is called "cross-AZ balancing," and it is a key benefit we explain to potential customers of using our product, Riverbed Stingray Traffic Manager (STM), in Amazon. The solution's load balancing capability diverts traffic to an available data center during an outage, and its software design allows it to move more easily.

By having your application instances in separate AZs, users can be redirected in real time to another zone if one goes down. If the secondary zone is far from the end user, performance may be slower, but your service will be up and running. The secondary zone's infrastructure can be kept small and then auto-scaled up in the event of failover. Leveraging AZs is just one of the many ways to design for failure, in that you're providing a way to maintain application availability and business continuity, even if the primary host cloud should fail.
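
The failover behavior described above can be sketched in a few lines. This is a minimal illustration, not how Stingray Traffic Manager or any particular load balancer works: the `Zone` class, the `healthy` flag, and the instance counts are assumptions, and the scale-up step is a stand-in for a real auto-scaling call.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """Hypothetical view of one availability zone."""
    name: str
    healthy: bool
    instances: int


def choose_zone(zones):
    """Return the first healthy zone in preference order, or None if all are down."""
    for zone in zones:
        if zone.healthy:
            return zone
    return None


def fail_over(zones, target_instances):
    """Pick a zone to serve traffic and grow it if it was kept as a small standby."""
    zone = choose_zone(zones)
    if zone is None:
        raise RuntimeError("no healthy zone available")
    if zone.instances < target_instances:
        # Stand-in for a real auto-scaling request to the cloud provider.
        zone.instances = target_instances
    return zone


if __name__ == "__main__":
    zones = [
        Zone("primary", healthy=False, instances=8),
        Zone("secondary", healthy=True, instances=2),
    ]
    active = fail_over(zones, target_instances=8)
    print(active.name, active.instances)
```

With the primary marked unhealthy, traffic goes to the secondary zone, which is scaled from two instances to the target of eight.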

What is "cloud balancing" and what benefits do organizations often see from it?
Cloud balancing is a similar concept to leveraging AZs, but instead of only using a single cloud provider's infrastructure, you're balancing your applications across multiple providers. For example, you may use a combination of AWS with RackSpace, GoGrid, and IBM. Cloud balancing assumes that an organization has an application delivery infrastructure that is hyper-portable across clouds so that you can ensure that all functionality that you implement in your application delivery controller (ADC) is available in all locations.
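
One simple way to picture cloud balancing is a weighted rotation over whichever providers are currently healthy. The sketch below is only an illustration under assumed inputs: the provider labels and weights are hypothetical, and a production ADC would also account for health checks, latency, and session persistence.

```python
import itertools


def build_rotation(weights, healthy):
    """Return an endless rotation of healthy providers, repeated in proportion to weight."""
    pool = [
        name
        for name, weight in weights.items()
        if name in healthy
        for _ in range(weight)
    ]
    if not pool:
        raise RuntimeError("no healthy provider available")
    return itertools.cycle(pool)


if __name__ == "__main__":
    # Hypothetical weights: cloud-a should receive twice the traffic of the others.
    weights = {"cloud-a": 2, "cloud-b": 1, "cloud-c": 1}
    # cloud-b is assumed to be down, so it is left out of the healthy set.
    rotation = build_rotation(weights, healthy={"cloud-a", "cloud-c"})
    print([next(rotation) for _ in range(6)])
```

Because the unhealthy provider is dropped from the pool, its share of traffic is spread over the remaining providers rather than lost.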

Cloud balancing provides more than just a security blanket in the event of a cloud outage, enabling an organization to:

Develop an application that is battle tested across multiple cloud platforms
Benefit from different service-level agreements (SLAs) and different data center locations
Provide leverage in case an organization needs to shift cloud strategies in the future
Increase reliability of cloud-based infrastructure, while improving application performance

What products or services does Riverbed Stingray offer that are relevant to our discussion today?
Riverbed Stingray Traffic Manager enables organizations to take advantage of the scalability and improved reliability that cloud load balancing techniques deliver. Organizations can use it to deploy applications across multiple locations and environments. It addresses some of the technical and organizational challenges they face, such as reducing risk and containing costs, by enabling the use of multiple data centers (including cloud resources), by routing and shaping traffic, and by dynamically scaling applications to provide the capacity required by the current load. It also helps enterprises gain a global perspective of application performance and reliability across multiple locations, with detailed visualization and reporting to understand traffic trends and manage user interactions in real time.

About the Author

James E. Powell is the former editorial director of Enterprise Strategies (esj.com).