Disaster Recovery in the Cloud: Prioritizing Your Infrastructure

Your virtualized infrastructure can be a bit more complex than traditional client/server systems; thus, it requires new thinking when testing and eventually implementing your DR plans. Second in a series.

Today's systems are not as simple as they used to be. With service-oriented architecture permeating our systems, dependencies reach far beyond any single system. The old days of simple client/server architecture are gone, and it takes more effort to understand today's more complex infrastructures. As a result, disaster recovery planning isn't just about assigning personnel and deciding on a failover facility; it is also an exercise in understanding your entire IT infrastructure.

Most companies have a mix of commercial off-the-shelf applications and custom or semi-custom applications that have been updated and modified over time for the environment. The question is, do you know what your applications need to survive? Too often, customized code is victimized by poor documentation and changing scope. When the time comes to recover an application, what if you discover that the application depended on some small Web service in your environment that was not deemed critical at the time? The entire disaster recovery effort can fail because of that one oversight, regardless of the careful planning that went into it beforehand.

What you must do is understand your systems at a much lower level than servers and networks. Following the ITIL model and creating a configuration management database (CMDB) can help you document the details of your most critical line-of-business applications. Documenting the service dependencies is especially important in this effort. If you don't understand the service contracts and the responsibility assignment of those services, you'll likely reach a point in the processing of a transaction where an assumed service is not available, breaking the entire functionality of the application.
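To make the point concrete, the dependency information captured in a CMDB can be queried programmatically. Here is a minimal Python sketch, using a hypothetical dependency map (all service names are invented), that computes everything an application needs in order to be restored, including the easy-to-miss transitive dependencies:

```python
# Hypothetical service dependency map, as might be exported from a CMDB.
# All names are illustrative, not drawn from any real environment.
dependencies = {
    "order-entry":  ["payment-api", "inventory-db"],
    "payment-api":  ["tax-service"],
    "inventory-db": [],
    "tax-service":  [],
    "reporting":    ["inventory-db"],
}

def recovery_set(app, deps):
    """Return every service (including transitive dependencies)
    that must be restored before `app` can function."""
    needed, stack = set(), [app]
    while stack:
        svc = stack.pop()
        if svc not in needed:
            needed.add(svc)
            stack.extend(deps.get(svc, []))
    return needed

print(sorted(recovery_set("order-entry", dependencies)))
# → ['inventory-db', 'order-entry', 'payment-api', 'tax-service']
```

Note that tax-service appears in the result even though order-entry never calls it directly; that second-hop dependency is exactly the kind of "small Web service" that derails a recovery when it goes undocumented.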

Next, you must prioritize those services. Identify those critical to data integrity, line-of-business applications, and anything whose loss would prevent you from making money, servicing customers, or staying in regulatory compliance. As you prioritize, begin to understand the requirements for your recovery point objective (RPO), which is the point in time to which your data must be recovered (in other words, how much data loss is acceptable), as well as define your recovery time objective (RTO), which is the time it takes to stand the service back up. These are not just questions for IT, but also for the key business consumers who depend on these apps.

When you have your RPO/RTO discussion, be sure you avoid the business' natural instinct to put unrealistic expectations of instantaneous recovery with no data loss onto your entire infrastructure. DR isn't like flipping a switch. Once you have a realistic definition of RPO and RTO, you can begin to put your applications and services into specific categories.

Example of RTO Tiers

Tier 1: Fault tolerant, with no appreciable impact to the end user if a system goes down
Tier 2: Unavailable less than 24 hours
Tier 3: Unavailable less than 48 hours
Tier 4: Unavailable between 2 and 7 days
Tier 5: Unavailable more than 7 days


Example of RPO Tiers

Tier A: No data lost
Tier B: Less than 24 hours of data loss
Tier C: Data loss up to the last backup (usually 24 to 36 hours of data loss)
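
Once tiers like these are defined, the catalog of applications and their tier assignments can be kept in a simple, machine-readable form. The sketch below uses a hypothetical application catalog (names and tier assignments are invented for illustration) and derives a restoration order from the RTO tiers:

```python
# Hypothetical catalog mapping applications to the RTO/RPO tiers above.
# Application names and assignments are examples only.
catalog = {
    "core-banking":   {"rto": "Tier 1", "rpo": "Tier A"},
    "order-entry":    {"rto": "Tier 2", "rpo": "Tier B"},
    "file-server":    {"rto": "Tier 4", "rpo": "Tier C"},
    "archive-search": {"rto": "Tier 5", "rpo": "Tier C"},
}

def recovery_order(catalog):
    """Sort applications so the tightest RTO tiers are restored first."""
    return sorted(catalog, key=lambda app: catalog[app]["rto"])

print(recovery_order(catalog))
# → ['core-banking', 'order-entry', 'file-server', 'archive-search']
```

Even a list this small makes the sequencing conversation with the business easier: everyone can see what gets restored first and argue about placement before the exercise, not during it.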

There are many solutions to disaster recovery, and there are ways to keep your enterprise up and running with a near-instantaneous switch, but no one has unlimited time, money or resources. This is where you translate your RPO and RTO requirements into an implementation. There will be certain cases where you need a hot site, maybe a whole data center, with big network pipes to replicate block-level data between SANs and transaction-level synchronization in your databases, but using this kind of expensive technology for a file server is usually overkill. Only tier 1 applications should use these kinds of solutions.

Some applications may have to be rewritten to take advantage of multi-site redundancy and failover, which often is not financially possible. When applications are designed for DR, differences in network addresses, database server names, or configuration files between sites are too often not factored into the detail. These issues tend to surface during an exercise, but it's an expensive waste of time and resources to discover that a simple text file pointing to a different database name is what caused a large-scale DR exercise to fail.
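One way to avoid that failure mode is to externalize site-specific settings so that failing over means selecting a different configuration section, not hand-editing application files. A minimal Python sketch using the standard-library configparser; the site names, keys and hostnames here are all assumptions for illustration:

```python
# Per-site configuration: the DR site gets its own section, so the
# database hostname difference is documented rather than discovered.
# Site names, keys and hostnames are invented examples.
import configparser

SITES = """
[primary]
db_host = sql01.prod.example.com

[dr]
db_host = sql01.dr.example.com
"""

def db_host(site):
    """Look up the database hostname for the given site."""
    cfg = configparser.ConfigParser()
    cfg.read_string(SITES)  # in practice: cfg.read("sites.ini")
    return cfg[site]["db_host"]

print(db_host("dr"))
# → sql01.dr.example.com
```

Keeping both sites' values side by side in one reviewed file means the "different database name" surprise is caught in a code review instead of mid-exercise.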

One common mistake found in disaster recovery exercises is the assumption that basic infrastructure services exist and are functioning properly. Active Directory, DNS, network VLANs, time services and permissions are all areas that system administrators don't concern themselves with most days, because they are up and working. Even though planning will include basic networking and other concerns like providing servers for Active Directory authentication, the complexity of restoring or creating these infrastructure pieces from scratch is often overlooked and can contribute to missed RTO goals.
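A cheap guard against that assumption is a preflight script that verifies basic infrastructure names resolve before application recovery starts. A minimal sketch; the hostnames are placeholders for your environment's domain controllers, DNS and time servers:

```python
# Preflight check: confirm infrastructure hostnames resolve in DNS
# before starting application recovery. Hostnames are placeholders.
import socket

INFRA_HOSTS = ["dc01.example.com", "dns01.example.com", "ntp01.example.com"]

def preflight(hosts, resolver=socket.gethostbyname):
    """Return the subset of hosts that fail DNS resolution."""
    failed = []
    for host in hosts:
        try:
            resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            failed.append(host)
    return failed

# In a real exercise you would run preflight(INFRA_HOSTS) and halt
# the recovery runbook if the returned list is non-empty.
```

DNS resolution is only the first layer; a fuller preflight would also bind to Active Directory and compare clocks against the time source, but even this much catches the "DNS isn't up yet" failure before it cascades.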

Also be wary of the magical solutions of disaster recovery. Although virtualization enables flexibility like virtual machine replication, it doesn't necessarily reproduce your entire infrastructure. Just because a server is up doesn't mean data and network traffic are flowing to it.

The key to a successful disaster recovery is documenting the details of your services and infrastructure. The more effort you put into that documentation and subsequent prioritization of RTO and RPO for your organization, the better you can plan and handle troubleshooting during an actual DR.

About the Author

Eric Beehler currently has certifications from CompTIA (A+, N+, Server+) and Microsoft (MCITP: Enterprise Support Technician and Consumer Support Technician, MCTS: Windows Vista Configuration, MCDBA SQL Server 2000, MCSE+I Windows NT 4.0, MCSE Windows 2000 and MCSE Windows 2003). He has authored books and white papers, and co-hosts CS Techcast, a podcast aimed at IT professionals. He now provides consulting, managed services and training through his co-ownership in Consortio Services LLC.
