In-Depth

The Quest for Guaranteed Recovery Assurance

You probably have a disaster recovery plan in place. But do you know that it's ready if your worst-case scenario occurs?

Most businesses have a disaster recovery (DR) plan. But unbeknownst to those companies, many of those plans will never work. That's not a problem if the business never experiences data corruption, or if an employee never walks off with a server, or if no tornado, hurricane, or earthquake ever strikes. The question is: Do you want to take the chance of one of these things happening and turning into a nightmare scenario? Having a DR plan isn't enough; you need to know that it's going to work when you need it. The only way to know is to test the plan periodically, a practice called recovery assurance (RA) testing.

Business continuity (BC) allows the business to continue to generate revenue no matter what happens, and is the driver for data protection and DR. DR plans should counter a multitude of threats to applications and systems, ranging from relatively minor data loss or equipment failure, to a major natural disaster such as flooding or a hurricane.

Having an effective DR plan for these circumstances doesn't mean just a nicely written plan on which someone in IT spent a lot of time. It does mean that the DR plan is in place, that it covers applications by service levels and that it's guaranteed to work. This is a matter of basic business survival; losing your compute capabilities for any significant length of time constitutes a disaster -- and the definition of "significant length" gets shorter all the time.

You've Got Choices
Virtualization allowed IT to consolidate workloads by running applications in multiple virtual machines (VMs) on top of shared server hardware, making better use of available resources. These same benefits apply to DR.

In the "old days," IT would have to set up physical servers running dedicated applications at a DR site, mirroring what was running on the production floor. Now, they can keep applications running as VMs on generic hardware, with much more flexibility and at a lower cost, making comprehensive DR more feasible than back in the pre-virtualization, pre-cloud days.

However, while DR is now more feasible and more important than ever, it's also more complicated. Companies now demand tighter recovery point objectives (RPOs) and recovery time objectives (RTOs), and IT has to account for applications running over multiple VMs, load balancing, boot order and dependencies. Other big complicating factors include the sheer size of data and backing up from multiple sites to multiple locations, including remote sites and the cloud.

Ultimately, RPO and RTO must rule the disaster recovery roost: RPO for the maximum amount of data that can be lost without significant business loss, and RTO for the maximum amount of time that an application can be down without significant business loss. Let's look at how well different solutions ensure acceptable RPO and RTO:

  • Do nothing: This "solution" is more common than you might think. It's generally a combination of the unwillingness or inability to do serious DR testing, the fervent hope that nothing really bad will happen, and the over-optimistic belief that even if there is a disaster, the environment can recover before any serious damage is done.
  • Restore from off-site backups: This is the most elementary of DR plans. It's workable in smaller environments with generous RPO and RTO, and a tested restore plan such as a contract with an off-site company to deliver data on removable media within 12 hours of an outage. However, the process is highly manual and error-prone, takes hours to restore from tape or optical drives, and may require rebuilding servers and storage from bare metal.
  • Self-managed DR: These companies manage their own DR programs and are usually heavily integrated with the cloud. IT chooses hot, warm, and cold options depending on application priority and RPO and RTO. Hot options include immediate automated failover to the secondary site upon a threshold event. Warm options enable a failover site that IT manually launches as needed. Cold options present an environment that IT can prepare and launch when needed. The same IT group might invest in all three services, according to budgets and differing application priority. Whatever combination IT chooses, it's absolutely critical that they periodically test all three options.
  • Cloud-based Disaster Recovery as a Service (DRaaS): Instead of internally managing DR, IT works with a cloud-based DR services provider to develop a custom plan. The provider is responsible for building and maintaining infrastructure and verification. It's not a hands-off process; IT works closely with the provider to communicate service levels and DR priorities, and to expand as needed. The DRaaS provider will do the heavy lifting of deploying and managing the recovery infrastructure and verifying recoverability.
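
To make the RPO and RTO definitions concrete, here is a minimal Python sketch that maps an application's RTO to a hot, warm or cold option and checks one real or simulated outage against both objectives. The class and function names, the 15-minute and 4-hour cutoffs, and the sample timings are all assumptions made for illustration; they don't come from any particular product or standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    """Recovery objectives for one application (illustrative values only)."""
    rpo: timedelta  # maximum tolerable data loss, expressed as a time window
    rto: timedelta  # maximum tolerable downtime


def choose_tier(obj: Objectives) -> str:
    """Map an RTO to a hot, warm or cold option.

    The 15-minute and 4-hour cutoffs are hypothetical; pick your own.
    """
    if obj.rto <= timedelta(minutes=15):
        return "hot"
    if obj.rto <= timedelta(hours=4):
        return "warm"
    return "cold"


def meets_objectives(obj, last_good_copy, outage_start, service_restored):
    """Return (rpo_met, rto_met) for one real or simulated outage."""
    data_loss_window = outage_start - last_good_copy  # data written since the last copy
    downtime = service_restored - outage_start        # how long users were affected
    return data_loss_window <= obj.rpo, downtime <= obj.rto


if __name__ == "__main__":
    billing = Objectives(rpo=timedelta(minutes=5), rto=timedelta(minutes=10))
    outage = datetime(2024, 1, 1, 12, 0)
    result = meets_objectives(
        billing,
        last_good_copy=outage - timedelta(minutes=3),
        outage_start=outage,
        service_restored=outage + timedelta(minutes=25),
    )
    print(choose_tier(billing), result)  # prints: hot (True, False)
```

In this made-up example the last good copy was recent enough to satisfy the RPO, but the 25-minute recovery blew through the 10-minute RTO. That kind of gap is exactly what periodic testing is meant to expose before a real outage does.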

What Is Recovery Assurance?
Let's talk more about verifying recoverability, or recovery assurance (RA); it's also called guaranteed DR, reliable DR, DR assurance or DR testing. At its simplest, RA means doing enough testing on your backups and replications that you know you can recover systems in the event of a failure.

However, as the old saying goes, the devil is in the details. RA can be complicated because IT needs to pay ongoing, consistent attention to keeping applications continuously available. This is not a trivial undertaking.

First of all, the DR environment is subject to entropy. IT needs to regularly carry out DR testing to keep production and recovery environments in sync. Once a year, or even once a quarter, is probably not enough. Data grows, OSes are updated, patches pile up, applications are upgraded to new versions and so on.
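
One simple way to keep test frequency honest is to track when each application last passed a DR test and flag anything that's overdue. The Python sketch below assumes a hypothetical three-level criticality scheme; the level names and maximum test ages are placeholders, not recommendations.

```python
from datetime import date, timedelta

# Hypothetical maximum age of the last successful DR test, per criticality level.
MAX_TEST_AGE = {
    "critical": timedelta(days=1),
    "important": timedelta(days=7),
    "standard": timedelta(days=30),
}


def overdue_tests(last_tested, criticality, today):
    """Return the names of applications whose last successful test is too old.

    last_tested maps name -> date of the last successful test (missing if never tested).
    criticality maps name -> a key of MAX_TEST_AGE.
    """
    overdue = []
    for name, level in criticality.items():
        tested = last_tested.get(name)
        # An application that was never tested is always overdue.
        if tested is None or today - tested > MAX_TEST_AGE[level]:
            overdue.append(name)
    return sorted(overdue)
```

Called with a critical application last tested two days ago and an application that has never been tested, `overdue_tests` returns both names. A real RA toolset would track far more than dates, but even a check this small makes drift visible.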

RA also needs to be non-disruptive. As critical as it is, testing cannot compromise production. To avoid disruption, some IT organizations resort to evenings and weekends to verify recovery operations.

Finally, RA must be robust enough to detect and flag issues so IT can correct them. Typical examples include application-inconsistent backups across multiple interdependent VMs, failed backups and corrupted backups (see Figure 1).

Figure 1. Your Recovery Assurance plan needs to detect multiple points of failure.
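
As a rough illustration of detecting and flagging failed and corrupted backups, the Python sketch below compares each stored backup against a SHA-256 checksum recorded when the backup was taken. The `BackupRecord` catalog structure and the in-memory `storage` dictionary are hypothetical stand-ins for whatever your backup software actually exposes.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class BackupRecord:
    """What a hypothetical backup catalog stores about one backup job."""
    name: str
    completed: bool                 # did the job report success?
    expected_sha256: Optional[str]  # checksum recorded when the backup was taken


def verify_backup(record: BackupRecord, stored_bytes: Optional[bytes]) -> str:
    """Classify one backup as 'ok', 'failed' or 'corrupt'."""
    if not record.completed or stored_bytes is None or record.expected_sha256 is None:
        return "failed"
    actual = hashlib.sha256(stored_bytes).hexdigest()
    return "ok" if actual == record.expected_sha256 else "corrupt"


def flag_problems(records, storage):
    """Return {backup name: status} for every backup that is not 'ok'."""
    report = {}
    for record in records:
        status = verify_backup(record, storage.get(record.name))
        if status != "ok":
            report[record.name] = status
    return report
```

A matching checksum only shows that the stored bytes haven't changed since the backup was written. It says nothing about whether the backup is application-consistent or will actually boot, so a check like this complements a test restore rather than replacing it.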

What Do You Need in an RA Solution?
  • Automated testing. Manual testing has its place, but only automated testing can sufficiently test DR across a variety of service levels and applications. For example, critical applications might require a near-continuous process of backup and time-to-failover testing. Less-critical applications won't have to be tested as often, because IT will have the time to rebuild application servers in-house and restore data.
  • The ability to verify application-consistent backups. You'll likely be dealing with multiple VM applications and boot order orchestration, and possibly OS and software upgrades. Your RA solution needs to ensure that all of those things are checked out and in order.
  • The ability to account for failed backups. RA needs to note that a backup has failed and flag it, so IT can fix any problems and reissue a backup command. It also needs to verify completion and check for corruption on finished backups, to avoid being dependent on an unrecoverable backup set.
  • The ability to do sandbox testing. Verifying recovery in a sandbox environment lets you test and tweak your DR plan without disrupting production.
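
Boot order orchestration, mentioned above, is at heart a dependency-ordering problem. The sketch below uses Python's standard `graphlib` module (Python 3.9 and later) to derive a start order for a sandbox test; the `boot_order` helper and the VM names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+


def boot_order(depends_on):
    """Return VM names in an order that starts every dependency first.

    depends_on maps each VM to the set of VMs that must be running before it.
    Raises ValueError if the dependencies form a cycle.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # The second element of args holds the list of nodes in the cycle.
        raise ValueError(f"circular VM dependency: {exc.args[1]}") from exc


if __name__ == "__main__":
    vms = {"web": {"app"}, "app": {"db"}, "db": set()}
    print(boot_order(vms))  # prints: ['db', 'app', 'web']
```

Rejecting circular dependencies up front matters because a cycle would otherwise show up only as a recovery that hangs, which is the kind of surprise RA testing exists to prevent.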

How To Deliver RA
You can do RA in-house, as a purchased service, or as a combination of the two. There are two do-it-yourself approaches to RA: testing manually and using a third-party tool. The first method is usually inadequate, because a sufficient RA process is long and complex. If you're going to do RA in-house, it's far better to use one of the automated third-party toolsets available.

These toolsets will enable you to non-disruptively test multiple applications for recovery. Although highly automated toolsets will probably cost more, they'll come with two big benefits: they'll work more thoroughly, and they'll let you test more frequently. When it comes to RA, don't practice a false economy.

The second major RA option is to turn it over to a cloud-based DR-as-a-Service (DRaaS) provider. DRaaS in the cloud is worth looking into, with more and more established vendors entering the market. Benefits include offloading RA work from your staff to a provider who already specializes in it, and reducing risk by entrusting your RA process to recovery experts. If cost is a consideration (and when is it not?), you can selectively move critical applications to the service and take care of the rest in-house (see Figure 2).

Figure 2. Advantages vs. disadvantages of DRaaS.

The Long and Painful Road
There's good news and bad news about RA. The bad news is that there is no safe escape from the complexity of providing effective DR. The good news is that DR and RA technology and services are getting better all the time, and are much more accessible across the board, from enterprises to small and midsize businesses. Strongly consider taking advantage of these tools and services. The time and money you might save by not doing DR well are negligible compared to the cost of a long and painful recovery process.

About the Authors

Jim Whalen is an analyst specializing in data protection at the Taneja Group (tanejagroup.com).

Christine Taylor is an analyst specializing in data protection at the Taneja Group (tanejagroup.com).
