Disaster Recovery 101: Back Up, Restore and Verify with VMware
From vSphere Replication to SRM, Trevor Pott discusses disaster recovery basics.
When is a backup not a backup? When you can't get information out of that backup. Backups are worthless if you can't actually restore from them, a truism that underlies much of the real-world planning around disaster recovery.
VMware's broad portfolio of applications and services provides an excellent basis for examples and discussion of disaster recovery issues. Chief among the relevant products and features are vSphere Replication, VM Encryption and Site Recovery Manager (SRM).
Building on its popularity in the on-premises datacenter, VMware's backup and disaster recovery solutions tie into its VMware Cloud on AWS offering, as well as the myriad vCloud Director (vCD)-based services provider clouds. VMware, of course, also has a broad ecosystem of partners who provide solutions in this space.
Restore Ability
There are a number of reasons why information might not be restorable from a backup. One is that the backups themselves were lost, perhaps because they were stored at the same site as the primary data. VMware administrators can hit this failure mode if they rely solely on vSphere snapshots for data protection.
Snapshots offer the ability to roll back after a change (an update, for example) proves not to work. Snapshots are not backups: they depend on the virtual machine's original disks and typically sit on the same storage, so losing that storage means losing the snapshots too.
Even administrators who have invested in off-site backups can find that those backups can't be restored. The encryption keys might be lost, the backups might be corrupted, or the organization might no longer have access to the application required to decode the metadata and extract usable information.
VMware has solutions to these problems. Encryption issues are handled through integration with a key management server. Corruption and metadata sanity are addressed by using the testing capabilities built into SRM.
In addition to being able to extract data from a backup, you must have somewhere to put that data, and a plan to do something with it. A small organization with a single premises might have a disaster recovery plan that consists of "Do not return to operation if the building burns down," or it could turn to insurance to replace losses, knowing that rebuilding -- including replacing IT infrastructure -- could take months.
In these and similar scenarios, the organization views backups primarily as a check against ransomware and other recovery needs that fall short of a disaster. Should disaster strike, the primary use of the backups is likely to be extracting information for tax and insurance purposes.
Many businesses, however, intend to continue operating, even in the face of a disaster. Even for the smallest organizations, there are disasters that can take out on-premises IT without being a total loss for the business. Floods, theft and electrical surges that fry computers all qualify.
vSphere Replication addresses this problem. It is included with vSphere editions from the Essentials Plus package upward, and it provides basic site-to-site replication for organizations that have multiple sites, with recovery point objectives (RPOs) as low as 5 minutes.
For most businesses, then, a backup is not a backup unless the organization can both extract the information to somewhere useful and act on the data that has been recovered. In other words, backups are only part of the equation. Disaster recovery planning is a necessary component. This is where SRM comes in.
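An RPO is only meaningful if someone checks that replication is actually keeping up with it. The following is a minimal sketch of that arithmetic, not a vSphere Replication API: the function name, the VM names and the timestamps are all hypothetical, and it assumes the last successful sync time for each VM has already been obtained from whatever reporting the replication tool provides.

```python
from datetime import datetime, timedelta, timezone


def rpo_violations(last_sync_by_vm, rpo, now=None):
    """Return {vm_name: lag} for every VM whose last sync is older than the RPO.

    last_sync_by_vm is a hypothetical mapping of VM name to the timezone-aware
    datetime of its last successful replication.
    """
    now = now or datetime.now(timezone.utc)
    violations = {}
    for vm_name, last_sync in last_sync_by_vm.items():
        lag = now - last_sync
        if lag > rpo:
            violations[vm_name] = lag
    return violations


if __name__ == "__main__":
    # Illustrative data only; real timestamps would come from the replication tool.
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    last_sync = {
        "mail01": now - timedelta(minutes=3),
        "db01": now - timedelta(minutes=17),
    }
    for vm, lag in rpo_violations(last_sync, timedelta(minutes=5), now).items():
        print(f"{vm}: last replicated {lag} ago, which exceeds the RPO")
```

With a 5-minute RPO, only the hypothetical db01 is reported, because its last sync is 17 minutes old.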
Verification Matters
SRM is VMware's data protection orchestration layer. It is predicated on two ideas: that being able to extract information from your backups should not be left to chance, and that organizations should not blindly trust that the backup solution will work. Verification of the backups matters, and SRM exists so that organizations can be certain their backups will do the job.
Most backup applications and services come with some form of backup verification. This is used to ensure that the backup solution is backing up what it intended to back up. It does not guarantee that an organization is backing up what it expected to back up, nor does it guarantee that the backup application made sense of what it was supposed to back up. It merely verifies that what went into the backup application on one end came out the other.
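The narrow kind of verification described here, confirming that what went in is what came out, can be illustrated with checksums. This is a rough sketch, assuming a test restore has been extracted to a directory alongside a copy of the source data; the function names and directory layout are hypothetical, and real backup products perform this check internally in their own ways.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(source_root, restored_root):
    """Return (missing, mismatched) lists of paths relative to source_root.

    missing: files present in the source but absent from the restore.
    mismatched: files present in both whose contents differ.
    """
    source_root, restored_root = Path(source_root), Path(restored_root)
    missing, mismatched = [], []
    for src in sorted(p for p in source_root.rglob("*") if p.is_file()):
        rel = src.relative_to(source_root)
        restored = restored_root / rel
        if not restored.is_file():
            missing.append(str(rel))
        elif sha256_of(src) != sha256_of(restored):
            mismatched.append(str(rel))
    return missing, mismatched
```

A clean result from a check like this shows only that the bytes survived the round trip. It says nothing about whether the right data was selected for backup in the first place, which is why the stakeholder review matters.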
True backup verification requires more than technology. It involves a multi-stakeholder approach, where all stakeholders who are expecting backups to operate on their data have the opportunity to verify that the data they expect to be backed up is being backed up. And, more importantly, that it can be restored as needed.
Another important consideration is whether a backup can be restored in the manner that is expected. A backup solution that relies entirely on snapshots, for example, may not offer the ability to do file-level restores from operating system environment (OSE) images. Similarly, the ability to do granular restores from databases or message-level restores from mail servers might not exist. VMware administrators seeking these features need to turn to ecosystem partners, as VMware's own data protection solutions do not provide any application-level functionality.
All stakeholders should not only be sure that their data can be recovered, but they must be made aware of the time these recoveries will take, and the steps required to recover that data. Most important of all, everything needs to be tested to ensure that what the stakeholders think should happen is, in fact, what will happen.
Networking Concerns
For those who are backing up entire workloads, restoring those workloads so they can be put back into service can be tricky. Restoring a workload to its original environment is usually reasonably simple, but restoring a workload to a disaster recovery site rarely is.
The most commonly encountered issues are related to networking. For example, unless you're using layer 2 bridging between sites -- a networking solution that remains uncommon -- static IP addresses cannot be used. The new site will have a different network configuration, and in order to be accessible, workloads will need completely dynamic networking. This may include a requirement that the workload be able to have its virtual NIC replaced as part of the restore process.
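One simple check during a restore test is whether each recovered workload actually picked up an address on the disaster recovery site's network, rather than keeping a static address from the primary site. The sketch below uses Python's standard ipaddress module; the function name, VM names, addresses and subnet are all made up for illustration, and it assumes the addresses have already been collected from the restored workloads by some other means.

```python
import ipaddress


def misplaced_workloads(addresses_by_vm, site_network):
    """Return the sorted names of VMs whose address is outside the site's network.

    addresses_by_vm is a hypothetical mapping of VM name to IP address string.
    """
    network = ipaddress.ip_network(site_network)
    return sorted(
        vm
        for vm, address in addresses_by_vm.items()
        if ipaddress.ip_address(address) not in network
    )


if __name__ == "__main__":
    # Illustrative values: db01 has kept a static address from the primary site.
    restored = {"web01": "10.20.4.15", "db01": "192.168.1.50"}
    print(misplaced_workloads(restored, "10.20.0.0/16"))
```

Here the hypothetical db01 is flagged, since 192.168.1.50 does not fall inside 10.20.0.0/16. A workload that passes this check can still be unreachable if DNS or firewall rules have not been updated.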
For many workloads, there is more to be considered than simply ensuring that IP addresses are dynamic and DNS gets updated with the new IP. Workloads that are expected to be accessible from outside the firewall will need to have firewall rules created or altered to reflect the restoration of that workload to the disaster recovery site.
VMware provides NSX as the solution to networking woes. VMware products such as SRM integrate with NSX, allowing for automation and orchestration of network configurations as required for data protection and testing purposes.
For some workloads, ensuring that the networking part of the equation is handled is time-consuming and difficult. It's also critical to solve this problem before disaster strikes. The absolute wrong time to discover that you cannot access your workloads after restore is when the primary site has gone down and you need those workloads up immediately.
Cloud Disaster Recovery
The cloud offers a cost-effective way to do backups and disaster recovery. There's no need to establish, operate and maintain a second site. An organization simply rents space in someone else's datacenter, sends the backups there, and if disaster strikes, lights up the workloads on the cloud provider's infrastructure.
Not all cloud providers are equal, however, and some are far less forgiving than others. Consider those cloud providers that don't offer console access to virtual machines. If the networking problems discussed earlier are not resolved, then any attempt to light up a backed-up workload risks bringing it up with networking that's not appropriate for the environment, leaving it unreachable by OSE-level remote access tools. For organizations that don't adequately and regularly test their backups, picking a cloud provider that offers console access is critical.
In addition, cloud providers update their infrastructure and solutions at different rates. When an organization is backing up on-premises workloads to a cloud provider, it's important to understand the planning and deployment cycles of that cloud provider. Ensuring that your on-premises hypervisors, choice of OSEs, backup solution and cloud provider are all compatible is an ongoing process.
VMware's headline cloud offering is, of course, VMware Cloud on AWS. VMware Cloud on AWS is essentially a fully featured VMware solution installed on bare metal servers and hosted in an AWS datacenter. For those not interested in VMware Cloud on AWS, there are numerous vCloud Director-based services providers around the globe.
Regional services providers and managed services providers that have evolved into cloud services providers set themselves apart from the more mainstream public clouds. Services providers tend to offer more personalized management and even full-service consulting to help organizations solve their backup and disaster recovery concerns. They also offer solutions to data sovereignty and privacy concerns that may not be adequately addressable by using a solution backed onto one of the major public cloud providers.
Regardless of your vendor ecosystem, always remember: If your data doesn't exist in at least two places, it doesn't exist. If you haven't verified that your data can be restored to a useful state, then it also doesn't exist. Test rigorously, and test often. And when possible, involve your cloud services provider directly in your efforts.
About the Author
Trevor Pott is a full-time nerd from Edmonton, Alberta, Canada. He splits his time between systems administration, technology writing, and consulting. As a consultant he helps Silicon Valley startups better understand systems administrators and how to sell to them.