In-Depth
Backup vs. Disaster Recovery Best Practices
Not all backup and DR solutions support all methods, and not all methods are appropriate for all workloads. Choose wisely.
Getting backups right can be difficult. Making disaster recovery (DR) work is even more difficult. Throw cloud technology into the mix and you potentially add layers of complexity with networking, virtual machine format considerations and more. Somewhere at the center of this discussion there have to be some best practices, so what might those be?
Backups come in many flavors, ranging from ten-cent ramen noodles to the most exquisite of gourmet meals. I've been working in IT for more than 20 years, and I'm absolutely certain that I couldn't name all the possible ways that exist to back up data or workloads, let alone discuss all the possible methods of DR.
The trite -- but nonetheless accurate -- response to almost any question about what you should do in IT is: "It depends." The only universal bit of IT advice that applies to everyone, everywhere is: "Do a proper needs assessment." Everything else really depends on what you're trying to accomplish, and why.
For this reason, IT practitioners often stress education. Doing so gets us off the hook when people come asking, because what they really want is a free IT consultation complete with needs assessment. The physical manifestation of this is the ubiquitous, "No, I will not fix your computer" T-shirt.
In between the two extremes of certainty and apathy lies the middle of the bell curve. There are backup and DR problems that seem to recur in almost every instance. Given the importance of backups and DR, however, treat the following as guideline considerations, not gospel. Nothing beats a sober assessment from a seasoned expert taking the time to analyze your unique requirements.
CDP vs. Snapshots
Perhaps the most important decision in the backup and DR space is whether Continuous Data Protection (CDP) or snapshots are the better path for your application. CDP can get very close to a recovery point objective (RPO) of zero, meaning that a failover event causes as close to no data loss as is possible.
The flip side is that, at present, we can't do application-consistent CDP. Because CDP is only crash-consistent, any backup it produces risks losing data that was in the operating system environment (OSE) write cache, as well as data in any other write caches in the storage path.
In other words, at best, CDP is the equivalent of someone cutting the power to the server on which that application was running. That's fine for some applications, not so great for others.
Snapshots, on the other hand, can leverage OSE features like the Windows Volume Shadow Copy Service (VSS) to make application-consistent backups. These backups are less likely to suffer data loss due to write cache issues, and can directly involve the application in the backup process. Because snapshots are point-in-time events, however, they will always face an RPO gap.
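To make the trade-off concrete, here's a toy Python model -- illustrative numbers only, not any vendor's implementation -- comparing the worst-case data loss of each approach: CDP's exposure is whatever sat in the write caches, while a snapshot's exposure is everything written since the last point-in-time copy.

```python
# Toy model of the RPO trade-off between CDP and snapshots.
# All numbers are illustrative, not vendor specifications.

def worst_case_loss_cdp(write_cache_mb: float) -> float:
    """CDP is crash-consistent: worst case, you lose whatever was still
    sitting in the OSE and storage-path write caches."""
    return write_cache_mb

def worst_case_loss_snapshot(write_rate_mb_per_hr: float,
                             snapshot_interval_hr: float) -> float:
    """Snapshots are application-consistent but point-in-time: worst case,
    you lose everything written since the last snapshot."""
    return write_rate_mb_per_hr * snapshot_interval_hr

print("CDP worst case:      %.0f MB" % worst_case_loss_cdp(write_cache_mb=64))
print("Snapshot worst case: %.0f MB" %
      worst_case_loss_snapshot(write_rate_mb_per_hr=500, snapshot_interval_hr=4))
```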
Further complicating matters is that WAN/Internet connectivity isn't always reliable. While this isn't a huge problem for snapshots -- the snapshot will be taken at the appropriate point in time and streamed when the link comes back up -- this is a problem for CDP. CDP is designed to "stream" writes to the destination, and doesn't like it when the link over which it's streaming goes down.
Modern CDP solutions address this with checkpoints. A checkpoint-capable solution detects a link outage or bandwidth constraint event and uses write coalescing to ride it out: writes are buffered until the link comes back up, which makes the solution behave much like a snapshot-based one in the interim.
In a checkpoint scenario, if a block queued for the DR destination is overwritten between the time the link goes down and the time it comes back up, the CDP solution will most likely send only the net resultant change. This lets the CDP solution make the best possible use of its buffer capacity, and prevents the need for a full resync when the link comes back up, even after an extended outage.
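A minimal sketch of what that coalescing buffer might look like, assuming a simple "latest write wins" table keyed by block number (real CDP engines are considerably more sophisticated):

```python
# Minimal sketch of checkpoint-style write coalescing during a WAN outage.
# It only illustrates why the buffer holds the net resultant change
# rather than every individual write.

class CoalescingBuffer:
    def __init__(self):
        self._pending = {}  # block number -> latest data for that block

    def record_write(self, block: int, data: bytes) -> None:
        # Overwrites of the same block while the link is down simply
        # replace the earlier pending copy -- only the net change is kept.
        self._pending[block] = data

    def drain(self) -> dict:
        """When the link comes back up, send one copy of each dirty block."""
        pending, self._pending = self._pending, {}
        return pending

buf = CoalescingBuffer()
for i in range(10_000):                         # 10,000 writes to one block...
    buf.record_write(block=42, data=b"version %d" % i)
print(len(buf.drain()))                         # ...drains to a single block
```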
Having a backup and DR plan meet your expectations centers on understanding the differences between these methods of backup, and picking the right one for your application. Not all backup and DR solutions support all methods, and not all methods are appropriate for all workloads. Choose wisely.
Data Churn
Backups aren't worth anything if they can't get from A to B in the required time frame. The primary consideration in this regard is data churn. Data churn is the measure of change in a workload's data over a given period of time.
From a backup application's perspective there are two types of data churn: gross data churn and net data churn. Gross data churn is every write that a workload makes. Net data churn is the result of those writes over a period of time.
If, for example, a workload regularly overwrites the same 1,000 blocks of data, then the difference in gross and net data churn can be quite significant. A CDP solution would stream every single one of those writes to the backup destination as they occurred.
A snapshot-based solution, on the other hand, takes point-in-time backups. So a snapshot will know that those 1,000 blocks have all changed since the last snapshot, taken X hours ago, and will send the most up-to-date version of those blocks to the backup destination as they exist at the time of the snapshot.
In this case, if you don't mind the potential data loss of X hours, snapshots save a great deal of bandwidth over CDP by sending one copy of those thousand blocks per snapshot interval instead of an unknown, but potentially enormous, number of copies of those same thousand blocks over and over.
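A quick, hypothetical write log makes the gross-vs.-net distinction obvious; the block counts and sizes below are invented for illustration:

```python
# Gross churn counts every write; net churn counts each changed block once.
# The write log below is purely illustrative: 100,000 writes that keep
# overwriting the same 1,000 blocks of 4 KiB each.

write_log = [(i % 1000, 4096) for i in range(100_000)]

gross_churn = sum(size for _, size in write_log)            # what CDP streams
net_churn = len({block for block, _ in write_log}) * 4096   # what a snapshot sends

print(f"Gross churn: {gross_churn / 1024**2:.0f} MiB")      # ~391 MiB
print(f"Net churn:   {net_churn / 1024**2:.0f} MiB")        # ~4 MiB
```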
Bandwidth Management
Understanding your workload matters. High-churn, high-I/O workloads may produce more writes than the organization's WAN/Internet connection has upload bandwidth to carry.
Backups are always a balance of compromises. Everyone wants an RPO of zero, but not everyone can afford it. This makes understanding the bandwidth impact of backups absolutely critical.
At a bare minimum, ensuring that the backup solution uses compression is critical. Deduplication and WAN acceleration are also very good things, if they can be had. For those who feel they have bandwidth constraints that can't be overcome, investing in cloud backup gateways or peak-hours throttling can help alleviate some pressure by shifting backup stress to off hours.
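For a sanity check, the arithmetic is simple enough to sketch out; all of the figures below are hypothetical and should be replaced with your own measurements:

```python
# Back-of-the-envelope bandwidth check; every figure here is hypothetical.

churn_gb_per_day = 200       # net churn the backup solution must move daily
compression_ratio = 0.5      # assume roughly 2:1 compression/dedupe savings
upload_mbps = 20             # upload bandwidth usable for backups

bytes_to_move = churn_gb_per_day * 1024**3 * compression_ratio
seconds_needed = bytes_to_move * 8 / (upload_mbps * 1_000_000)

print(f"Hours of upload needed per day: {seconds_needed / 3600:.1f}")
# If this exceeds the available off-hours window, consider throttling,
# a cloud backup gateway, or revisiting RPO expectations.
```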
Scratch Disks
Many workloads create a scratch file or page file. This is essentially a temporary storage location that can be used as an alternative to expensive RAM by workloads that operate on a great deal of data. In addition, OSEs tend to have a page file.
For many application-consistent backups, it's important to back up the scratch file or page file to ensure that data isn't lost. For others, this isn't relevant, as the file serves to speed things up, but doesn't contain unrecoverable data. Understanding this behavior can help save money.
Scratch and page files can be quite large, and are often high-churn, high-I/O files. One trick to save backup time and bandwidth is to place scratch and page files on separate scratch disks and then simply not back those disks up.
Where workloads need a complete copy of their scratch disk to be application-consistent, and where application consistency matters for that workload, this is a bad plan. Where a scratch disk can be reasonably jettisoned from the backup set, however, this is a worthwhile investment to make in one's backup plans.
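In practice this usually shows up as a per-disk include/exclude list in the backup job definition. The structure below is a hypothetical sketch, not any particular product's configuration format:

```python
# Hypothetical per-disk backup job definition illustrating scratch-disk
# exclusion. Real products express this differently.

backup_job = {
    "workload": "sql-reporting-01",
    "consistency": "application",              # VSS-quiesced
    "disks": [
        {"mount": "C:\\", "include": True},    # OS and application binaries
        {"mount": "D:\\", "include": True},    # databases
        {"mount": "S:\\", "include": False},   # scratch/page file disk, rebuilt on restore
    ],
}

included = [d["mount"] for d in backup_job["disks"] if d["include"]]
print("Backing up:", included)
```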
Automation and Testing
For small organizations, manually configuring backups on a per-workload basis may be adequate. For larger organizations, however, it's best to invest in a backup solution that allows the use of profiles.
By using profiles, backup administrators can ensure that newly created workloads are automatically assigned backups of the right consistency level, and that those backups undergo adequate testing once taken. Where infrastructure automation exists and can be tied into the backup solution, backup profiles can be married to workload templates, ensuring that workloads are never created without backups that take consistency into consideration.
Automation capabilities are important to look for in backup solutions because backups that aren't taken protect nothing. In addition to automatically assigning backup configurations to known types of workloads, automation can help with unknown workloads as well: it can ensure that uncategorized workloads get at least some default level of protection while they're being baselined and their backup needs fully assessed.
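A sketch of what profile-based assignment with a catch-all default might look like; the profile names and settings are invented for illustration:

```python
# Hypothetical profile assignment: known workload types get a tailored
# profile, anything unrecognized falls back to a default so it is never
# left unprotected while its needs are assessed.

BACKUP_PROFILES = {
    "database":   {"method": "snapshot", "consistency": "application", "interval_hr": 1},
    "fileserver": {"method": "snapshot", "consistency": "crash", "interval_hr": 4},
    "critical":   {"method": "cdp", "consistency": "crash", "interval_hr": 0},
}
DEFAULT_PROFILE = {"method": "snapshot", "consistency": "crash", "interval_hr": 24}

def assign_profile(workload_type: str) -> dict:
    return BACKUP_PROFILES.get(workload_type, DEFAULT_PROFILE)

print(assign_profile("database"))
print(assign_profile("mystery-appliance"))   # gets the default baseline protection
```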
Automation is also critical for testing backups. Backups from which you can't restore protect nothing. DR solutions with which you don't know how to engage offer no resiliency. Test everything, test it often and test it in an automated fashion where possible.
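An automated restore test can be as simple as restoring to an isolated location and verifying a known-good checksum. The skeleton below uses placeholder functions and checksums standing in for whatever your backup product's API or CLI actually exposes:

```python
# Skeleton of an automated restore test. restore_latest() and the expected
# checksums are placeholders for your actual backup tooling and data.

import hashlib
import pathlib

EXPECTED_SHA256 = {
    "invoices.db": "0" * 64,    # placeholder checksum of a known-good file
}

def restore_latest(workload: str, target: pathlib.Path) -> None:
    """Placeholder: call your backup product's API or CLI to restore the
    most recent recovery point for `workload` into an isolated `target`."""
    raise NotImplementedError

def verify_restore(target: pathlib.Path) -> bool:
    """Compare restored files against known-good checksums."""
    for name, expected in EXPECTED_SHA256.items():
        digest = hashlib.sha256((target / name).read_bytes()).hexdigest()
        if digest != expected:
            return False
    return True
```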
The Only Real Advice
At the end of the day, the only real backup and DR best practice advice that can be offered is to actually do them. If your data doesn't exist in at least two places, then it doesn't exist. If your workloads can't be run in at least two places, then your organization doesn't exist.
Computers are our organizations. They are our society. Care for them, and, in the long run, they'll care for you.
About the Author
Trevor Pott is a full-time nerd from Edmonton, Alberta, Canada. He splits his time between systems administration, technology writing, and consulting. As a consultant he helps Silicon Valley startups better understand systems administrators and how to sell to them.