
Cloud-Based Disaster Recovery: What You Need to Know

Using a major public cloud provider or a services provider's cloud for disaster recovery (DR) requires a lot of thought. Trevor Pott runs down what you need to consider.

Using a major public cloud provider or a services provider's cloud for disaster recovery (DR) requires a lot of thought. There are very good reasons to do so, but the practice comes with a lot of gotchas. What an organization is trying to accomplish and what kind of Internet connectivity is available ultimately determine whether this is a feasible approach.

As always, when speaking of DR, the recovery point objective (RPO) and recovery time objective (RTO) are at the center of the conversation. RPOs are all about how much data you can afford to lose and RTOs are all about how quickly you can get back up and running.
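To put rough numbers on the RPO side, here's a back-of-the-envelope calculation in Python; the backup interval and write rate are purely illustrative:

```python
# Worst-case data loss is bounded by the backup interval: with backups
# every 15 minutes, up to 15 minutes of writes can vanish in a failover.
backup_interval_minutes = 15   # hypothetical schedule
writes_per_minute = 40         # hypothetical workload rate

worst_case_lost_writes = backup_interval_minutes * writes_per_minute
print(f"RPO of {backup_interval_minutes} minutes: up to "
      f"{worst_case_lost_writes} writes at risk.")
```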

The main selling point of using cloud technology for DR is that it lowers the RTO to as close to zero as is reasonably possible. You send your data to the cloud and, should things go sideways, those workloads can be lit up using that cloud's infrastructure in a matter of seconds.

If the backup and DR software in question contains orchestration capabilities, then this can be done in an automated fashion before the administrator even notices an outage has occurred. Whether this is a good plan depends on the workload. Backup consistency and method determine how useful cloud DR actually is in practice.
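As a sketch of what such an orchestration loop might look like, the following Python probes the production site and triggers failover after several consecutive misses. The host name, probe interval and threshold are all assumptions here, and fail_over_to_cloud() is a placeholder for whatever API a given DR product actually exposes:

```python
import socket
import time

def site_is_healthy(host: str = "prod.example.com", port: int = 443) -> bool:
    """Probe the production site with a plain TCP connect -- a stand-in
    for whatever health check a real DR product performs."""
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:
        return False

def fail_over_to_cloud() -> None:
    """Placeholder: a real tool would boot the replicated workloads
    from the latest copies held in the cloud."""
    print("Failover triggered: lighting up cloud copies of the workloads.")

FAILURES_BEFORE_FAILOVER = 3   # demand consecutive misses to avoid flapping
misses = 0
while misses < FAILURES_BEFORE_FAILOVER:
    misses = 0 if site_is_healthy() else misses + 1
    time.sleep(10)             # probe every 10 seconds
fail_over_to_cloud()
```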

Backup Consistency
One important concept to understand is the difference between crash-consistent and application-consistent backups. A crash-consistent backup (for example, backups based on snapshots) is a copy of storage at the exact moment the backup was taken.

Crash-consistent backups should be equivalent to what storage would look like if someone had powered off the storage device at the exact time the backup was triggered. Depending on the configuration of the operating system environment (OSE), and where in the storage stack the backups are taken, crash-consistent backups can be ineffective.

For most situations, crash-consistent backups are perfectly adequate, but they do rely on applications to save data regularly. If you make a crash-consistent backup of a file server, then any open documents will only be backed up to the point of their last save.

Similarly, OSEs or other elements in the storage layer that use write caching can cause problems. Writes that are cached but not committed to storage are lost. Databases in particular do not like it when you simply power off the server on which they live.
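The gap between "the application wrote it" and "it's actually on disk" is easy to demonstrate. In this Python fragment (the file name is arbitrary), a crash-consistent copy taken right after the write could be missing that data; only after the flush and fsync is it guaranteed to be on disk:

```python
import os

# Data written through a buffered file handle may sit in caches; a
# crash-consistent snapshot taken at that instant would not contain it.
with open("sales.log", "a") as f:
    f.write("sale #1042\n")   # buffered: at risk in a crash-consistent copy
    f.flush()                 # push Python's buffer down to the OS
    os.fsync(f.fileno())      # force the OS to commit the write to disk
```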

Application-consistent backups, on the other hand, involve at a bare minimum the OSE in the backup process. Volume Shadow Copy Service (VSS) and related technologies require quiescing (also called "stunning") the OSE. Quiescence temporarily "pauses" all new writes in that OSE and flushes the write caches. This means that every write the OSE has acknowledged as having occurred has actually been committed to storage. A snapshot is then taken, which can be shipped up to the cloud as a backup.
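In outline, the sequence looks something like the sketch below. The function bodies are print placeholders standing in for the real freeze/thaw machinery (VSS on Windows); the ordering of the steps is the point:

```python
def quiesce_ose() -> None:
    """Pause new writes and flush write caches (VSS 'freeze' phase)."""
    print("OSE quiesced: caches flushed, new writes held.")

def take_snapshot() -> str:
    """Capture a point-in-time copy while the OSE is paused."""
    print("Snapshot taken.")
    return "snapshot-2024-01-01T00:00:00"   # hypothetical identifier

def resume_ose() -> None:
    """Release the held writes (VSS 'thaw' phase)."""
    print("OSE resumed.")

def ship_to_cloud(snapshot_id: str) -> None:
    """Upload the snapshot to the cloud DR target."""
    print(f"Shipped {snapshot_id} to cloud storage.")

quiesce_ose()
snap = take_snapshot()
resume_ose()          # hold the pause only as long as strictly necessary
ship_to_cloud(snap)   # the upload can happen after writes have resumed
```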

Application-consistent backups usually rely on agents inside the OSE -- in Windows this is done via VSS -- and may even communicate with the application in question. Application-consistent backups are typically viewed as the bare minimum requirement for database workloads; many databases are VSS-aware and capable of participating fully in the backup process.

Automation Isn't Always Good
The push toward the automation of DR makes both consistency and RPO discussions even more important. Consider a company's point-of-sale (POS) database, for example. At any moment a sale could be logged into that database. An RPO of even five seconds means that failing over to the copy of the data on the DR site risks having sales go missing.

Clearly, failing over a POS database in an automated fashion can be disastrous. It's for this reason that many organizations seeking DR solutions for their databases use multi-site replication capabilities native to the database software itself.

A copy of the database software is kept online at the DR site at all times, and the production database sends every change to the backup site as those changes occur. This is ultimately the only real way to provide complete protection for databases, and is a form of continuous data protection (CDP). In layman's terms, CDP means "every single change is backed up."
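Stripped to its essence, CDP means the DR copy receives every change as it happens. The toy Python class below makes that idea concrete; real database replication works at the log or block level, not by doubling up application writes like this:

```python
class ReplicatedStore:
    """Toy model of synchronous multi-site replication."""

    def __init__(self) -> None:
        self.primary: dict = {}
        self.dr_copy: dict = {}

    def write(self, key: str, value) -> None:
        self.primary[key] = value
        self.dr_copy[key] = value   # change ships to the DR site as it occurs

store = ReplicatedStore()
store.write("sale:1042", {"amount": 19.99})
assert store.dr_copy == store.primary   # RPO of zero: nothing to lose
```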

Standing up and maintaining a second database instance can be costly. An organization must license the software and pay for the virtual machine or bare metal upon which to run the database. Many organizations don't feel they can afford this.

In that situation, the DR approach used is to alert administrators of a failover event, typically forcing them to manually acknowledge it before lighting up the DR database instance. In doing so, they acknowledge that the DR copy of the database will become the new "master," with all future writes going to that database. Any data written to the primary database after the last backup was taken is lost, and cannot easily be re-integrated.
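A minimal sketch of that manual gate might look like the following; the timestamp is invented, and promote_dr_database() stands in for the real promotion step:

```python
def promote_dr_database() -> None:
    """Placeholder for promoting the DR copy to read-write 'master'."""
    print("DR copy promoted; all future writes land here.")

last_backup = "2024-06-01 03:00"   # hypothetical last-backup timestamp
print(f"Primary database unreachable. Last backup: {last_backup}.")
print("Failing over will discard anything written since that backup.")

answer = input("Promote the DR copy and accept the loss? [yes/no] ")
if answer.strip().lower() == "yes":
    promote_dr_database()
else:
    print("Failover aborted; waiting on the operator.")
```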

Here, planning makes a big difference. Again, considering a POS database, the information being fed into that database comes from one of two places: manual entry, or another application that feeds the database. If the administrator knows the time of the last backup and when the DR failover occurred, those entries can be "replayed" to re-enter the data.

Replaying entries may consist of manually entering information from the paper copies of sales kept for accounting purposes, or it could involve re-importing data from the application that normally feeds the database. That application may store its data in a database of its own, or even in a series of individual files for each sale. Planning how to deal with database issues is the most difficult part of DR.
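If the feeder application keeps a file per sale, the replay itself can be scripted. The sketch below assumes a hypothetical sales_feed directory of JSON files with id and timestamp fields; none of these names come from any real product:

```python
import json
from datetime import datetime
from pathlib import Path

LAST_BACKUP = datetime(2024, 6, 1, 3, 0)   # hypothetical last-backup time

def reinsert(sale: dict) -> None:
    """Placeholder for the INSERT into the newly promoted DR database."""
    print(f"Replayed sale {sale['id']} from {sale['timestamp']}")

# One JSON file per sale, written by the application that feeds the POS
# database; anything newer than the last backup gets re-imported.
for path in sorted(Path("sales_feed").glob("*.json")):
    sale = json.loads(path.read_text())
    if datetime.fromisoformat(sale["timestamp"]) > LAST_BACKUP:
        reinsert(sale)
```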

Not a Database
From a DR standpoint, there are three major workload types: unstructured (file) storage, structured (database) storage and everything else. In most environments, databases are the workloads that perform the most writes. They're going to consume the most bandwidth to back up, require the most special care and attention, and consume the majority of the planning time. As with databases, it's really useful if some form of CDP can be applied to file storage; backup and DR solutions that can handle this make everyone's life easier. File storage is bursty -- files tend to be changed in clustered events throughout the day -- but rarely as intensive as databases.
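As a crude illustration of file-level CDP, the following Python polls a share and copies anything that has changed to a replica directory. Real products hook filesystem change events rather than polling, and both paths here are hypothetical:

```python
import shutil
import time
from pathlib import Path

# Both paths are invented stand-ins for a file share and its DR replica.
SOURCE = Path("file_share")
REPLICA = Path("dr_replica")
seen: dict[Path, float] = {}   # last-known modification time per file

while True:
    for f in SOURCE.rglob("*"):
        if f.is_file() and seen.get(f) != f.stat().st_mtime:
            target = REPLICA / f.relative_to(SOURCE)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)        # replicate the changed file
            seen[f] = f.stat().st_mtime
    time.sleep(5)   # changes are bursty, so a short poll interval suffices
```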

"Everything else" is comparatively straightforward. Everything else includes workloads that process data. Consider a virtual machine that hosts an application that doesn't frequently write or change data locally. It might read data from an input hotfolder on a file server, crunch on that data a bit, write some metadata to a database located on the database server, and then output the results as files to another share on the file server.

These workloads only experience changes when the configuration of the application is changed, or patches are applied. These workloads fail over to DR in the cloud readily and easily. They're excellent candidates for CDP-based backup applications because past the initial replication of the data, there's very little change.

Practical DR Considerations
If you're wondering why an article about the gotchas of cloud-based DR didn't talk much about the cloud, you didn't miss anything. In the real world the gotchas of cloud-based DR are the same as the gotchas that surround non-cloud DR.

The real differences that the cloud injects into the discussion are bandwidth, cost and assumptions. I've already discussed the very low RTO that cloud technology enables. RTOs that low fundamentally change the conversation about RPOs, and that's the biggest gotcha about the cloud.

With that in mind, discussions about bandwidth come into play. How frequently do you want to back up which data, and what bandwidth will you require to do so? This dovetails nicely with the cost discussion.
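The bandwidth math itself is simple. With made-up numbers -- 50GB of daily churn and an eight-hour overnight window -- the required sustained upload works out to roughly 14Mbps:

```python
daily_change_gb = 50        # hypothetical data churn per day
backup_window_hours = 8     # overnight window in which to ship it

# GB -> gigabits -> megabits, spread across the window in seconds.
required_mbps = (daily_change_gb * 8 * 1000) / (backup_window_hours * 3600)
print(f"Sustained upload needed: {required_mbps:.1f} Mbit/s")
```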

Renting space on the cloud means you don't have to establish and maintain a secondary site. That can save rather a lot of money, but these savings must be compared against the cost of bandwidth.

Whether the cloud is your DR target or not, designing a DR solution means starting with the desired outcome and working backward. Everyone wants perfect application-consistent backups with an RPO of zero and an RTO of zero, but in the real world, few can afford it. So, what compromises can you make, and do all stakeholders agree?

Once you know the answer to that, the rest is comparatively easy. And the subject of future articles.

About the Author

Trevor Pott is a full-time nerd from Edmonton, Alberta, Canada. He splits his time between systems administration, technology writing, and consulting. As a consultant he helps Silicon Valley startups better understand systems administrators and how to sell to them.
