How RPO and RTO Impact Business Continuity

Not all backups are created equal. While any backup is better than none, different backup approaches can have dramatically different effects on business continuity should disaster strike.

To understand what makes one backup approach different from the next, you must understand the concepts of recovery point objective (RPO) and recovery time objective (RTO). Both have been discussed at length numerous times, but here's a quick recap:

RPO represents the maximum data loss you can afford as a function of time. Most backups occur periodically, with each backup event being a recovery point. The RPO is a combination of the time between recovery points and the time it takes that data to be sent to the backup repository.

If you make a backup of a workload every 15 minutes, and you do so to a storage device that can accept this backup instantaneously, then the RPO of this backup solution is 15 minutes: In the worst-case scenario, rolling back to the last backup represents a loss of 15 minutes' worth of data.

If, however, you're sending the backup data off-site, this can get a lot more complicated. Assume for a moment that a backup is taken and sent to a Cloud Storage Gateway (CSG). This CSG acts as a buffer, which absorbs the backup and slowly unspools it to a cloud storage location. If this process takes the whole 15 minutes between backups, then the RPO is in fact 30 minutes: 15 minutes between backups, plus 15 minutes to ship the backups off-site.
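The arithmetic in the two examples above can be sketched in a few lines of Python. The function name and parameters here are illustrative assumptions, not part of any backup product; the sketch simply adds the backup interval to the time needed to land the backup in its repository.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval: timedelta,
                   transfer_time: timedelta = timedelta(0)) -> timedelta:
    """Worst-case data loss window (hypothetical helper).

    Time between recovery points plus the time it takes the backup
    to arrive at the backup repository.
    """
    return backup_interval + transfer_time


# Local repository that accepts the backup instantaneously.
local_rpo = worst_case_rpo(timedelta(minutes=15))

# Backup spooled off-site through a gateway, taking the full 15 minutes.
offsite_rpo = worst_case_rpo(timedelta(minutes=15), timedelta(minutes=15))

print(local_rpo)    # 0:15:00
print(offsite_rpo)  # 0:30:00
```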

RTO is the measure of how quickly workloads can be restored from a backup. If the backup can restore workloads and/or data instantly, then the RTO is zero seconds. Realistically, no RTO is zero seconds, but some backup designs can get RTOs down into the single-digit seconds, even for complicated workloads.

An RTO measured in single-digit seconds usually means that the backup storage is able to mount the data or workloads directly, without having to transfer the data to another storage device. Both on-premises and cloud-based backup solutions allow this, meaning that near-zero RTO is simply a matter of both design and money.

Not all organizations have the money to accomplish this, however, and some are restricted in the backup designs they can use by regulatory compliance concerns. In these cases, the time it takes to restore from backup becomes part of the RTO calculation.

Several factors contribute to the time it takes to restore from backup. In extreme cases, backups may need to be physically couriered from a backup vault. Where cloud storage is used, backups may need to be downloaded from the cloud before they can be restored. Even when backups are on-premises, there is often a delay imposed by copying data from the on-premises backup solution to the production environment.
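As a rough illustration of how transfer time feeds into RTO, here is a minimal sketch. It assumes the restore is dominated by moving data over a link of known sustained throughput, plus a fixed overhead (courier time, mounting, booting). The function name, parameters, and example figures are hypothetical, and real restores rarely achieve a link's full rated speed.

```python
def estimate_rto_seconds(data_gb: float,
                         link_mbps: float,
                         fixed_overhead_s: float = 0.0) -> float:
    """Estimate restore time in seconds (hypothetical helper).

    data_gb          -- data to copy back, in decimal gigabytes
    link_mbps        -- sustained throughput in megabits per second
    fixed_overhead_s -- time not tied to data volume (courier, mount, boot)
    """
    if link_mbps <= 0:
        raise ValueError("link_mbps must be positive")
    bits_to_move = data_gb * 8 * 1000**3
    return fixed_overhead_s + bits_to_move / (link_mbps * 1_000_000)


# 500 GB pulled from cloud storage over a 100 Mbps link.
cloud_restore = estimate_rto_seconds(500, 100)

# The same 500 GB copied from an on-premises repository at 1,000 Mbps.
local_restore = estimate_rto_seconds(500, 1000)

# Backup storage that mounts the data directly: nothing to copy,
# only a few seconds of overhead.
direct_mount = estimate_rto_seconds(0, 1000, fixed_overhead_s=5)

print(cloud_restore / 3600)  # hours
print(local_restore / 3600)  # hours
print(direct_mount)          # seconds
```

Under these assumptions the cloud restore takes roughly 11 hours, the local copy a little over an hour, and the direct mount only its fixed overhead, which is why designs that avoid the copy step can reach single-digit-second RTOs.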

Local Backup Copies
While this is a good introduction to the basics of RTO and RPO, it's also a gross oversimplification. The design of an organization's backup approach can affect both RPO and RTO in context-dependent ways.

Many organizations follow the 3-2-1 approach to backups: make three copies of your data, store that data on at least two different media, and keep at least one copy off-site. The most basic design that meets this requirement is to back up all of your data to a local backup repository, and have that backup repository spool data to a cloud storage location.
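The 3-2-1 rule is easy to express as a check. The sketch below is only an illustration: the `BackupCopy` type and its fields are assumptions made for this example, and it counts the production copy as one of the three copies, which is one common reading of the rule.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data (hypothetical model for illustration)."""
    name: str
    medium: str    # e.g. "disk", "tape", "object-storage"
    offsite: bool


def satisfies_3_2_1(copies: Iterable[BackupCopy]) -> bool:
    """True if there are >= 3 copies, on >= 2 media, with >= 1 off-site."""
    copies = list(copies)
    enough_copies = len(copies) >= 3
    enough_media = len({c.medium for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# The basic design described above: production data, a local backup
# repository, and a cloud storage location fed by that repository.
basic_design = [
    BackupCopy("production", medium="disk", offsite=False),
    BackupCopy("local-repository", medium="disk", offsite=False),
    BackupCopy("cloud-storage", medium="object-storage", offsite=True),
]

print(satisfies_3_2_1(basic_design))      # True
print(satisfies_3_2_1(basic_design[:2]))  # False
```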

For the overwhelming majority of backup scenarios, this means that RPO and RTO are relatively small: there are copies of data on the same site as the production workloads, so it shouldn't take long to run the backups, or to restore them. Most restore requests are made because of accidental deletions, or similar mundane errors. For these scenarios, local backup copies make abundant sense, and they minimize the effects of data protection events on business continuity.

The ultimate evolution of the local repository is the immutable snapshot. Instead of having to copy data off of a production storage solution in order to back it up, near-instantaneous snapshots are taken. Here, the time required to take a backup can be reduced to single-digit seconds. This allows backups to be taken more frequently, lowering RPO.

Because the snapshots live on the same infrastructure as the production workloads, RTOs are functionally instantaneous. Snapshots are a good thing, but they come with a caveat: If something happens to that production infrastructure, both the production copy and all the snapshots are affected at the same time.

To alleviate this, most storage solutions that make heavy use of snapshots also offer replication. Synchronous replication means that each write is committed to both the primary and the secondary storage device before it is acknowledged. Asynchronous replication means that there's a delay (often heavily influenced by distance and the speed of light) between writes occurring on the primary and secondary storage solutions.
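Because a synchronous write has to reach the remote site before it completes, distance sets a floor on its latency. The sketch below estimates that floor, assuming light travels at roughly 200,000 km/s in optical fiber (about two-thirds of its speed in a vacuum). The constant and function name are assumptions for illustration, and the result ignores switching, queuing, and storage latency, so real figures will be higher.

```python
FIBER_KM_PER_S = 200_000.0  # approximate speed of light in optical fiber


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip propagation delay, in milliseconds."""
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    return distance_km * 2 * 1000.0 / FIBER_KM_PER_S


# A secondary site 100 km away versus one 4,000 km away.
print(min_round_trip_ms(100))   # 1.0
print(min_round_trip_ms(4000))  # 40.0
```

A floor of about a millisecond per write is often tolerable for synchronous replication. Tens of milliseconds per write usually is not, which is one reason long-distance replication tends to be asynchronous.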

All of this is useful and good, but becomes further complicated when an organization's IT infrastructure spans multiple sites, not all of which are directly operated by the organization. Today, it's not uncommon to see organizations that operate multiple datacenters, use infrastructure from one or more service providers, and also make use of multiple public clouds.

Data Fabrics
Data fabrics are an emerging solution for both production and backup storage that solves the problems associated with storage that spans multiple sites and infrastructures. Data fabrics absorb all of an organization's storage into a single solution, and then distribute that storage throughout the fabric based upon rules and profiles applied to each different class or container of storage.

Data fabrics rely on administrators describing storage to the fabric. Administrators attaching storage need to ensure that the fabric understands failure domains. Fabrics may implement failure domains as zones or sites, and administrators need to be careful that all storage that can be impacted by a single disaster is appropriately grouped.

Once appropriately set up, fabrics can be a powerful storage solution with robust data protection. Administrators may, for example, tell the data fabric that they need a given storage LUN to exist in three copies throughout the fabric, and that regular snapshots be taken every 5 minutes.

The fabric would then ensure that the production site has a copy of the production data, and all relevant snapshots. In addition, two more copies would be distributed throughout the fabric, wherever storage capacity existed. Unless told to keep a second copy on the production site, the fabric would ensure that none of the additional data copies were kept on the same site as the production data.
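The placement behavior described above can be sketched as a small function. This is not how any particular data fabric is implemented. It assumes each site is its own failure domain, that a site holds at most one copy, and that the fabric prefers sites with the most free capacity. The function, parameter, and site names are all hypothetical.

```python
from typing import Dict, List


def place_copies(production_site: str,
                 free_gb_by_site: Dict[str, float],
                 size_gb: float,
                 copies: int = 3) -> List[str]:
    """Return the sites that should hold a copy (illustrative only).

    The production site always holds the first copy. Additional copies
    go to other sites with enough free capacity, never to the
    production site itself.
    """
    candidates = sorted(
        (site for site, free in free_gb_by_site.items()
         if site != production_site and free >= size_gb),
        key=lambda site: (-free_gb_by_site[site], site),
    )
    needed = copies - 1
    if len(candidates) < needed:
        raise ValueError("not enough sites with free capacity for policy")
    return [production_site] + candidates[:needed]


# Hypothetical fabric: one owned datacenter, a branch office,
# a service provider, and a public cloud region.
fabric = {"hq-dc": 500, "branch": 50, "provider-a": 2000, "cloud-west": 10000}

print(place_copies("hq-dc", fabric, size_gb=100))
# ['hq-dc', 'cloud-west', 'provider-a']
```

The branch office is skipped because it lacks capacity, and the production site is never chosen for an additional copy. A real fabric would also need an explicit override for keeping a second copy on the production site, which this sketch leaves out.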

Data fabrics move data around as needed to meet performance and resiliency requirements. They typically have a combination of high-performance/high-cost storage, and low-performance/low-cost storage. Cold data -- for example, snapshots that haven't been accessed in some time -- would be moved to low-cost storage (archival to the cloud, for example). Frequently accessed data would be retained on high-performance storage.
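A tiering rule of this kind can be as simple as comparing the last access time against a threshold. The sketch below assumes a 30-day cutoff and two tier labels. Both are illustrative choices, not values from any specific product.

```python
from datetime import datetime, timedelta


def choose_tier(last_accessed: datetime,
                now: datetime,
                cold_after: timedelta = timedelta(days=30)) -> str:
    """Pick a storage tier from how recently the data was accessed."""
    if now - last_accessed >= cold_after:
        return "low-cost"
    return "high-performance"


now = datetime(2024, 6, 1)

# A snapshot untouched for three months moves to low-cost storage.
print(choose_tier(datetime(2024, 3, 1), now))   # low-cost

# Data read yesterday stays on high-performance storage.
print(choose_tier(datetime(2024, 5, 31), now))  # high-performance
```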

Intent-Based Storage
Data fabrics represent a new approach not only to designing production storage, but to designing data protection solutions, as well. Instead of assigning storage to individual physical devices, or worrying about juggling the RTOs and RPOs of individual workloads, data fabrics provide an intent-based approach to storage.

The basic concerns of RPO and RTO do not go away with data fabrics. Instead, the fabrics abstract the details of these concerns behind an easy-to-use management interface and a set of algorithms designed to manage the common, basic storage tasks in an automated fashion.

Administrators define their intent in the form of storage policies and profiles, and the data fabric takes care of the rest. When more performance is needed, you simply add more performant storage to the fabric, and the fabric rebalances. When more capacity is needed, you simply add more capacity.

With data fabrics, the design of an organization's storage and backup solutions is constantly adapting to meet present needs. This allows administrators to focus on providing the hardware and bandwidth required to keep things functioning smoothly, instead of worrying about how to connect it all together.

About the Author

Trevor Pott is a full-time nerd from Edmonton, Alberta, Canada. He splits his time between systems administration, technology writing, and consulting. As a consultant he helps Silicon Valley startups better understand systems administrators and how to sell to them.

