Getting Started with Cloud Backups
Few companies can afford to maintain a second site for backup purposes, which means cloud backups are becoming increasingly popular. Here's what you need to know to get started.
If your data doesn't exist in at least two places, then it does not exist. Storage failure events happen to everyone, and the key to preventing them from becoming data loss events is performing backups. Few companies can afford to maintain a second site for backup purposes, which means cloud backups are becoming increasingly popular. So what do you need to know about cloud backups?
This article is the first in a series of articles focused on backups, disaster recovery and cloud computing. Everything you need to know about cloud backups simply won't fit into a single article. That said, if there is one message that I hope you retain from this entire series it is this: Despite popular belief, the word "cloud" doesn't mean you can disengage your brain.
Yes, the word cloud means that at least some part of the IT solution in question is a service that is managed by a services provider. No, that doesn't mean you can avoid planning, careful consideration of architecture or any of the other things that made backups annoying before the public cloud came along.
Systems administrators who have been around for a while will be familiar with two important backup acronyms: recovery point objective (RPO), and recovery time objective (RTO). As always, when talking about backups, these will be an important part of the discussion.
RPO is the maximum amount of data, measured in time, that you can afford to lose. Unless you're using true "RPO 0" continuous data protection (CDP), there will be a backup gap. If you run backups every day, then the gap between the moment your production data fails and the age of the newest data in your backup archive should be at most one day.
RTO is the measure of how quickly you must have data restored before bad things happen. If your organization is a bank, for example, having the transaction processing system down for even an hour is probably going to lead to a customer backlash. The stock images folder that marketing uses to put pictures in their PowerPoint presentations, however, can probably wait a little longer.
Each workload's data is likely to have different RPOs and RTOs. While this may seem an incredibly simplistic "backups 101" sort of thing to say, when talking about cloud backups the importance of knowing the RTO and RPO for each workload and dataset must be emphasized.
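The idea of tracking per-workload objectives can be sketched in a few lines of code. All of the workload names, figures and the helper function below are hypothetical illustrations, not recommendations:

```python
# A minimal sketch: record per-workload RPO targets (agreed with
# stakeholders) and check a proposed backup interval against them.
# Workload names and hour values are invented for illustration.

objectives_hours = {
    "transaction-db":   {"rpo": 1,   "rto": 1},
    "file-server":      {"rpo": 24,  "rto": 8},
    "marketing-images": {"rpo": 168, "rto": 72},
}

def meets_rpo(workload: str, backup_interval_hours: float) -> bool:
    """A backup interval longer than the RPO guarantees the RPO can be missed."""
    return backup_interval_hours <= objectives_hours[workload]["rpo"]

# Daily backups cannot satisfy a one-hour RPO:
print(meets_rpo("transaction-db", 24))  # False
print(meets_rpo("file-server", 24))     # True
```

The point of even a toy table like this is that the numbers come from a conversation with stakeholders, not from whatever interval the backup software defaults to.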
The Bandwidth Problem
Any backup design has to balance two conflicting priorities. The first priority is to avoid overcomplicating backups. Complicated backups lead to problems, and this is the one area of IT where you absolutely cannot afford to have problems.
The second priority regarding backups is the need to take the guesswork out of backups. There cannot be any guesswork in backups. They absolutely must work as expected every single time. No exceptions.
Unfortunately, the sure-fire way to solve the problem -- take full, complete backups of all the data regularly, and often -- doesn't work in the real world. Neither disk space nor bandwidth is unlimited. This means that being choosy about what to back up and when is a requirement.
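A back-of-the-envelope calculation shows why regular full backups collide with real-world bandwidth. The figures below are illustrative assumptions only:

```python
# Rough time to push a full backup over an Internet uplink,
# ignoring protocol overhead, throttling and retries.

def full_backup_hours(data_tb: float, uplink_mbps: float) -> float:
    """Hours needed to transmit a full copy of the data set."""
    data_bits = data_tb * 1e12 * 8            # terabytes -> bits
    seconds = data_bits / (uplink_mbps * 1e6) # bits / (bits per second)
    return seconds / 3600

# 10 TB of data over a 100 Mbit/s uplink:
print(round(full_backup_hours(10, 100), 1))  # ~222 hours
```

Roughly nine days to move one full copy over that link makes it obvious that nightly fulls are off the table, and that incrementals, deduplication or a bigger pipe are required.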
Being choosy about backups requires knowledge because you should back up only what is needed. Technologies such as incremental backups, deduplication and cloud backup gateways can help mitigate some of this. Properly configured, each can offer a means to send significantly less data from the production site to the backup location when compared to a full backup.
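The core idea behind an incremental backup is simply to transmit only what changed since the last run. Real products track changed blocks or deduplicate at a much finer granularity; this file-level sketch is illustrative only:

```python
# A minimal, file-level sketch of incremental selection: walk a tree
# and return only files modified after the last backup timestamp.
import os

def changed_since(root: str, last_backup_ts: float) -> list[str]:
    """Return paths of files modified after last_backup_ts (epoch seconds)."""
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > last_backup_ts:
                changed.append(path)
    return changed
```

Modification times are a crude change detector (they can be wrong or reset), which is one reason production tools prefer block-level change tracking or content hashing.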
But these technologies are no substitute for knowing your workloads and datasets, and then making rational choices about RTOs and RPOs. Furthermore, these decisions must be made with the involvement of all stakeholders. They cannot be the sole province of systems administrators.
If, for example, your database administrator is expecting that backups will be taken every hour, and backups are instead being taken every day, the likelihood of eventual conflict approaches unity.
Needs assessment is a fundamental part of the job of systems administrators. Those considering cloud backups must balance the need to keep backup designs simple against the need to have a well-informed, multi-stakeholder design, and fit the whole thing into available bandwidth while meeting RTO and RPO requirements.
An additional consideration to all of these requirements is regulatory compliance. For some reason, when organizations discuss regulatory compliance, backups are rarely considered. Despite this common oversight, data is data, regardless of where it's stored. If data falls under a regulatory umbrella as part of a production workload then it's regulated as part of a backup set, as well.
For those who have trouble explaining this to individuals higher up the food chain, sometimes a bit of popular culture helps. There are a number of science fiction movies and television shows where some pivotal plot point relies on someone deleting information from the main computer system, but then forgetting to delete that information from the backups.
This is an important concept because a regulatory requirement to defend against exactly this scenario is part of the next generation of data privacy regulations, not the least of which is the European Union's General Data Protection Regulation (GDPR). Contrary to popular belief (and a great deal of marketing), the GDPR is not about IT security. It's about the privacy rights of EU citizens.
Modern data privacy regulations hold organizations responsible for data losses regardless of whether the data in question was pilfered from production systems or from backups. Organizations are even held responsible if the breach that led to the data loss was the fault of a cloud provider that the organization chose to use. The organization that collects the data from the customer is responsible for the entire lifecycle of that data, including ensuring that all third parties that handle that data keep it safe.
In addition to simply keeping data safe, modern data privacy regulations enshrine the right of individuals to request that their data be deleted. This "right to be forgotten" requires that the data subject's data be deleted from backups as well as from production systems. Few applications or backup solutions are ready to accommodate this requirement, and it may affect the RPOs, and the number of backup copies, that organizations decide on.
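One commonly discussed way to reconcile immutable backup archives with the right to be forgotten is crypto-shredding: encrypt each data subject's records under a per-subject key, and delete that key to render every copy, including copies in old backups, unreadable. The toy sketch below uses XOR in place of real encryption purely to show the shape of the idea; it is not a secure implementation and the names in it are invented:

```python
# Toy crypto-shredding sketch: one key per data subject.
# XOR with a random pad stands in for real encryption here.
import os

subject_keys: dict[str, bytes] = {}  # subject id -> key material

def encrypt_for(subject: str, data: bytes) -> bytes:
    """Encrypt one record under the subject's key (created on first use)."""
    key = subject_keys.setdefault(subject, os.urandom(len(data)))
    return bytes(a ^ b for a, b in zip(data, key))

def forget(subject: str) -> None:
    """Deleting the key makes every stored ciphertext for this
    subject -- in production and in backups -- unrecoverable."""
    subject_keys.pop(subject, None)
```

The attraction of this approach is that backup sets never need to be rewritten; the cost is that key management itself becomes a critical system that must be backed up and governed with the same care.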
After decades of being told that all organizations should retain as much data as possible, the tide is turning. Bulk Data Computational Analysis (BDCA) tools acting on Big Data warehouses can indeed deliver insights to organizations. Combining the data from multiple organizations that form a single supply chain can feed machine learning tools that result in higher production yields, lower product returns and increased customer satisfaction.
The flip side of this is that holding data is a very real risk. If you don't store data, then you can't lose it. Data minimization is now a very serious consideration, one that applies as much to backups as it does to production systems.
Moving backups to the cloud takes some part of those backups outside of the organization's control, and it puts those backups on the other end of an expensive Internet connection. But using the major public clouds or the cloud offering of a regional services provider as your backup target does offer very real benefits. Not having to fret about the underlying infrastructure, or pay for an entire second site, saves time and money.
Before making the leap, however, it's worth the effort to ensure that as little data gets backed up as possible. To the extent that it's possible, avoid complicating backups, but invest the time required to take the guesswork out of backups. Bear in mind that the word cloud doesn't give you license to disengage your brain. And, of course, remember that if your data doesn't exist in at least two places, it simply does not exist.
Trevor Pott is a full-time nerd from Edmonton, Alberta, Canada. He splits his time between systems administration, technology writing, and consulting. As a consultant he helps Silicon Valley startups better understand systems administrators and how to sell to them.