How Best To Protect and Access Your Company's Data
As much as preserving our data matters, access to that data is equally important.
In 2018, you have to work to find an organization that isn't utterly dependent on computers. Data has become as important as any tangible asset. As much as preserving our data matters, access to that data is equally important. Uptime matters.
One of the most important assets any organization has is its data. Unfortunately, as we are all aware, many organizations ignore data protection until it's too late. Ransomware became such an epidemic precisely because organizations habitually neglect basic backups.
In 2016, Dell EMC ran a survey on data protection. It reported that the average cost of data loss was $900,000 USD, and the average cost of downtime was $555,000 USD. As the cost of both data loss and downtime has only gone up throughout my lifetime, it's reasonable to assume that if you were to run the same survey today, those figures would be higher.
Unfortunately, backups are easy to see as a nice-to-have instead of a must-have, and this is exactly the approach both organizations and individuals seem to take by default. For this reason I've written a seemingly endless number of articles about backups. I'm sure I'll write many more.
Yet backups aren't the only aspect of data protection that organizations ignore. Storage availability is all too frequently neglected, often with punitively expensive results.
That Speed of Light Thing
Backups exist to protect against data loss. This loss can be because data is deleted or overwritten, or because the storage device upon which it rests has been destroyed. Because backups primarily exist to protect against statistically unlikely events, restoring from backups is rarely expedient.
In addition, there often exists a gap in backup data called the recovery point objective (RPO). RPO is a measure of the time between backups. More important, RPO is a measure of the maximum amount of data that might be lost if you have to roll back to the last backup.
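To put some very rough numbers on that, here's a back-of-the-envelope sketch in Python. Both figures are invented for illustration; plug in your own write rate and backup schedule:

```python
# Back-of-the-envelope only: how much data sits at risk for a given RPO.
# Both numbers below are invented for illustration.
transactions_per_second = 250      # hypothetical average write rate
rpo_hours = 24                     # e.g. a nightly backup schedule

transactions_at_risk = transactions_per_second * rpo_hours * 3600
print(f"Rolling back to the last backup could cost up to "
      f"{transactions_at_risk:,} transactions.")
```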
Data is lost when rolling back to a backup because it takes time for data to travel between the production system and the backup location. Even if you're using continuous data protection (CDP), which is essentially real-time streaming of writes from the production location to the backup location, the speed of light says that there will always be a delay in getting data from A to B.
That delay represents the theoretical minimum RPO of any backup solution. In many cases the delay is only a few milliseconds, but it only takes a few milliseconds to store dozens or even hundreds of transactions on a Web site, or a critical update to a human resources record.
Within a single datacenter, the distance that data has to travel between any two systems is so short that the travel time is less than the time the storage devices themselves take to complete a transaction. In many cases, real-time replication of data is possible between two systems within the same city, with 100km generally accepted as the maximum distance.
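If you want a feel for the scale involved, a quick calculation does the job. This assumes light moves through fiber at roughly two-thirds of its speed in a vacuum, and ignores switching and protocol overhead entirely:

```python
# Rough one-way propagation delay through fiber, ignoring switching and
# protocol overhead; light in fiber moves at roughly two-thirds of c.
SPEED_IN_FIBER_KM_PER_S = 200_000

def one_way_delay_ms(distance_km: float) -> float:
    return distance_km / SPEED_IN_FIBER_KM_PER_S * 1000

for km in (1, 100, 4000):
    print(f"{km:>5} km: ~{one_way_delay_ms(km):.2f} ms one way")
# 100 km works out to roughly 0.5 ms one way (about 1 ms round trip),
# which is why it's commonly treated as the practical limit for
# synchronous replication.
```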
100km isn't very far, in disaster avoidance terms. A power failure, hurricane, earthquake or other event could easily affect both locations. For this reason, backups are typically stored at much greater distances from the production system, even though this means that rolling back to the latest backup could mean data loss: data is sacrosanct, and it must be protected against even statistically unlikely events.
High Availability vs. Backups
Backups are incredibly useful for recovering deleted files, but -- because of that RPO gap -- far less so for running workloads. That being said, the equipment on which running workloads operate can, does and will eventually fail. Nothing lasts forever.
Because the cost of downtime is so high, investment in technologies that allow for real-time replication of data between two or more storage devices with no data loss (an RPO of zero) is usually called for. This is high availability (HA). As already discussed, the distance between storage devices in an HA solution is largely dictated by the speed of light.
For many organizations, HA consists of two storage devices located physically one on top of the other in the same rack within a datacenter. For some organizations, HA will be performed between storage devices located at two different datacenters that are situated on a Metropolitan Area Network (MAN). In the finance industry in particular, it is common to have two datacenters on a MAN providing HA, and a third datacenter in another geographic location as the disaster recovery site.
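Sketched out in code, with entirely hypothetical site names and distances, that three-site arrangement looks something like this:

```python
# A hypothetical three-site layout: synchronous HA across a MAN pair,
# asynchronous replication to a distant disaster recovery site.
replication_targets = [
    {"site": "datacenter-a", "distance_km": 5,    "mode": "synchronous"},   # primary
    {"site": "datacenter-b", "distance_km": 40,   "mode": "synchronous"},   # MAN partner
    {"site": "dr-site",      "distance_km": 1500, "mode": "asynchronous"},  # DR site
]

for target in replication_targets:
    rpo = "zero" if target["mode"] == "synchronous" else "non-zero"
    print(f"{target['site']:>13}: {target['mode']} replication, RPO {rpo}")
```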
Data Migration Affects Uptime
It's easy to understand HA in terms of device failure. Bobby Breakit trips over his untied shoelaces, takes a header into the storage rack, and makes the primary storage array go boom. Bad Bobby, do not break it.
There are, however, a number of other perfectly routine and innocuous reasons to invest in HA: updates, for example. Everything in IT needs to be updated eventually. Many HA storage solutions can be set up to seamlessly switch between physical devices, allowing administrators to apply updates to the secondary device while the primary continues to serve workloads. Fail over to the freshly updated device, and you can apply updates to the other one, too.
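The pattern looks roughly like this; the function names here are hypothetical stand-ins, not any particular vendor's tooling:

```python
# Illustrative only: the rolling-update pattern an HA storage pair makes
# possible. The function names are hypothetical, not any vendor's actual API.

def fail_over(from_device, to_device):
    print(f"Workloads now served from {to_device} (was {from_device})")

def apply_update(device):
    print(f"Updating {device} while it's idle")

def rolling_update(primary, secondary):
    apply_update(secondary)            # patch the standby first
    fail_over(primary, secondary)      # hand workloads to the updated device
    apply_update(primary)              # patch the former primary while it's idle
    fail_over(secondary, primary)      # fail back once both are updated

rolling_update("array-1", "array-2")
```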
When the primary storage has a little lie down -- regardless of the cause -- avoiding the costs of downtime requires that there be a secondary storage array to take over. In a perfect world, the handoff between the two is seamless, and running workloads never notice that their storage is now being served from a different device.
Not all storage solutions are capable of this kind of HA. Some solutions can replicate data between two devices in real time. However, failover between the two devices is not seamless, and workloads will crash if a failover event is forced. These less advanced HA solutions don't lose any of the data written to the storage devices, but require workloads to be restarted if a failover is forced, which causes at least some downtime.
This latter failover scenario is quite common when data is being moved between two dissimilar storage devices. Data migration activities, whether driven by datacenter upgrades or by space-saving and archival efforts, are a common cause of outages. These outages are often a "hidden cost" of storage upgrades and maintenance: in addition to the cost of the equipment, you must factor in the time needed to switch over to new devices, or to repoint workloads at data that's been moved.
Avoiding Downtime
While all of this is still true for traditional storage arrays, storage technology has come a long way in the past decade. Data fabrics -- often nebulously described under the nearly useless header of "software-defined storage" -- remove the need for downtime, even when adding or removing storage devices from an organization's datacenter.
The short version of how data fabrics work is as follows: a highly available cluster of servers acts as a data presentation layer. In turn, this data presentation layer controls any and all storage that it's fed.
The data presentation layer presents storage to running workloads. Because this presentation layer is itself an HA cluster, the presentation layer can survive the loss of a node that's performing data presentation activities.
The data in a data fabric is spread across all of the storage devices made available to the fabric. This storage could be traditional storage arrays that have offered some or all of their capacity to the data fabric, or whitebox servers full of disks.
The storage could also be drives attached to the nodes hosting the data presentation software itself. When a data fabric is designed in this manner, storage industry nerds call it "scale out storage." When a data fabric not only puts storage drives in the data presentation nodes, but also allows administrators to run production workloads on those same nodes, it's generally referred to as hyper-converged infrastructure (HCI).
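Stripped of all the hard parts -- redundancy, rebalancing, failure handling -- the core idea can be sketched in a few lines. This is a conceptual toy, not how any shipping product actually works:

```python
# Deliberately simplified: a data presentation layer pooling whatever storage
# it's fed. Real data fabrics layer redundancy, rebalancing and failure
# handling on top of this basic shape.

class DataFabric:
    def __init__(self):
        self.devices = {}  # device name -> {block_id: data}

    def add_device(self, name):
        """Feed the fabric more capacity: an array, a whitebox server, a local drive."""
        self.devices[name] = {}

    def write(self, block_id, data):
        # Spread blocks across devices; workloads never see which device holds what.
        names = sorted(self.devices)
        target = names[hash(block_id) % len(names)]
        self.devices[target][block_id] = data

    def read(self, block_id):
        for blocks in self.devices.values():
            if block_id in blocks:
                return blocks[block_id]
        raise KeyError(block_id)

fabric = DataFabric()
fabric.add_device("legacy-array")
fabric.add_device("whitebox-01")
fabric.write("vm-42/disk0/block-7", b"payload")
print(fabric.read("vm-42/disk0/block-7"))
```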
Backups are a good and necessary thing, but because of that RPO gap, most organizations prefer to avoid rolling back to them if at all possible. Avoiding that requires HA, and data fabrics are the best solution currently available for providing it.
The true beauty of data fabrics is that you don't have to throw away your existing investment in storage hardware to take advantage of them. There are numerous solutions available that can marry your traditional arrays with whitebox servers, and even blend this with HCI. When your next storage refresh crops up, talk to your vendors about data fabrics: They could help avoid costly downtime, and they're quickly becoming the mainstream storage solution of choice.
About the Author
Trevor Pott is a full-time nerd from Edmonton, Alberta, Canada. He splits his time between systems administration, technology writing, and consulting. As a consultant he helps Silicon Valley startups better understand systems administrators and how to sell to them.