In-Depth

Data Fabrics and Their Role in the Storage Hardware Refresh

Data fabrics allow organizations to go a step further and spread data across multiple arrays such that entire storage arrays become redundant.

How do you refresh storage? Or, more specifically, what happens to perfectly good storage when the support contracts are up? Should organizations extend the support contract at potentially punitive rates in order to save on capital expenses, or can the previous generation's storage see valid use even outside support? 

These questions have different answers depending on whom you ask. Vendor employees will generally provide a very earnest response that includes a number of reasons why nobody should ever run anything in IT without a support contract. There are also apparently very good reasons (good for whom is another question) why storage array support contracts need to get dramatically more expensive after the second or third year.

It's in the interests of the customer, you see, to be not-so-subtly nudged toward refreshing their storage every three years, even if that storage happens to be working just fine and meeting the organization's needs. While this approach may work in organizations where money is no object, throwing away perfectly serviceable storage and then spending hundreds of thousands or even millions of dollars on replacement equipment is a sore point for most other organizations.

Drive Locking
Regardless of the specific flavor of storage in use, there are always two components to any datacenter storage system. There's the bit that stores the data (drives), which can be either rotating magnetic media (spinning rust) or solid-state drives (SSDs). There's also the bit that decides which drives will hold what data, typically referred to as a controller.

The drives in an enterprise storage array are usually not very different from the drives in a consumer network-attached storage (NAS) device. Custom firmware is usually applied to the enterprise drives, but how much that firmware alters the performance or reliability of the drives is still, after decades, mostly unknown.

What the custom firmware does do is give storage vendors a way to prevent organizations from simply replacing the drives in their storage array with generic, off-the-shelf replacements. Many storage arrays -- especially the more expensive ones -- will reject drives that do not have the vendor's custom firmware, even if the drives purchased are "enterprise" or "datacenter" drives.
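A minimal sketch of the gatekeeping this enables is below. The model names, firmware strings and admission logic are invented for illustration; real arrays implement these checks inside the controller firmware itself, not in Python.

```python
# Hypothetical sketch of controller-side drive admission logic. The vendor
# firmware strings and the approved-model table are invented for illustration.

APPROVED_DRIVES = {
    # (model, firmware revision) pairs the vendor has qualified
    ("WDC-GOLD-12TB", "VNDR01.3"),
    ("SEAGATE-EXOS-12TB", "VNDR02.1"),
}

def admit_drive(model: str, firmware: str) -> bool:
    """Return True only if the drive reports a vendor-qualified model/firmware pair."""
    return (model, firmware) in APPROVED_DRIVES

# An off-the-shelf "datacenter" drive running stock firmware gets rejected,
# even though the hardware may be identical to the qualified part.
print(admit_drive("WDC-GOLD-12TB", "WD-STOCK-83.00A83"))  # False
print(admit_drive("WDC-GOLD-12TB", "VNDR01.3"))           # True
```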

With consumer storage devices, if the drives are starting to fail, or if free space is getting a little low, you can simply log on to Amazon and buy a few dozen replacement drives. The vendor of a consumer storage solution makes its money selling the controller, and it uses unmodified consumer hard drives. If consumer array vendors can do this, why can't enterprise storage vendors?

Storage Refresh Rationale
The rationale for the enterprise storage vendor approach to drive replacement is complicated. Vendors selling enterprise storage can't afford to ship arrays that aren't rock solid. Consumers and small businesses can tolerate failure rates and downtime that enterprises won't. This means, among many other things, that vendors don't want anyone using drives that haven't been fully tested against the controller's hardware and software.

This is complicated by the fact that consumer drives aren't homogeneous. A Western Digital Gold drive, for example, is considered an enterprise-class hard drive by Western Digital. If you want a top-notch hard drive to install in a server, or put in your consumer NAS, this is the drive for you. But Gold is a brand, not a homogeneous line of identical drives.

Even within a specific capacity, say 12TB, there are multiple production runs of the drives, each with its own quirks and foibles. In addition, Western Digital is constantly tweaking the design and firmware of the drives. The result is that over the full run of 12TB Western Digital Gold drives there may be dozens of different models that all share the same brand.

If consumer storage can cope with this, why can't enterprise storage? The answer is that enterprise storage absolutely can cope with this, but storage vendors don't want to deal with the hassle, and for good reason: Sometimes failure rates get ridiculous. The three big cases of high failure rates that immediately come to mind are (in reverse chronological order) the Seagate ST3000DM001 (dubbed the "Failcuda"), the OCZ debacle and the IBM Deskstar 75GXP (dubbed the "Deathstar"). The OCZ debacle is notable because it shows us that SSDs can be just as awful as spinning rust.

There are rumors of other drive models with failure rates that are worthy of inclusion in that list, but getting data on failure rates is hard. Storage vendors (with the notable exception of Backblaze) keep such data a closely guarded secret, and it usually takes a judge ordering disclosure to get access to it.

Because this data isn't shared throughout the industry, figuring out which drive models are likely to cause problems is next to impossible. What vendor wants to be on the hook for the reliability of a storage array running mission-critical workloads if the customer can start slotting in drives from a failure-prone model?

In addition to worries about drive failure rates, vendors worry about controllers. The physical hardware in most controllers is no different from what's in your average x86 server, and that means that most of them will last for at least a decade. But a controller designed to support 1TB through 6TB drives may not perform appropriately if you start cramming 12TB drives into it.

Storage arrays have RAM that they use for caching, as well as for executing storage controller functions such as deduplication, compression, encryption and so forth. In enterprise storage arrays some or all of this RAM may even be non-volatile RAM (NVDIMMs), which allows the controller to use write caching without having to worry about data loss due to power failures.

Installing larger drives than the array is designed for could mean that the controller no longer has enough RAM to do its job appropriately. In the case of controllers with NVDIMMs, part of the reason for three-year refresh cycles is that both the batteries and the flash that enable the NVDIMMs to be non-volatile have lifespans much shorter than the rest of the controller.
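To see why bigger drives can outrun a controller's RAM, consider a back-of-the-envelope sketch like the one below. The block size, bytes of metadata per block and controller RAM figures are assumptions chosen for illustration, not any vendor's actual numbers.

```python
# Rough estimate of how per-block metadata RAM scales with raw capacity.
# Block size and metadata-per-block are illustrative assumptions only.

BLOCK_SIZE = 64 * 1024   # 64 KiB average block (assumed)
METADATA_PER_BLOCK = 64  # bytes of cache/dedupe metadata per block (assumed)

def metadata_ram_gib(drive_tb: float, drive_count: int) -> float:
    """Approximate RAM (GiB) needed to hold metadata for the whole array."""
    raw_bytes = drive_tb * 1e12 * drive_count
    return raw_bytes / BLOCK_SIZE * METADATA_PER_BLOCK / 2**30

# The same 24-bay shelf, two drive sizes:
print(round(metadata_ram_gib(6, 24)))   # ~131 GiB: comfortable for a controller with 192 GiB of RAM
print(round(metadata_ram_gib(12, 24)))  # ~262 GiB: the same controller is now short on RAM
```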

Data Fabrics
From the point of view of the customer, vendor drive locking is annoying. It appears to be nothing more than a cynical attempt by the vendor to squeeze money out of a captive audience, and milk that captive audience many vendors absolutely did. Somewhere around 2009 a large number of startups began to emerge with new approaches to storage, most of which aimed to capitalize on frustration with the status quo.

Complicating the simple David-versus-Goliath narrative, however, is that storage vendors absolutely did have good reasons for engaging in drive locking and other practices aimed at driving short refresh cycles. At least, they were good reasons from a certain point of view.

Storage vendors sell their storage assuming that they will be the only storage solution in use for a given workload. Customers are expected to perform backups, but the storage array itself must always be up, always delivering rock-solid storage. Lives may depend on it. This approach to storage was established before data fabrics existed, and entire generations of engineers, storage administrators and storage executives were raised with this thought process.

Today, however, we do have data fabrics. A storage array has redundant drives to protect against drive failure, and even redundant controllers to protect against controller failure. Data fabrics allow organizations to go a step further and spread data across multiple arrays such that entire storage arrays become redundant.

Most data fabrics offer much more functionality than simply writing data to multiple storage arrays in order to protect data against the failure of an entire array. Data fabrics also continuously reassess the performance of all storage in the fabric, and then place data on appropriate storage, based on the profile settings for the workloads using that storage.

If an array is underperforming (say, because you stuffed it full of higher-capacity drives than it was designed for), the data fabric treats that array as a cold storage destination. This means the fabric only places data blocks on that array if they're unlikely to be accessed frequently, or if they belong to a workload whose storage profile marks it as latency-insensitive.
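A minimal sketch of that kind of placement decision follows, assuming a fabric that tracks per-array latency and per-block access frequency. The profile fields, thresholds and array names are all hypothetical.

```python
# Minimal sketch of fabric-style data placement. The metrics, thresholds and
# names are invented for illustration; real fabrics use far richer telemetry
# and policy engines.

from dataclasses import dataclass

@dataclass
class Array:
    name: str
    avg_write_latency_ms: float   # measured continuously by the fabric
    free_tb: float

@dataclass
class Block:
    workload: str
    access_frequency: float       # accesses per hour, tracked by the fabric
    latency_sensitive: bool       # from the workload's storage profile

COLD_LATENCY_THRESHOLD_MS = 5.0   # assumed cutoff for treating an array as "cold"

def place(block: Block, arrays: list[Array]) -> Array:
    """Send hot or latency-sensitive blocks to fast arrays; everything else
    can land on underperforming arrays treated as cold storage."""
    fast = [a for a in arrays if a.avg_write_latency_ms < COLD_LATENCY_THRESHOLD_MS]
    cold = [a for a in arrays if a not in fast]
    hot_block = block.latency_sensitive or block.access_frequency > 10
    candidates = fast if hot_block else (cold or fast)
    return max(candidates, key=lambda a: a.free_tb)  # favor the emptiest candidate

arrays = [Array("new-nvme", 0.4, 20.0), Array("old-out-of-support", 9.5, 60.0)]
print(place(Block("analytics-archive", 0.2, False), arrays).name)  # old-out-of-support
print(place(Block("oltp-db", 500.0, True), arrays).name)           # new-nvme
```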

Frankenstorage
The beauty of data fabrics is that you can use any storage you want. Do you want to build whitebox storage out of Supermicro's utterly ridiculous lineup? Go right ahead: You'll find units allowing you to cram 90 3.5-inch spinning rust drives into 4U. Or how about 48 NVMe drives in a 2U server?

Supermicro is easy to point to for extreme whitebox solutions, but most server vendors now offer Open Compute Project (OCP) designs, and OCP storage is now a thing. There's also the Backblaze Storage Pod for those looking to wring every last penny out of their whitebox storage designs.

The basic premise of data fabrics is that you put together a whole bunch of spinning rust for bulk storage purposes, and then add SSDs (increasingly NVMe SSDs) for performance. If you need more capacity, you add more spinning rust. If you need more performance, you add more SSDs. You let the fabric figure out where to put the data, and by adding only what you need, when you need it, you get the job done.
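That scaling decision can be reduced to a rough rule of thumb, sketched below. The utilization and latency thresholds are assumptions for illustration; real fabrics expose far more nuanced policy knobs.

```python
# Rough sketch of the "add what you need" decision. Thresholds are assumed.

def what_to_add(capacity_used_pct: float, p99_read_latency_ms: float) -> str:
    """Capacity pressure means more bulk HDD; latency pressure means more
    (NVMe) SSD. The fabric rebalances data either way."""
    actions = []
    if capacity_used_pct > 80:
        actions.append("add bulk HDD nodes")
    if p99_read_latency_ms > 2.0:
        actions.append("add NVMe SSD nodes")
    return ", ".join(actions) or "no expansion needed"

print(what_to_add(85, 0.8))   # add bulk HDD nodes
print(what_to_add(60, 4.5))   # add NVMe SSD nodes
```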

But data fabrics don't have to be whitebox storage. Many data fabrics allow administrators to add any storage they happen to have access to into the fabric. Your old storage array that's out of support? Add it to the fabric, scour eBay for replacement drives and run that array right into the ground. Cloud storage? Punch in your credentials and voilà: additional capacity.

If and when your no-longer-supported array finally dies, that's totally OK: the data that was stored on it is replicated elsewhere in the fabric, and the fabric will continue serving workloads uninterrupted. Most data fabrics will sense the loss of the array and begin replicating additional copies of the data to compensate for the failed storage.
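Conceptually, that rebuild process looks something like the sketch below, assuming a fabric that tracks which arrays hold a copy of each block and targets a fixed replica count. The block names, array names and data structures are invented for illustration.

```python
# Sketch of how a fabric might restore redundancy after losing an array.
# Real fabrics track replicas per extent or object, with consensus and
# rebuild throttling; this only shows the bookkeeping idea.

TARGET_REPLICAS = 2

# Which arrays currently hold a copy of each block (hypothetical state).
replica_map = {
    "block-001": {"old-array", "new-nvme"},
    "block-002": {"old-array", "whitebox-jbod"},
    "block-003": {"new-nvme", "whitebox-jbod"},
}

def handle_array_loss(failed: str, healthy: list[str]) -> None:
    """Drop the failed array from every block's replica set, then re-replicate
    any block that has fallen below the target copy count."""
    for block, holders in replica_map.items():
        holders.discard(failed)
        while len(holders) < TARGET_REPLICAS:
            target = next(a for a in healthy if a not in holders)
            holders.add(target)  # stands in for copying the block's data
            print(f"re-replicating {block} -> {target}")

handle_array_loss("old-array", healthy=["new-nvme", "whitebox-jbod"])
# re-replicating block-001 -> whitebox-jbod
# re-replicating block-002 -> new-nvme
```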

Traditional storage administrators often refer to data fabrics derisively as "Frankenstorage." Data fabric vendors are terrified of anyone using the term because they feel it has negative connotations, but in fact it's the perfect analogy. Data fabrics allow organizations to take any number of different storage solutions from any number of vendors and weld them together into something that is far more than the sum of its parts. Like Frankenstein's monster, data fabrics are deeply misunderstood, and the subject of unwarranted Fear, Uncertainty and Doubt (FUD).

Any new technology in IT faces opposition, but data fabrics are no longer just a science project. There are multiple vendors offering proven solutions, and they're starting to have real-world impacts on how organizations handle storage refreshes. If you're looking at yet another round of forklift upgrades, it's a technology worth considering.
