In-Depth

Big Data, Big Challenges

Virtualization and converged infrastructures are driving storage networking.

Prompted by the continuing explosion of big data and the need to exploit its lucrative potential, storage networking has become a make-or-break proposition for companies that are increasingly deploying converged infrastructures based on virtualization, cloud computing and a host of point solutions that are all aimed at turning big data into big bucks.

The runaway growth of data has been a reality for many years. Containing it has been one obstacle; capturing it has been another. The challenge now is unifying the data so that increasingly capable analytics tools can be used to share its value across the enterprise. While long-term strategies are being formulated, short-term tactics are stemming the tide. These tactics include de-duplication, snap-shotting, thin provisioning and other technologies, deployed in a tiered configuration with solid-state disks (SSDs) at the performance tier and multiterabyte hard disks at the capacity tier.
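As one illustration of how these tactics reclaim capacity, the sketch below shows de-duplication in its simplest form: chunk the data, hash each chunk, and store only chunks that haven't been seen before. It's a minimal, hypothetical example; the fixed chunk size and in-memory layout are assumptions for clarity, not any particular vendor's implementation.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed block size; real systems often use variable-size chunks

def deduplicate(stream):
    """Split a byte stream into fixed-size chunks and store each unique chunk once.

    Returns (chunk_store, recipe): unique chunks keyed by SHA-256 digest, plus the
    ordered list of digests needed to reconstruct the original stream.
    """
    chunk_store = {}   # digest -> chunk bytes (stored once)
    recipe = []        # ordered digests describing the original stream
    for offset in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in chunk_store:
            chunk_store[digest] = chunk      # new data: keep it
        recipe.append(digest)                # duplicate data: just reference it
    return chunk_store, recipe

def restore(chunk_store, recipe):
    """Rebuild the original stream from the stored chunks and the recipe."""
    return b"".join(chunk_store[d] for d in recipe)

# A backup stream with heavy repetition deduplicates to a fraction of its logical size.
data = b"A" * 8192 + b"B" * 4096 + b"A" * 8192
store, recipe = deduplicate(data)
assert restore(store, recipe) == data
print(f"logical bytes: {len(data)}, stored bytes: {sum(len(c) for c in store.values())}")
```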

Meanwhile, tape, the ancient mariner of storage networking technologies, continues to provide a shifting value proposition. It's being used less and less as a backup tool or for active archiving because it makes more sense to use disk storage and de-duplication when backup periods do not exceed 30 to 60 days. However, tape still works in cases of long-term data preservation. For example, medical X-rays may originally be rendered to disk storage for two to five years, but 20 years later, only 1 percent of that data may still be valuable. But no one knows which 1 percent that is -- so it's all housed on tape.

According to Rick Villars, vice president, information and cloud at IDC, tape has lost much of its value because once data is written to it, there's no easy way to pinpoint that data's exact location. Now, Villars notes, a technology called the Linear Tape File System (LTFS) lets users preserve the data's context by writing the metadata at the beginning of the tape.
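LTFS records that metadata as an XML index, so software can look up where a file lives before the tape ever has to seek. The sketch below parses a simplified, made-up index layout to show the idea of resolving a file name to a block location; the element names are illustrative, not the exact LTFS schema.

```python
import xml.etree.ElementTree as ET

# A simplified, illustrative index in the spirit of an LTFS index partition.
# (Element and attribute names here are assumptions for the sketch.)
INDEX_XML = """
<index>
  <file>
    <name>xray_2012_patient042.dcm</name>
    <extent startblock="1024" bytecount="52428800"/>
  </file>
  <file>
    <name>xray_2012_patient043.dcm</name>
    <extent startblock="2048" bytecount="48234496"/>
  </file>
</index>
"""

def locate(index_xml, filename):
    """Return (start block, byte count) for a file named in the tape index."""
    root = ET.fromstring(index_xml)
    for entry in root.findall("file"):
        if entry.findtext("name") == filename:
            extent = entry.find("extent")
            return int(extent.get("startblock")), int(extent.get("bytecount"))
    raise FileNotFoundError(filename)

block, size = locate(INDEX_XML, "xray_2012_patient042.dcm")
print(f"file starts at block {block} and spans {size} bytes")
```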

Virtualization Domination
Villars says that during the last four to five years, storage networking has been dominated by virtualization and the movement toward converged infrastructures. He adds that virtualization is building on its consolidation legacy by moving into areas such as virtual desktop infrastructure (VDI) -- which he says is moving out to mobile apps -- and playing a prominent role in other higher-performance datacenter applications.

"Now a lot of the technology issues and the investments in storage networking are really being driven by how aggressively people are moving into this converged model, but it's still in the early days," Villars notes. "The people who are moving into it are the very largest players; it's the cloud services providers, the larger enterprises and institutions like financial services companies -- the people who have been the most aggressive adopters of virtualization."

Villars talks about virtualization users increasing their traditional virtual machine (VM) density levels in storage networking environments. For an average deployment, that could mean going from eight VMs per physical machine to 12, while larger users such as cloud service providers might push density to 20 to 30 VMs, along with a support infrastructure that delivers all the necessary network, I/O and storage capabilities. A third option is not increasing VM density at all, but rather balancing workloads so that users can run high-performance SAP or database applications without requiring excessive infrastructure support.
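A rough sizing calculation shows why density drives the storage network: every VM added to a host adds I/O and capacity that the shared infrastructure must absorb. The per-VM figures below are hypothetical, chosen only to illustrate the scaling math.

```python
# Hypothetical per-VM requirements, used only to illustrate how density scales demand.
IOPS_PER_VM = 50    # assumed steady-state I/O operations per second per VM
GB_PER_VM = 60      # assumed provisioned capacity per VM

def host_requirements(vm_density):
    """Aggregate storage demand a single physical host places on the shared array."""
    return {
        "vms": vm_density,
        "iops": vm_density * IOPS_PER_VM,
        "capacity_gb": vm_density * GB_PER_VM,
    }

for density in (8, 12, 30):   # average deployment, denser deployment, cloud provider
    req = host_requirements(density)
    print(f"{req['vms']:>2} VMs/host -> {req['iops']:>5} IOPS, {req['capacity_gb']:>5} GB")
```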

"You basically have companies that say, 'I want to be able to accommodate those different types of workloads using not a storage or networking foundation, but a common IT foundation, and I want the network storage and server elements to be much more tightly integrated,'" Villars says.

"While all that's going on," he adds, "a lot of those big enterprises are still maintaining their very large Unix and mainframe systems, and within that space, they absolutely want their existing Fibre Channel SANs to continue to grow and scale for that very dedicated workload."

Multivendor Fibre Channel SANs have a long history of problems relating to interoperability, shared storage and scalability that still persist to a lesser degree today. The interoperability issues revolve around resyncing these complex systems after patches are added, while the scalability and shared storage problems are tied to the monolithic nature of SANs and their inability to handle the high numbers of workloads common to VDI environments. A decade after their high-water mark, three major SAN vendors remain: Brocade Communications Systems Inc., Cisco Systems Inc. and QLogic Corp.

John Webster, senior partner with the Evaluator Group, is a veteran SAN observer. In his words, "I have users who argue that multivendor SANs are still complex, and that's just the nature of SANs -- but the interoperability issues seem to have gone away."

Villars says the biggest change in the traditional storage networking arena has been the growing shift from technologies like Fibre Channel to Ethernet for data network traffic. Serial Attached SCSI (SAS) is also gaining prominence as it's increasingly used for hardware interconnects. SAS is a performance improvement over traditional SCSI because it allows up to 128 devices of different types to be connected simultaneously over thinner, longer cables. In addition, its full-duplex signaling supports 3.0Gb/s transfer rates, and SAS drives can be hot-plugged.

Villars notes that most of the significant storage networking technology changes have been in storage systems as opposed to storage networks. He says the prior generation of storage companies -- Compellent, EqualLogic and 3Par among them -- has been acquired and is now competing with a new generation of companies like IceWeb Inc., which makes unified data storage appliances. Another new competitor is Starboard Storage Systems Inc., which is dedicated to helping small to midsize businesses manage mixed-storage workloads including unstructured, virtualized and structured data. He also notes the presence of content-oriented companies such as Amplidata (cloud storage) and Scality (large-scale storage management and infrastructure).

"So the storage market is becoming more diverse because you have to specialize," he notes. "'Do I support highly virtualized compute environments, or am I focused on dealing with big archives and rich content?'"

I/O Virtualization
I/O virtualization is another tool for improving storage performance, and Villars says it was created to compensate for inadequate disk drives. "In a lot of your higher-performance transaction type applications, the dirty secret of storage and storage networking for a long time was the disk drives," he says. "They actually haven't improved in terms of performance for more than a decade. They've increased in size, but they can't spin any faster, so there's no real way to increase their performance. So what people have done is add more disks, add more heads -- but even that wasn't enough."

In an effort to rectify the situation, some IT shops started saving data on only the outside 10 percent of disk drives, which minimizes head movement. Via this technique -- known as short-stroking -- these shops were willing to maximize I/O even if it meant wasting between 50 percent and 90 percent of the disk system capacity. The arrival of SSDs solved this problem, because one SSD produced the same I/O as 10 short-stroked hard disks.
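A back-of-the-envelope calculation makes the trade-off concrete. The drive and SSD figures below are assumptions for illustration, not measurements, but they show why shops tolerated leaving most of a disk idle -- and why one SSD can displace a shelf of short-stroked spindles.

```python
# Assumed, illustrative performance figures (not measurements).
HDD_IOPS = 180            # rough random-I/O ceiling of a fast spinning drive
SHORT_STROKE_GAIN = 1.4   # assumed boost from confining data to the outer tracks
USABLE_FRACTION = 0.10    # only the outer 10 percent of each drive holds data
SSD_IOPS = 2500           # assumed early-generation solid-state drive

drives = 10
short_stroked_iops = drives * HDD_IOPS * SHORT_STROKE_GAIN
wasted_capacity = 1 - USABLE_FRACTION

print(f"{drives} short-stroked drives: ~{short_stroked_iops:.0f} IOPS, "
      f"{wasted_capacity:.0%} of their capacity left idle")
print(f"1 SSD: ~{SSD_IOPS} IOPS, no capacity sacrificed")
```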

This apparent success story created another concern, however, because a lot of storage systems weren't intelligent enough to take advantage of emerging solid-state technology, which meant manual intervention was required and made it difficult to properly tune systems. "That's changing now, and that's what a lot of I/O virtualization is about," Villars says. "Can we make it so the system -- the storage and the network together -- are smart enough to determine that some pieces of data need to be on solid state, while others can be put out on slow-moving disks without manual intervention?"
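The "smart enough" behavior Villars describes amounts to a placement policy: watch how often each piece of data is touched and keep the hottest pieces on flash. The sketch below is a minimal, hypothetical version of such a policy -- real arrays use far more sophisticated heuristics -- meant only to show the mechanism of automated tiering without manual intervention.

```python
from collections import Counter

SSD_CAPACITY = 2   # assumed number of extents the solid-state tier can hold

class AutoTier:
    """Toy automated-tiering policy: the most frequently accessed extents live on SSD."""

    def __init__(self):
        self.heat = Counter()   # extent id -> access count

    def record_access(self, extent_id):
        self.heat[extent_id] += 1

    def placement(self):
        """Decide, with no manual intervention, which extents belong on which tier."""
        hot = {extent for extent, _ in self.heat.most_common(SSD_CAPACITY)}
        return {extent: ("ssd" if extent in hot else "hdd") for extent in self.heat}

tiering = AutoTier()
workload = ["db-index"] * 40 + ["db-log"] * 25 + ["vm-boot"] * 15 + ["archive-2009"] * 2
for extent in workload:
    tiering.record_access(extent)

for extent, tier in tiering.placement().items():
    print(f"{extent:>12} -> {tier}")
```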

The Implications of Cloud
The advent of cloud storage has brought with it a vast new repository, reached over encrypted links that can guarantee the security of data as it moves back and forth between users and cloud service providers. What those links can't guarantee, of course, is the security of Tier 1 data once it sits in the cloud, and that remains a big roadblock to widespread cloud acceptance.

Webster echoes many experts when he says most companies are unwilling to base their Tier 1 data in the cloud. "It depends on the philosophy of the company," he says. "It's common to see smaller startups that are developing Web-facing applications spin them up in the cloud. It's less common to see established Fortune 1,000 companies do that, but I think they're starting to experiment with that as well. Certainly, their data governance policies don't allow them to keep primary data in the cloud, particularly that which is customer-sensitive."

Webster believes the best cloud-based storage has an application wrapped around it: while it looks like just another application running in the cloud, it relieves internal storage admins of having to keep track of that storage onsite, which is a major infrastructure expense.

"I believe that even though the cloud has still got its security concerns, cloud providers are addressing them, and there's a lot to be said for the future of cloud-based storage," Webster notes.

When Villars is asked about storing Tier 1 data in the cloud, he says it's happening "within certain limits," and adds that there are newer companies, such as Netflix Inc., that have built their business models on the cloud. He then goes on to say that many traditional enterprises that have no interest in the cloud model may do something that many people might view as more radical: "What we're seeing is, they're coming back to datacenter discussions and saying, 'Well, do I build a new datacenter with these new technologies and converged infrastructures, or do I look to my service provider or a hosting service provider to basically build and run a new datacenter for me?' That's the private cloud hosting model that's emerging."

SNIA Cloud Focus
Wayne Adams is chairman of the Storage Networking Industry Association (SNIA), which develops and promotes standards, technologies and educational services to empower the management of information. SNIA holds conferences in the United States and Europe, and it has 150 unique member organizations with more than 4,000 volunteers. Although it's primarily composed of vendors, it also includes user companies.

In October 2009, SNIA announced the formation of its Cloud Storage Initiative (CSI), with the purpose of fostering the growth and success of the market for cloud storage. In addition to coordinating all cloud-related activities in SNIA, CSI is responsible for managing many cloud-related programs, including those for education, marketing, technical coordination with the Cloud Storage Technical Work Group, business development and cross-industry efforts on cloud standards.

CSI highlights a technical standard called the Cloud Data Management Interface (CDMI). CDMI is built around metadata: a set of parameters that serve as instructions to service providers describing how data should be managed once it's placed in the cloud. The parameters cover how many copies of the data are kept, the service level provided from a latency standpoint, whether there are restrictions on who can modify the data, and geographic constraints tied to geopolitical boundaries.
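Because CDMI rides on HTTP and expresses those instructions as metadata on the stored object, a request can carry its own management policy. The sketch below is a hypothetical client call: the endpoint and credentials are made up, and the metadata names are modeled on the kinds of data-system metadata CDMI defines (copy count, latency target, geographic placement), so treat the request as illustrative rather than verbatim.

```python
import json
import requests  # third-party HTTP client, used here for readability

# Hypothetical CDMI endpoint; a real deployment would use the provider's URL.
CDMI_URL = "https://cloud.example.com/cdmi/backups/q3-report.dat"

# Data-system metadata in the spirit of CDMI: how many copies to keep,
# the latency the provider should deliver, and where the data may live.
policy = {
    "metadata": {
        "cdmi_data_redundancy": "3",           # keep three copies
        "cdmi_latency": "50",                  # target read latency, in milliseconds
        "cdmi_geographic_placement": ["EU"],   # constrain data to EU jurisdictions
    },
    "value": "aGVsbG8gY2xvdWQ=",               # payload, base64-encoded in this sketch
}

response = requests.put(
    CDMI_URL,
    headers={
        "Content-Type": "application/cdmi-object",
        "X-CDMI-Specification-Version": "1.0.2",
    },
    data=json.dumps(policy),
    auth=("user", "secret"),   # placeholder credentials
)
print(response.status_code)
```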

"That's what the SNIA's CDMI technical specification is all about," Adams notes. "It's to enable hybrid cloud computing as well as cloud brokering, where cloud service providers can procure services from other cloud service providers."

Big Data Definition
Big data, like so many new IT concepts, is subject to multiple definitions. As Adams notes, "If you talk to five different people you'll hear five different answers. Some people say, 'Geez, I've got a lot of data. Does that mean it's big data?'" He puts it in perspective by explaining that people who have been working in high-performance computing understand some of the constraints associated with moving terabyte datasets across geographic areas in order to federate computing across different sites.

"So the big growth in big data is in the enterprise, with data analytics driving the business," Adams says. "Now there's a need to continue to mine all types of customer data that's being collected on a daily basis, and be able to mine it for additional ways to serve customers, or expand the market with new products and services."

John Deere, one of the world's biggest manufacturers of agricultural machinery, is a good example of a company that's dedicated to making the most of its big data. It understands that the farming industry has become highly specialized, which creates intense competition in various market niches, like planting and harvesting. According to Adams, John Deere uses its tractors as a primary point of presence for collecting and generating data on everything from the tractor's geographic location, to real-time soil and temperature samples, to the composition of fertilizers.

The challenge with all this valuable, raw data is getting it back to John Deere's headquarters in Moline, Ill. To make that happen, the company is considering placing the data in regional datacenters or hosted clouds, from which it can be accessed in waves for post-processing. Then it can be brought back to Moline, where it can be used to run programs that will, for example, tell farmers how to best maintain their equipment.

"So how do you get all that data on a real-time basis, and do something useful with it?" Adams asks. "How do you collect the larger data, where do you process it, how do you store it, and how do you move it in an efficient manner without building new datacenters close to where the data's being generated? That's where big data in the enterprise is at."

Coca-Cola Builds the Ultimate Soda Jerk
The Coca-Cola Co. Freestyle System is another excellent example of a cutting-edge big data implementation that feeds a back-end analytics engine. As Villars describes it, Freestyle is a Coca-Cola stand, a new service fountain that serves up 125 different combinations of Coke products, including lemonade, Coca-Cola, Sprite and Dasani water. The big data value-add comes from the real-time sensors attached to each of these systems that record every purchase, along with the time of day, ambient temperature and location.

Freestyle enables Coca-Cola to build real-time preferences for different soda brands and flavors. The system is linked to production so that when the company wants to launch, for example, Coke with orange flavor, it can do so on an informed basis because it has tested the new soda in various geographic locations to see where it's most popular. There's also a link to logistics, so if any drink has a spike in demand, the company will, for example, restock the shelves at Wal-Mart with peach-flavored Dasani if that's the hot product of the moment. There's also a promotional component designed to pump up product sales if the analytics engine decides a market area is primed for growth. If promotional efforts fail, that information is also filed away for future use.
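A stripped-down version of the analytics side might look like the sketch below: aggregate dispense events by location and product, then flag demand spikes for restocking. The event fields mirror what the article says the sensors record; the names, values and threshold are made up for illustration.

```python
from collections import Counter, defaultdict

# Each dispense event mirrors what the sensors capture: product, time of day,
# ambient temperature and location. The values below are invented for the sketch.
events = [
    {"product": "peach dasani", "hour": 14, "temp_f": 91, "city": "Atlanta"},
    {"product": "peach dasani", "hour": 15, "temp_f": 92, "city": "Atlanta"},
    {"product": "sprite",       "hour": 12, "temp_f": 71, "city": "Chicago"},
    {"product": "orange coke",  "hour": 18, "temp_f": 88, "city": "Atlanta"},
    {"product": "peach dasani", "hour": 16, "temp_f": 90, "city": "Atlanta"},
]

SPIKE_THRESHOLD = 3  # assumed: this many dispenses of one product in one city counts as a spike

preferences = defaultdict(Counter)   # city -> product -> dispense count
for event in events:
    preferences[event["city"]][event["product"]] += 1

for city, counts in preferences.items():
    favorite, volume = counts.most_common(1)[0]
    print(f"{city}: top product is {favorite} ({volume} dispenses)")
    if volume >= SPIKE_THRESHOLD:
        print(f"  -> demand spike: flag {favorite} for restocking in {city}")
```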

"Coke recognizes that the whole point of all these investments is to be more agile as a business," Villars states. "They know they must be able to capitalize on developments in mobile technology, analytics technology and, ultimately, integration into the business process."

It's All About the Datacenter
Villars emphasizes that the key challenge for storage network users is recognizing that storage networking is becoming increasingly integrated into the "broader datacenter discussion." This discussion is about how, in the converged environment of virtualization, cloud computing and rich content, key decisions are not being made at the network, server or storage levels, but rather at the datacenter level. He says the implications of this transition will have a significant impact on traditional datacenter staffing, and storage networking admins need to recognize that parts of their jobs are going away -- they're being automated, and they're being integrated into new platform technologies.

Villars goes on to say that the ongoing value of storage admins depends on their ability to keep data moving between datacenters -- between pools of analytics, compute and content. Data needs to be moved and managed, and it needs to stay secure and available. "If you're a storage network administrator, that's where you should be paying attention to your time," he says. "'How am I tuning the network, and what kinds of systems am I putting in place to ensure the least amount of friction when it comes to moving data around?'"

Webster advises IT to immediately start evaluating its storage environments with a three-to-five-year horizon in mind. That means figuring out if current data protection, long-term storage and growth management plans are adequate, or if they're going to break down over that time frame. In order to illustrate the challenges facing users, he explains that his company has clients who are looking at storage resource management apps that have been around for years because they're coming to the conclusion that the latest technology is becoming too complex for them to handle.

"That's one way we've seen people address this," he says. "Another is to buy a new datacenter, believe it or not, because the existing one is just not going to accommodate growth. The attitude is, 'We can patch it here, and we can patch it there, but at some point, we have to take a step back and admit we can patch this thing forever and still not accomplish what we need to accomplish, so maybe we just ought to start over, in a way.' We're actually seeing a lot of that."

While they plan their long-term strategies, Webster advises IT organizations to utilize more modern storage arrays built around de-duplication, compression and single instancing, while also deploying snap-shotting, thin provisioning, automated tiering and data migration -- techniques he says may only "forestall the inevitable" in some cases. He says companies that are planning new datacenters should be certain to include these solutions in their infrastructure designs.

Storage networking technology is a moving target, and at this point it appears that the movement is toward convergence. That means IT organizations need to evaluate the impact of virtualization on their datacenters and decide whether those datacenters should be replaced. It may be that a services-oriented approach is more cost-efficient. One way or another, the big payoff is a storage networking strategy that deploys analytics toward the goal of unifying big data and sharing its value across the enterprise.
