5 Reasons Virtual Big Data Is a Big Deal
- By Mike Matchett
We'll see Hadoop as a Service before year end, but there are other reasons it's getting bigger.
Hadoop has made it possible to tackle humongous amounts of data with certain parallel computing algorithms. One of the fundamental design points that makes Hadoop so IT-friendly is that it can leverage scale-out racks of commodity servers with plain direct-attached storage (DAS) disks. This simple infrastructure brings big data opportunities within reach of everyone.
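To make the parallel-processing idea concrete, here is a minimal sketch of the classic word-count job in the style of Hadoop Streaming, where the map and reduce phases are ordinary scripts. This is an illustration of the programming model only; a real job would be launched through the hadoop-streaming jar across many nodes, and the function names here are my own.

```python
# Illustrative sketch of Hadoop's map/reduce model (word count).
# In a real cluster, map_phase runs in parallel on many nodes,
# each against its local chunk of data.
from collections import defaultdict

def map_phase(lines):
    """Emit (word, 1) pairs, as a streaming mapper would for each input line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Sum counts per word; Hadoop shuffles and groups keys between phases."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

if __name__ == "__main__":
    text = ["Hadoop scales out", "Hadoop runs on commodity servers"]
    print(reduce_phase(map_phase(text)))
```

Because every mapper works independently on its own chunk, the job scales by simply adding more commodity nodes.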
Here are five reasons I think virtualized big data Hadoop is a big deal.
When VMware announced Project Serengeti last year, it seemed like a marketing hack, a way to show off some lab research. But consider server virtualization's march forward. Remember virtualizing those first lab and test servers, when it was considered crazy to virtualize production business applications? Then it became standard practice to virtualize mission-critical apps such as Microsoft Exchange and SQL Server. Today, even high-I/O applications can be successfully virtualized if you include in-memory and flash technologies.
VMware recently announced the beta of vSphere Big Data Extensions (BDE), a commercialized evolution of Project Serengeti that integrates Hadoop into the vSphere platform. BDE brings a number of things necessary to effectively host Hadoop in a virtual environment, including a nice admin GUI for on-demand provisioning with quality of service (QoS) management controls.
One of the great things about virtually hosting Hadoop is that clusters of almost any Hadoop distro can now be spun up and down on demand with just a few clicks. Further, multiple clusters can now share the same set of hypervisor server nodes. Under QoS control, BDE adds and removes nodes from each cluster to maintain prioritized cluster performance. Production and dev/test clusters can run side-by-side, and even access the same data sets.
Each Hadoop node is run as a VM. In fact, the compute-side "task trackers" can be run as separate VMs from the data nodes (unlike the physical implementation, where each commodity server has both). With this flexibility, compute nodes and data nodes can be scaled independently. Data nodes can be shared by multiple compute nodes and disk failures can be further isolated. This supports opportunities to efficiently use external SAN resources, too, which -- depending on data management and access requirements -- might be worth exploring.
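The split between compute and data nodes shows up directly in how a virtual cluster is described. The following is a rough sketch of what a Serengeti-style cluster specification can look like, with separate node groups for task trackers and data nodes; the field names follow the Serengeti spec-file format, but the group names, counts and sizes here are purely illustrative.

```json
{
  "nodeGroups": [
    {
      "name": "compute",
      "roles": ["hadoop_tasktracker"],
      "instanceNum": 8,
      "cpuNum": 2,
      "memCapacityMB": 4096,
      "storage": { "type": "LOCAL", "sizeGB": 20 }
    },
    {
      "name": "data",
      "roles": ["hadoop_datanode"],
      "instanceNum": 4,
      "cpuNum": 2,
      "memCapacityMB": 4096,
      "storage": { "type": "SHARED", "sizeGB": 100 }
    }
  ]
}
```

Scaling the "compute" group up or down then becomes a matter of changing an instance count, while the "data" group (and the data it holds) stays put.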
It's High Performance
Key to Hadoop performance is maintaining "data locality." The Hadoop Distributed File System (HDFS) keeps chunks of data local to each compute node so that parallel tasks can independently process each chunk. In a naive virtual environment, dynamic VM movement and storage abstraction (for example, virtual disk images) could seriously hamper performance. This is where the Hadoop Virtualization Extensions (HVE), part of Apache Hadoop 1.2, come into play. With HVE, each hypervisor is effectively represented in Hadoop as a "node group," providing additional storage location intelligence. Data chunks stored on a local hypervisor's DAS are managed as local to all the nodes assigned to that hypervisor.
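Hadoop learns about locations like these through a pluggable topology script (the executable named by the `net.topology.script.file.name` property), which maps host names to network locations. The sketch below shows the idea: with a node-group-aware topology, the location path carries a hypervisor layer, so two VMs on the same host are recognized as local to each other. The hostname-to-location table is hypothetical.

```python
#!/usr/bin/env python
# Sketch of a Hadoop topology script. Hadoop invokes it with host
# names/IPs as arguments and reads one network location per line.
# The inventory below is hypothetical; the extra "hypervisor" path
# component illustrates the node-group layer HVE adds.
import sys

TOPOLOGY = {
    "vm-compute-01": "/rack1/hypervisorA",
    "vm-data-01":    "/rack1/hypervisorA",  # same node group: treated as local
    "vm-compute-02": "/rack1/hypervisorB",
}

DEFAULT = "/default-rack/default-nodegroup"

def resolve(host):
    """Return the network location Hadoop uses for locality decisions."""
    return TOPOLOGY.get(host, DEFAULT)

if __name__ == "__main__":
    for h in sys.argv[1:]:
        print(resolve(h))
```

Here `vm-compute-01` and `vm-data-01` share a location prefix, so a task scheduled on the compute VM can read the data VM's chunks without leaving the hypervisor.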
Soon it will make total sense for IT shops to offer Hadoop clusters on-demand, leveraging the same virtualized environment used for all other "modern" applications in the burgeoning IT cloud-delivery model. Certainly, larger, constant-use Hadoop clusters of the kind in production at famous Web 2.0 companies with petabytes of daily data may always be hosted on massive scale-out racks of commodity servers. But far more common are big data solutions at the 10TB to 20TB scale that require only transient processing by a handful of nodes, whether for research, development or production.
Mike Matchett is a senior analyst and consultant with IT analyst firm Taneja Group.