Planning Primer: Pitfalls

Build a solid and long-lasting virtual infrastructure around virtualization's many moving parts, including servers, storage, network, security and management.

"Let's just get it virtualized, and then we'll worry about the rest..." is a phrase I hear all too often from administrators in the field eager to virtualize. After all, tools such as PlateSpin PowerConvert and VMware Converter allow admins to convert physical computers to virtual machines (VMs) in as little as a few mouse clicks. Rushing to virtualization is always enticing: vendor return on investment (ROI) calculators show enormous and fast returns, so the sooner an organization virtualizes, the sooner it can reap the financial and operational rewards.

While this makes sense in theory, moving too quickly toward virtualization can often lead you into a series of pitfalls, including:

  • Infrastructure instability
  • Failure to meet required service levels
  • Poor performance
  • Reduced infrastructure security

Most organizations have a few naysayers who don't want to virtualize, and falling into a virtualization migration pitfall may give those naysayers the ammo they need to set a planned virtualization project back months or even years. Don't worry. I'm not going to spend any more time shouting out warnings against virtualizing. The bottom line is that with the right plan, virtualizing your IT infrastructure will give you the benefits you expect, and probably additional benefits that come from looking at IT problems differently once a virtual infrastructure is in place.

Memory Share and Share Alike?
Hypervisors that support memory sharing, such as VMware's ESX, can improve consolidation ratios when similar applications and operating systems are located on the same physical host. Memory sharing consolidates redundant read-only memory pages into single-instance read-only pages, which can increase consolidation density by as much as 40 percent.
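To make the mechanism concrete, here's a minimal Python sketch of content-based page sharing in the spirit of what ESX does: page contents are hashed, and pages with identical contents collapse to a single instance. The page contents and data structures are purely illustrative, not how any hypervisor actually stores them.

```python
import hashlib

def share_pages(pages):
    """Collapse identical read-only pages into single instances.

    pages: a list of bytes objects, each standing in for one memory
    page. Returns (unique_pages, pages_saved).
    """
    seen = {}  # content hash -> canonical copy of that page
    for page in pages:
        digest = hashlib.sha256(page).digest()
        seen.setdefault(digest, page)  # first copy wins; duplicates are shared
    return list(seen.values()), len(pages) - len(seen)

# Three VMs running the same OS share most of their code pages.
vm_pages = [b"kernel" * 512, b"libc" * 512, b"app-a" * 512,
            b"kernel" * 512, b"kernel" * 512, b"libc" * 512]
unique, saved = share_pages(vm_pages)
print(f"shared {saved} of {len(vm_pages)} pages")  # shared 3 of 6 pages
```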

Vendors that don't offer memory sharing will push back and tell you that such features hurt performance. There's some truth to that, and the trade-off is similar to choosing when transaction logs are committed to databases: scanning memory pages for redundancy can be a resource-intensive task, but resource utilization is not a big deal if the hypervisor waits for periods of low utilization to perform the memory-consolidation jobs, which is the case with ESX. The flip side is that when you first consolidate, you may not see immediate savings from shared-memory consolidation.

If you wait a week, you'll see a significant difference. You may ask, "If I have to wait a week, what good is it?" Considering that physical servers can remain online without a reboot for months to even years (depending on maintenance and patch requirements), you'll have plenty of time to enjoy the fruits of memory sharing. Also, considering that you'll continually add VMs over time, memory sharing can allow you to stretch your hardware investment further than you may be able to when running hypervisors that don't support memory sharing.

Shared resources will also place a strain on service levels. Service levels are best assured by configuring resource pools, which give you the ability to group VMs and guarantee them specific service levels relating to resource consumption.
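As a rough illustration of how share-based resource pools arbitrate contention, the Python sketch below grants each pool its reservation first, then splits the remaining CPU capacity in proportion to shares. The pool names, reservations and share values are hypothetical, and real hypervisor schedulers are considerably more sophisticated.

```python
def allocate_cpu(host_mhz, pools):
    """Divide a host's CPU capacity among resource pools.

    pools: dict of pool name -> {"reservation": MHz guaranteed,
    "shares": relative weight}. Reservations are honored first;
    whatever remains is split in proportion to shares.
    """
    reserved = sum(p["reservation"] for p in pools.values())
    leftover = max(host_mhz - reserved, 0)
    total_shares = sum(p["shares"] for p in pools.values())
    return {name: p["reservation"] + leftover * p["shares"] / total_shares
            for name, p in pools.items()}

# Hypothetical pools on a 24,000MHz host: production is guaranteed
# 8,000MHz and carries twice the weight of test under contention.
pools = {"production": {"reservation": 8000, "shares": 2000},
         "test":       {"reservation": 0,    "shares": 1000}}
print(allocate_cpu(24000, pools))
# {'production': 18666.66..., 'test': 5333.33...}
```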

Resource Utilization and Performance
Capturing resource utilization is an integral part of consolidation planning; the data is used to determine the physical placement of consolidated VMs, as well as to disqualify some systems from being virtualized.

As a general rule of thumb, the following resource characteristics will prevent a server from being virtualized:

  • Specialized hardware requirements unsupported by the virtualization engine
  • Extremely high resource utilization on bare metal

Specialized adapters such as video capture cards are not supported in VMs on server virtualization platforms; however, if consolidation is a top priority, you could leverage an OS virtualization solution such as Parallels Virtuozzo Containers to consolidate the system and still allow each virtual container to access the specialized hardware device.

High resource utilization will often prevent a server from being virtualized during the initial consolidation project. For me, high is 50 percent; if a host is using more than half its CPU resources on bare metal, holding off on virtualizing that server is generally a good practice.
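A consolidation analysis can encode these disqualifiers directly. Below is a simple Python sketch, assuming a hypothetical inventory format, that applies the 50 percent CPU rule of thumb and a specialized-hardware check to split servers into candidates and holdbacks.

```python
CPU_THRESHOLD = 50.0  # percent; the rule of thumb described above

def split_candidates(servers, unsupported_hw=("video capture",)):
    """Split a server inventory into virtualization candidates and holdbacks.

    servers: list of dicts with hypothetical keys "name", "avg_cpu_pct"
    (bare-metal CPU utilization) and "hardware" (specialized adapters).
    """
    candidates, holdbacks = [], []
    for s in servers:
        special_hw = any(h in unsupported_hw for h in s["hardware"])
        too_hot = s["avg_cpu_pct"] > CPU_THRESHOLD
        (holdbacks if special_hw or too_hot else candidates).append(s["name"])
    return candidates, holdbacks

inventory = [
    {"name": "web01",  "avg_cpu_pct": 12.0, "hardware": []},
    {"name": "db01",   "avg_cpu_pct": 71.5, "hardware": []},
    {"name": "cctv01", "avg_cpu_pct": 8.0,  "hardware": ["video capture"]},
]
ok, hold = split_candidates(inventory)
print("virtualize:", ok)   # virtualize: ['web01']
print("hold back:", hold)  # hold back: ['db01', 'cctv01']
```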

Following are guidelines to avoid performance pitfalls in each core resource: CPU, memory, storage and network.

CPU
Most consolidation projects involve moving workloads from slower CPUs on older systems to faster CPUs on newer systems, so before-and-after CPU performance is never an apples-to-apples comparison. Consolidation tools capable of performing what-if analysis against new hardware (such as, "What will performance look like on an IBM x3650?") can take much of the guesswork out of before-and-after CPU performance.

Organizations are often wary of virtualizing CPU-intensive workloads such as database servers, but it is possible to virtualize these workloads as long as you don't oversubscribe the CPU. CPU oversubscription occurs when the number of virtual CPUs (vCPUs) on a physical host is greater than the number of physical CPUs (pCPUs) on that host. For example, eight two-way VMs (16 total vCPUs) could run on a four-way quad-core system (16 total physical cores) without taxing the hypervisor's CPU scheduler. Of course, pCPUs are frequently oversubscribed for non-CPU-intensive workloads, so keeping the vCPU count at or below the pCPU count is only necessary for CPU-intensive apps.
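The arithmetic is easy to script. This Python sketch, using the example above, computes the vCPU-to-pCPU ratio for a proposed host layout and flags oversubscription; anything over 1:1 should make you think twice before placing CPU-intensive workloads there.

```python
def vcpu_to_pcpu_ratio(vcpus_per_vm, sockets, cores_per_socket):
    """Return the vCPU-to-pCPU (core) ratio for a proposed host layout."""
    total_vcpus = sum(vcpus_per_vm)
    total_cores = sockets * cores_per_socket
    return total_vcpus / total_cores

# Eight two-way VMs on a four-socket quad-core host: 16 vCPUs on
# 16 cores is a 1:1 ratio, safe even for CPU-intensive workloads.
ratio = vcpu_to_pcpu_ratio([2] * 8, sockets=4, cores_per_socket=4)
print(f"{ratio:.2f}:1")  # 1.00:1
if ratio > 1.0:
    print("Oversubscribed: keep CPU-intensive workloads off this host.")
```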

Interoperability must also be considered when planning the compute architecture. Today, for example, you cannot live migrate a VM between an Intel platform and an AMD platform. Even in the case of an offline migration, you still need to be careful because hypervisors do not fully virtualize physical CPUs. This is important because applications with licensing or activation bound to the CPU may require a reactivation once their associated VM is moved from an Intel host to an AMD host.

One other consideration is symmetric multiprocessing (SMP). VMs should only be assigned more than one vCPU if the applications running inside the VM are multithreaded and can actually take advantage of multiple CPUs. Otherwise, the added CPU scheduling overhead associated with the extra vCPU could actually degrade performance.

Some migration tools don't provide the option to select the number of vCPUs in a target VM, so if a two-way physical host is migrated, the resulting VM will automatically be configured with two vCPUs. If that happens and only one vCPU is needed, you should remove the second vCPU before powering on the new VM for the first time. Once the single-vCPU VM powers on, verify that a uniprocessor HAL is installed in Device Manager. While the VM will run fine with an SMP HAL, CPU performance will be slightly degraded.

Memory Loss
Memory is frequently discussed in virtualization planning because it's the most common bottleneck. This is because many of today's hypervisors (with ESX and VMware Server being two exceptions) don't support memory overcommit.

The effect is that in a physical server with 16GB of physical memory, the total memory allocated to all VMs running on the server cannot exceed 16GB. Because VMs are sized based on the maximum amount of memory needed under their peak workload, hypervisors without memory-overcommit support will not provide the same consolidation density as those that do. VMs on the same physical server rarely need all of their memory at the same time, so memory overcommit allows the hypervisor to page some of a VM's memory to disk and allocate physical memory to the VM as needed.
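Here's a quick Python sketch of the sizing check, assuming hypothetical VM allocations: without overcommit support, the sum of allocations is a hard ceiling; with it, the hypervisor can page idle memory to disk when allocations exceed physical RAM.

```python
def memory_fit(host_gb, vm_allocations_gb, overcommit_supported):
    """Check whether a set of VM memory allocations fits on a host.

    Without overcommit support, the sum of allocations is a hard
    ceiling; with it, the hypervisor can reclaim or page idle memory
    when allocations exceed physical RAM.
    """
    total = sum(vm_allocations_gb)
    if total <= host_gb:
        return f"fits: {total}GB allocated of {host_gb}GB physical"
    if overcommit_supported:
        return f"overcommitted: {total}GB on {host_gb}GB (hypervisor pages idle memory)"
    return f"will not fit: {total}GB exceeds the {host_gb}GB hard limit"

vms_gb = [4, 4, 4, 4, 2]  # hypothetical peak-sized allocations
print(memory_fit(16, vms_gb, overcommit_supported=False))
print(memory_fit(16, vms_gb, overcommit_supported=True))
```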

A common practice is to just throw more memory at a performance problem, but that doesn't always work. Take Exchange Server 2003, for example, which is capable of using up to 4GB of RAM. Running Exchange 2003 in a VM with more than 4GB of allocated memory would simply be a waste of resources. Be aware of the memory maximums associated with each virtualized app and allocate memory accordingly.

Finally, you should also keep a close eye on hardware-assisted memory virtualization, a technology currently shipping on AMD's quad-core Opteron processors. Intel's implementation (named Extended Page Tables) should be available later this year.

Hardware-assisted memory virtualization has shown substantial improvements in enterprise application performance in virtualized environments, especially with multithreaded apps. The bottom line is that hardware-assisted memory virtualization allows VMs to manage their own page tables directly, removing the bottleneck of shadow page table-based memory virtualization. When selecting hardware and virtualization platforms, check that both support hardware-assisted memory virtualization, especially if you have future plans to virtualize multithreaded enterprise applications.

Storage
I personally only deploy virtualization solutions as part of high-availability (HA) clusters, so in my opinion, shared storage is a must. When evaluating storage solutions, be careful to ensure that adequate bandwidth will be available for all VMs sharing a particular storage interface. For example, suppose an iSCSI array allows you to aggregate two 1Gbps interfaces into a single 2Gbps link. That's good, but such a feature is only useful if you can perform similar aggregation on the other end of the data path: at the hypervisor. Not all software iSCSI initiators support multipathing, so if multipath and aggregated links are needed to meet I/O requirements, you'll need to use a hardware iSCSI HBA, such as the QLogic QLE4062C.
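The following Python sketch illustrates the point with hypothetical numbers: aggregated links only count if the initiator end can use them, so the same array delivers half the usable bandwidth when the software initiator lacks multipath support.

```python
def storage_bandwidth_check(vm_peak_mbps, link_mbps, initiator_multipath):
    """Compare VM peak storage I/O demand against usable link bandwidth.

    Aggregation only counts if both ends of the data path support it:
    without multipath at the initiator, only one link is effective.
    """
    usable = sum(link_mbps) if initiator_multipath else max(link_mbps)
    demand = sum(vm_peak_mbps)
    verdict = "OK" if demand <= usable else "SHORT"
    return f"demand {demand}Mbps vs usable {usable}Mbps -> {verdict}"

# Ten VMs peaking at 150Mbps each against two aggregated 1Gbps links.
print(storage_bandwidth_check([150] * 10, [1000, 1000], initiator_multipath=True))
# The same array with a software initiator that can't do multipath:
print(storage_bandwidth_check([150] * 10, [1000, 1000], initiator_multipath=False))
```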

All I/O connections, both network and storage, should be redundant, as a single point of failure will impact multiple VMs on a given host. So a substantial investment in networked storage may be required, depending on what you already have in place.

When evaluating arrays, it's important to look for features that improve VM storage scalability (such as thin provisioning) and also provide serverless backup options such as snapshots.

Planning Checklist

Step 1: Conduct a thorough analysis of the IT infrastructure that includes both technical and non-technical constraints. Leverage the collected data to assess virtualization platform and hardware feasibility for the consolidated environment.

Step 2: Verify hardware compatibility (server, storage, network) and software compatibility (operating systems, applications) and support for the prospective virtual environment.

Step 3: Re-architect existing business processes for compatibility with the virtualized environment before the consolidation project completes.

-C.W.

Network
Virtual networks allow multiple VMs to share physical network interfaces, so it's essential to ensure you're not violating any internal segmentation restrictions when architecting the virtual switch infrastructure. Some organizations still require physical isolation of security zones, which means that in some circumstances VMs will need to be isolated on separate physical networks.

As with storage, your consolidation analysis should provide detailed information on average and peak network I/O requirements for a given system. Keep in mind that if you plan to change your backup architecture following the migration (moving away from an earlier LAN-based backup, for instance), your network I/O requirements may be much lower, which may impact the number of physical network resources you need for the consolidation project.

You should also plan for plenty of network ports. Most organizations dedicate two ports as a network interface card (NIC) team for management and cluster heartbeat traffic, so as a good practice, you should not count the two onboard ports that come with a physical server as part of the consolidation analysis. Instead, plan to leverage the server's available PCIe expansion slots for dividing up both network and storage I/O. As with the management network, all production network ports should be teamed in order to provide both load balancing and failover support. If availability is a requirement, avoid hypervisors that don't support NIC teaming.
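A quick way to sanity-check port counts is to tally them per role, as in this Python sketch. The counts and team widths are hypothetical placeholders; adjust them for your own management, production and storage layout.

```python
def port_plan(production_networks, storage_paths, team_width=2, mgmt_team=2):
    """Tally physical ports per role for one virtualization host.

    Management/heartbeat gets its own team (the two onboard ports);
    each production network is teamed team_width wide; each storage
    path gets a dedicated port. All counts here are placeholders.
    """
    production_ports = production_networks * team_width
    return {
        "management (onboard)": mgmt_team,
        "production (PCIe)": production_ports,
        "storage (PCIe)": storage_paths,
        "PCIe ports required": production_ports + storage_paths,
    }

# Two production networks plus two redundant iSCSI paths.
print(port_plan(production_networks=2, storage_paths=2))
```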

When purchasing NICs, it's always a good idea to offload as much CPU work to the NIC as possible, so it's a good idea to purchase NICs with TCP offload engine (TOE) support. TOE NICs improve VM network performance and reduce the hypervisor's CPU tax resulting from processing TCP overhead.

Beware: Sales Pitches
It's always easy to take what a virtualization vendor says at face value, but experience has shown most of us that significant differences usually exist between a marketing checkbox and a true product feature. For instance, there's a huge difference between virtualization HA that relocates VMs to the next node in the cluster following a physical server failure and one that invokes a fan-out failover that equally distributes VMs across all remaining physical nodes in the cluster.

For some virtualization products, VM failover occurs solely by logical node order. So if VMs are running on node 1 and node 1 fails, every VM will try to start on node 2. The ones that can't start on node 2 will move to node 3, and so on. With that approach, failover can take significantly longer to complete. This is why you should evaluate virtualization solutions on a three-node cluster at a minimum: it lets you validate the failover behavior of the HA solution. Live migration behavior also varies by vendor, so it's equally important to validate how live migration will work on a production workload.
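The difference between the two failover behaviors is easy to visualize with a small Python sketch. This is a deliberately simplified model, not any vendor's actual placement logic: one function restarts VMs on surviving nodes strictly in order, the other fans them out evenly.

```python
def node_order_failover(vms, nodes, capacity):
    """Restart a failed node's VMs on surviving nodes strictly in order."""
    placement = {n: [] for n in nodes}
    for vm in vms:
        for n in nodes:  # always fill node 2 before spilling to node 3
            if len(placement[n]) < capacity:
                placement[n].append(vm)
                break
    return placement

def fan_out_failover(vms, nodes):
    """Distribute a failed node's VMs evenly across surviving nodes."""
    placement = {n: [] for n in nodes}
    for i, vm in enumerate(vms):
        placement[nodes[i % len(nodes)]].append(vm)
    return placement

failed_vms = [f"vm{i}" for i in range(1, 9)]  # 8 VMs stranded by node 1
survivors = ["node2", "node3"]
print(node_order_failover(failed_vms, survivors, capacity=6))
print(fan_out_failover(failed_vms, survivors))
```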

Virtualization solutions are not the same, so when sizing up hypervisor or OS virtualization platforms, keep the following questions in mind:

  • Operating system and application compatibility and support: Will my applications and OSes be supported on the virtualization platform?
  • Performance: Will the virtualization platform outperform competing platforms?
  • High availability (HA): What are the available HA and failover options?
  • Automation and management: How well does the virtualization platform integrate with my existing management infrastructure?
  • Hardware compatibility: Can I use any of my existing server, storage and network hardware with the new solution?

Vendors' histories of suspect, one-sided benchmarks have taught most of us to take benchmarks with a grain of salt, and virtualization benchmarks aren't much different. For example, benchmarks with unrealistic workloads (such as running only one VM per physical server) or unrealistic configurations (such as those that use RAID 0 back-end storage) should be viewed with extreme caution; you should not expect to see similar results in a real-world implementation consisting of multiple VMs per physical host and fault-tolerant back-end storage (for more on benchmarking considerations, see "Planning Primer: Benchmarking").

Hardware choice for the consolidated solution is critical as well. Virtualization deployments often involve the purchase of new server and networked storage platforms, and when it comes to virtualization, scalability should be your primary concern. Some blade solutions look really good for virtualization on the outside, but look a little closer and you may see that a particular blade chassis supports only 18 physical I/O ports. Consider a chassis with 14 blades and 10 VMs per blade, and you have a total of 140 VMs sharing 18 I/O ports. In most cases, the math just doesn't work.
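The back-of-the-envelope math from that example, in Python:

```python
def vms_per_io_port(blades, vms_per_blade, chassis_io_ports):
    """How many VMs end up contending for each physical I/O port."""
    return blades * vms_per_blade / chassis_io_ports

# The example above: 14 blades, 10 VMs each, 18 chassis I/O ports.
ratio = vms_per_io_port(blades=14, vms_per_blade=10, chassis_io_ports=18)
print(f"{ratio:.1f} VMs per I/O port")  # 7.8 VMs per I/O port
```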

Not all newer third-generation blade solutions have such chassis I/O restrictions, so by no means are blades always a bad choice for virtualization. Some organizations prefer to go with very large servers in order to reap the rewards of a high-consolidation density; however, you still need to consider the impact of physical system failure and failover time.

In some cases, it's better to go with a lower consolidation density on 2U servers and have quicker failover than to go with 4U servers and a prolonged failover. Of course, all of this falls into the proverbial "it depends" IT assessment, as I've successfully deployed blade-, 1U-, 2U- and 4U-based servers in virtualization projects. 1U servers may not make sense for enterprise workloads, but are often a good fit for a few VMs running in a branch office. Take a couple of 1U servers and Stratus Avance or Marathon FT, for example, and you have a very cost-effective branch office virtualization solution.

Don't Panic! But Fear the Shortcut
There are plenty of tools that make it extremely easy to convert a physical system to a VM, but just because it's easy to get virtualized doesn't mean that you should hurry up and do it. Instead, taking the time to properly plan around virtualization's technical and non-technical pitfalls is the safest way to ensure you'll successfully realize the benefits of virtualization without any ensuing panic.
