Troubleshooting Trouble -- Virtualization Review

Troubleshooting Trouble

Identifying problems in applications is hard now -- but it could get worse.

By Chris Wolf
03/01/2009

Many Burton Group clients have aggressive plans for virtualizing production applications in 2009. This is fueled by a number of factors: the economy, the emergence of hardware-assisted memory virtualization and the maturity of existing virtualization deployments.

Hardware-assisted memory virtualization (such as Intel Extended Page Tables [EPT] and AMD Rapid Virtualization Indexing [RVI]) has opened the door to virtualizing enterprise applications that stay within 10 percent of native performance, even under heavy load. Many organizations are at a point where they have virtualized development, test and non-critical applications; virtualizing production applications is the next logical step.

Virtualizing enterprise applications may lead to the addition of more supporting infrastructure, such as:

Physical storage virtualization appliances
Virtual storage virtualization appliances
Single- and multi-root I/O virtualization (SR-IOV and MR-IOV)
Virtual managed network switches (such as the Cisco Nexus 1000V)
Virtual security appliances

Virtualization is all about abstraction. If you're a user or an application, it doesn't matter where a device is physically located or what physical data path it uses. That's why the idea of the internal cloud sounds great. If you're the IT guy who has to troubleshoot a data path through the cloud's vapor, that's another story. Whether it's for application troubleshooting, compliance auditing or service-level assurance, knowing the physical data path and an application's physical dependencies is a requirement.

A Toolset Wish List
Sure, there are plenty of virtual infrastructure management tools that claim to offer some of what I'm looking for, but I haven't seen one that can give me everything. Here's what I'd like to have:

Full data path visibility between an app and its virtual machine's (VM's) storage logical unit numbers (LUNs).
Full data path visibility between an app and a client endpoint, such as a PC.
The ability to troubleshoot multi-tier application issues that traverse both virtual and physical infrastructures.
Integration with orchestration tools to automate remediation where appropriate, or the ability to recommend a manual remediation.

Depending on the current complexity of your virtual infrastructure, these problems may not be too difficult. However, if you mix in SR-IOV, MR-IOV and storage virtualization, your troubleshooting tools may need to see through four layers of abstraction (including the hypervisor).

To provide this visibility, all devices in the data path would need to communicate state information to the troubleshooting or management tool, or probes would need to be inserted at critical points in the data path. Use of probes is nothing new to Ethernet and Fibre Channel network troubleshooting, and it's sensible to leverage probes as a way to provide visibility to data paths through several virtualization layers. In addition, many multi-tier production applications still run on both virtual and physical systems, or on a mix of virtualization technologies such as VMware ESX hosts and Sun Solaris Containers.
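As a rough illustration of what "seeing through" the layers means, the sketch below models each abstraction layer as a lookup table that reports its own logical-to-underlying mapping, then walks the layers in order to recover the full path. The layer names, identifiers and the Layer and resolve_path helpers are all hypothetical; no real product exposes its state in exactly this form.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Layer:
    """One abstraction layer and the mapping it reports to the management tool."""
    name: str
    mapping: Dict[str, str]  # logical id -> id exposed by the next layer down


def resolve_path(start: str, layers: List[Layer]) -> List[str]:
    """Return every hop from the logical resource down to the physical one.

    Raises LookupError naming the layer that reported no mapping, which is
    the point where visibility into the data path is lost.
    """
    path = [start]
    current = start
    for layer in layers:
        if current not in layer.mapping:
            raise LookupError(f"{layer.name} reported no mapping for {current!r}")
        current = layer.mapping[current]
        path.append(current)
    return path


# Hypothetical inventory: every identifier below is made up for illustration.
layers = [
    Layer("hypervisor", {"vm-app01:disk0": "datastore-7"}),
    Layer("SR-IOV adapter", {"datastore-7": "vf-3"}),
    Layer("MR-IOV fabric", {"vf-3": "hba-port-2"}),
    Layer("storage virtualization", {"hba-port-2": "array-A:lun-12"}),
]

print(" -> ".join(resolve_path("vm-app01:disk0", layers)))
```

If any one layer fails to report its mapping, the walk stops at that layer, which is the case where a probe at that point in the data path would have to fill the gap.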

There are tools that offer a good degree of visibility today. For example, Akorri BalancePoint, Netuitive SI and Virtual Instruments NetWisdom are all effective at solving a number of problems.

Figure 1. The Akorri BalancePoint dashboard shows problems by application, server and storage target.

Figure 2. Netuitive SI showing the health status of a multi-tier Web service.

Figure 3. The NetWisdom dashboard, which shows I/O performance and probe status.

BalancePoint and NetWisdom provide great visibility into storage networks, but don't offer similar visibility for Ethernet network resources. SI is an interesting product with unique self-learning capabilities, and it provides outstanding insight into network-related application issues. However, it doesn't offer the storage depth of BalancePoint or NetWisdom.

I keep hearing all this talk about the day of a dynamic, self-provisioning, self-healing data center. Again, transparency may be good for app owners and users, but it's not good for the IT staff that has to manage the internal cloud. So what should you do?

For starters, review your internal troubleshooting processes and revise or develop new processes to deal with application and performance problems on virtualized infrastructures. As you architect new virtual infrastructures, it also becomes important to consider factors such as network or Fibre Channel probe placement, especially if you plan to leverage tools that can take advantage of probes.

What About Self-Healing?
Self-healing technology exists today; however, many tasks must be automated with scripts. For example, suppose a particular storage policy on an array is overloaded and VMs aren't properly balanced across storage LUNs in the SAN. It would be nice to have my storage performance-monitoring tool trigger an alert that causes an orchestration tool to relocate one or more VMs to new storage via replication from a storage virtualization appliance or VMware's Storage VMotion. Of course, to automate such a task, the orchestrator would have to be fully aware of any security or compliance restrictions that require VMs to be stored on specific LUNs.
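The decision an orchestrator would have to make in that scenario can be sketched in a few lines. This is only a minimal illustration, assuming the monitoring tool supplies per-LUN utilization and that compliance rules are available as a simple table of permitted LUNs per VM; the Lun class, pick_target_lun function, the 70 percent threshold and all names are hypothetical, and the actual relocation step is left to whatever tool performs the move.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Lun:
    name: str
    load_pct: float  # utilization as reported by the monitoring tool (assumed)


def pick_target_lun(
    vm: str,
    luns: List[Lun],
    allowed: Dict[str, Set[str]],
    max_load_pct: float = 70.0,
) -> Optional[str]:
    """Pick the least-loaded LUN the VM is permitted to live on.

    A VM with no entry in `allowed` is treated as unrestricted. Returns None
    when no permitted LUN is under the load threshold, in which case the
    orchestrator should recommend manual remediation instead of acting.
    """
    permitted = allowed.get(vm)
    candidates = [
        lun for lun in luns
        if lun.load_pct < max_load_pct
        and (permitted is None or lun.name in permitted)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda lun: lun.load_pct).name


# Hypothetical data for illustration only.
luns = [Lun("lun-01", 92.0), Lun("lun-02", 35.0), Lun("lun-03", 20.0)]
allowed = {"vm-finance": {"lun-01", "lun-02"}}  # compliance restriction

print(pick_target_lun("vm-finance", luns, allowed))  # restricted VM
print(pick_target_lun("vm-web", luns, allowed))      # unrestricted VM
```

The restricted VM is steered to a permitted LUN even though a less-loaded one exists, which is the kind of awareness the orchestrator needs before any move is automated.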

Finally, you need to hold vendors accountable. Application, storage and network awareness (between application tiers and to the client endpoint) are all traits of a good application performance monitoring and troubleshooting package. If a vendor's roadmap promises the world, get it in writing, and don't be afraid to highlight a particular vendor's weaknesses when you're negotiating a software contract. If automation is in your future, any performance-management tool will need to integrate with an orchestrator (such as HP Operations Orchestration, BMC Service Automation, Microsoft System Center Operations Manager or VMware Orchestrator). Anything can integrate via a script, so look for vendor integration modules that can save you both time and money.

I'm not trying to paint a scary picture. In fact, I'm a big proponent of virtualizing production workloads. Still, it's best to think about potential problems now and have solutions in place before a problem occurs. Developing troubleshooting processes on the fly in the middle of a major application outage or performance problem is never a good idea.

About the Author

Chris Wolf is VMware's CTO, Global Field and Industry.