5 Ways to Measure Virtualization Efficiency
Typical IT metrics don't translate well to measuring virtualization efficiency. You'll need to look at it from a different angle.
- By Andrew Hillier
Many organizations are well down the path of virtualization and are reaping the benefits that higher agility and VM density can provide. But even with this first wave of gains, many are starting to ask whether there is more efficiency to be had, and whether they are hitting the mark when it comes to hardware purchasing and VM density.
Unfortunately, this is a difficult question to answer, as there are many factors that determine how many VMs can safely be hosted in a given environment, and therefore how efficiently they can be hosted. To make matters worse, trying to determine this based purely on resource-specific measurements such as CPU utilization and I/O rates can actually take you in the wrong direction.
The reason for this is that individual workloads have needs and requirements that determine how they make use of resources, and the way they interact can be very complex. This means that not all capacity can be effectively used, in much the same way that a swimming pool cannot be completely filled with swimmers. Swimmer size, what activity they are doing, willingness to be close to one another, and pool rules all dictate how many swimmers can fit in.
In datacenter terms these factors form policies that govern how different types of workloads are hosted. These policies hold the key to unlocking higher efficiency. For example, the actual utilization of an environment, in CPU or other terms, is not directly reflective of how well it is being managed. Instead, it must be interpreted along with many other factors to determine how much infrastructure is needed to safely host the workloads in accordance with operational policies.
Determining this true infrastructure requirement is key, as it is the most direct path to optimizing efficiency. It is also the foundation of a valuable metric: the ratio of the minimum number of servers needed to the actual number of servers in use. Because it reflects all applicable policies and constraints, this ratio provides a measure of "fully-loaded utilization" -- the effective use of capacity in an IT environment.
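As a minimal sketch (the function name and figures here are illustrative, not from the article), the fully-loaded utilization metric is just the policy-aware minimum server count divided by the deployed server count:

```python
def fully_loaded_utilization(min_servers_needed: int, servers_in_use: int) -> float:
    """Ratio of the minimum servers required (after applying all hosting
    policies and constraints) to the servers actually deployed."""
    if servers_in_use <= 0:
        raise ValueError("servers_in_use must be positive")
    return min_servers_needed / servers_in_use

# A 20-host cluster whose workloads, with all policies applied, need only 12 hosts:
print(fully_loaded_utilization(12, 20))  # 0.6 -> 60% effective use of capacity
```

The hard part, of course, is computing `min_servers_needed` -- that is where the policies discussed below come in.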
Armed with this approach, it is possible to view the efficiency of clusters in a new light, and it becomes clear why relying on measurements such as CPU utilization is not wise. The top five reasons for this new approach are:
1. Cyclical Usage Patterns
Cycles can occur at many different time-scales, and can have a profound impact on hosting requirements. For environments with highly variable time-of-day loads, all available capacity may be required for a short period of time (such as start of day), but the average CPU utilization over an entire day may be very low. It is not uncommon to see cases where average CPU utilization is 5 percent or lower, but because all of the capacity is needed, the fully-loaded utilization is 90 to 100 percent. In this case, using average CPU utilization is not only inaccurate, it is actually a liability, as it makes IT groups look like they are doing a poor job when they are not.
One can argue that, in this case, peak CPU utilization would have been a better measure, but this is also a case where things may not be as they appear. If the environment is subject to seasonal variations, then it may be sized to support the busiest day of the year, and utilization may be low the rest of the time. In this case, peak CPU utilization will also look low, even though it is not possible to operate with fewer servers.
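The gap between the three measures can be made concrete with hypothetical numbers (the hourly demand profile below is invented for illustration): a cluster sized for a start-of-day spike shows a low average, yet no host can be removed.

```python
# Hypothetical hourly CPU demand, in host-equivalents, for a cluster of
# 10 hosts sized to cover a brief start-of-day spike.
hourly_demand = [0.3] * 8 + [9.5, 4.0, 2.0, 1.0] + [0.5] * 12  # 24 samples

hosts = 10
avg_util = sum(hourly_demand) / len(hourly_demand) / hosts   # ~0.10 (10%)
peak_util = max(hourly_demand) / hosts                       # 0.95

# The spike consumes nearly the whole cluster, so all 10 hosts are needed:
min_hosts_needed = 10
fully_loaded = min_hosts_needed / hosts                      # 1.0 (100%)
```

Average utilization near 10 percent suggests waste; the fully-loaded figure of 100 percent shows the cluster is actually right-sized for its daily cycle.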
2. High Availability and Disaster Recovery
Adding a level of resiliency to IT environments is one of the benefits of virtualization, but this too can impact efficiency. For "N+1" high availability (HA) models, the impact is reasonably clear, and the utilization of the environment will drop in proportion to the amount of capacity being held back for failures. But other scenarios start to completely obscure the link between utilization and efficiency. For example, if an environment is designed to survive failures at the cabinet level, then workloads must be placed in such a way that the failure of multiple hosts (those in the failed cabinet) will still allow the remaining hosts (in the surviving cabinets) to pick up the slack. This will force utilization levels even lower, further detaching them from actual efficiency. The same is true of disaster recovery planning, where policies may require an environment to have idle capacity in order to support failures. In this case, spare infrastructure is needed to safely run the business, even though it is not utilized.
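A rough sketch of how these policies cap achievable utilization (the N+1 and cabinet-failure formulas are simplified illustrations, not a vendor's admission-control algorithm):

```python
def usable_fraction_n_plus_1(hosts: int, spares: int = 1) -> float:
    """Fraction of cluster capacity usable when `spares` hosts' worth of
    capacity is held back for failover (an N+spares HA model)."""
    return (hosts - spares) / hosts

def usable_fraction_cabinet_failure(cabinets: int, hosts_per_cabinet: int) -> float:
    """Usable fraction when the cluster must survive the loss of an entire
    cabinet: the surviving cabinets must absorb a full cabinet's workloads."""
    total_hosts = cabinets * hosts_per_cabinet
    return (total_hosts - hosts_per_cabinet) / total_hosts

print(usable_fraction_n_plus_1(8))             # 0.875 -> utilization capped near 87%
print(usable_fraction_cabinet_failure(4, 8))   # 0.75  -> a quarter of capacity held in reserve
```

In both cases the "missing" utilization is not waste; it is the cost of the resiliency policy, and the fully-loaded view credits it accordingly.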
3. Resource Reservations
In environments where reservations are used to prevent workloads from competing for resources, the density of the environment may be limited by those reservations, and not the actual resource utilization. This is another form of policy, and will often cause a cluster to become full even though the resources of the cluster are not heavily utilized. For this reason, many organizations avoid reservations, as they interfere with the free flow of supply and demand, and can reduce the benefits of overcommit. But some types of reservations are not optional, and limits on things like virtual CPU oversubscription cannot be avoided. Also, in some cloud models it may be necessary to use reservations to ensure consumers get what they pay for, as capacity is often sold in fixed increments. In all cases, there is an indirect relationship between the amount of infrastructure needed and how it is actually utilized.
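To see how a vCPU oversubscription policy, rather than measured load, can declare a host full, consider this sketch (function name and figures are illustrative assumptions):

```python
def max_vms_by_vcpu_policy(physical_cores: int, vcpus_per_vm: int,
                           overcommit_ratio: float) -> int:
    """VMs a host can accept under a vCPU oversubscription policy,
    regardless of how idle the physical CPUs actually are."""
    return int(physical_cores * overcommit_ratio // vcpus_per_vm)

# A 32-core host, 4-vCPU VMs, and a 4:1 overcommit policy:
print(max_vms_by_vcpu_policy(32, 4, 4.0))  # 32 VMs and the host is "full",
# even if measured CPU utilization sits at, say, 20 percent.
```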
4. Affinities and Anti-Affinities
Some workloads should be kept apart, while others should ideally be kept together. There are many reasons for this, but whatever the reason, the net result will often have an impact on capacity. For example, if application-level clustering is in use, or if policies are in place to separate application tiers, or to disperse directory servers, then it will be necessary to ensure that certain VMs will always be hosted on different servers. The net result is that extra infrastructure capacity may be required in order to host the VMs in a way that adheres to these policies, even though they have no impact on utilization.
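The capacity effect of anti-affinity has a simple lower bound: every VM in a dispersed group needs its own host, so the largest group sets a floor on cluster size. A minimal sketch with hypothetical groups:

```python
def min_hosts_for_anti_affinity(group_sizes: list[int]) -> int:
    """Lower bound on host count: all VMs in one anti-affinity group must
    land on different hosts, so the largest group sets the floor."""
    return max(group_sizes, default=1)

# Three dispersed directory servers plus a 5-node application cluster,
# each group required to be kept apart:
print(min_hosts_for_anti_affinity([3, 5]))  # at least 5 hosts needed
```

Those five hosts may be nearly idle, yet none can be consolidated away -- another case where utilization and efficiency diverge.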
5. Growth and Demand Buffers
Planning for future growth is always prudent, and even virtual environments, with all their agility, are subject to lengthy hardware procurement processes. This makes it important to maintain a certain amount of "whitespace" in clusters in order to accommodate short-term growth. Cloud environments have the added challenge of supporting self-service models, and should also maintain a reasonable amount of "demand buffer" in order to service unanticipated user demands. Both of these will cause utilization levels to appear lower than what one may intuitively expect.
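Layering these buffers onto the base requirement can be sketched as follows (the buffer percentages are arbitrary placeholders, not recommendations):

```python
import math

def hosts_with_buffers(base_hosts_needed: int,
                       growth_buffer: float = 0.15,
                       demand_buffer: float = 0.10) -> int:
    """Hosts to deploy once short-term growth whitespace and a
    self-service demand buffer are layered on the base requirement."""
    return math.ceil(base_hosts_needed * (1 + growth_buffer + demand_buffer))

print(hosts_with_buffers(16))  # 20 hosts -> steady-state utilization reads ~80%
```

The resulting "low" utilization is intentional headroom, and a fully-loaded view would count those 16 base hosts plus the policy-mandated buffer as required capacity.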
The Right Way to Rightsize
Ensuring virtual and cloud environments are operating as efficiently as possible is a significant challenge, but the rewards can also be significant. Many assume that virtual environments are inherently efficient, but this is rarely the case: real-world environments often have twice the hardware they actually need, and the misleading nature of resource-level utilization metrics causes IT managers to err on the side of caution. By shifting away from these raw metrics toward more complete measures of efficiency, it is possible to right-size environments and further leverage the benefits of virtualization.
Andrew Hillier is CTO of CiRBA.