Solving the Resource Constraints of Bully VMs

September 17th, 2014 by

When large virtual databases, EMR applications, and ecommerce applications consume large amounts of resources from their hosts, they place a heavy burden on the entire virtual infrastructure.  Dealing with these Bully VMs and the neighboring virtual servers they affect is both an important problem and a very complex one.  Assuring performance for all virtual servers is necessary to drive business revenue and meet the expectations of end-users.  But accomplishing this is never easy, given the number of variables, the numerous relationships among service entities, and the number of decisions IT operations can take.

Let’s consider a customer use case with a US telecom company.  The pain they experienced was CPU congestion within the 10-host cluster dedicated to their multi-tiered applications.  MS SQL databases were eating a significant portion of the underlying hosts’ CPU processing power.  As a result, VMs sharing the same CPU supply experienced delays and could not get the CPU resources they needed when their demand increased.

To solve the contention in their infrastructure, the relationships between supply and demand must be evaluated.  If the CPU demand across neighboring VMs exceeds the host’s available CPU supply, contention occurs.  On the other hand, if the host’s available CPU supply far surpasses the CPU demand of its VMs, resources sit idle and the environment is inefficient.  Thus, the best method for solving CPU contention (other than throwing hardware at the problem) is to use resource relationships to find the equilibrium between supply and demand.
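The supply-and-demand comparison can be sketched in a few lines of Python. This is only an illustration of the idea, not VMTurbo's implementation: the `Host` class, the `classify` function, the 50% `low_watermark` threshold, and all of the MHz figures are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Host:
    name: str
    cpu_supply_mhz: float       # total CPU capacity the host can deliver
    vm_demand_mhz: List[float]  # current CPU demand of each VM on the host


def classify(host: Host, low_watermark: float = 0.5) -> str:
    """Label a host by comparing total VM demand with host supply.

    low_watermark is an assumed threshold below which the host is
    considered under-used; it is not a value from the article.
    """
    utilization = sum(host.vm_demand_mhz) / host.cpu_supply_mhz
    if utilization > 1.0:
        return "contention"    # demand exceeds supply
    if utilization < low_watermark:
        return "inefficient"   # supply far exceeds demand
    return "balanced"


# Hypothetical hosts with made-up demand figures.
hosts = [
    Host("host-a", 20000, [12000, 9000, 4000]),  # 25,000 MHz demanded
    Host("host-b", 20000, [3000, 2000]),         # 5,000 MHz demanded
    Host("host-c", 20000, [8000, 7000]),         # 15,000 MHz demanded
]
labels = {h.name: classify(h) for h in hosts}
print(labels)
```

A real control system would work from demand measured over time rather than a single snapshot, but the comparison at its core is the same.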

The first step in the solution is to examine the capacity and usage across every virtual machine and host in the cluster.  VMTurbo uses these resource relationships to quantify the exact amount of resources every virtual machine needs in real time.   This way, we can be sure that resource requirements are met without over-allocation.

In the scenario above, the bully guest’s high CPU usage and the over-allocation of neighboring VMs were causing high levels of CPU utilization and ready queuing.
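For readers who want to put a number on ready queuing: vSphere performance charts report CPU ready as milliseconds accumulated over a sampling interval, and that value is commonly converted to a percentage of the interval. The sketch below assumes the 20-second real-time chart interval and uses a hypothetical helper name, `cpu_ready_percent`; the sample values are invented.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0,
                      num_vcpus: int = 1) -> float:
    """Convert a CPU ready summation (ms) into a per-vCPU percentage.

    interval_s defaults to 20 s, the assumed real-time chart interval.
    Dividing by num_vcpus spreads a VM-level summation across its vCPUs.
    """
    return 100.0 * ready_ms / (interval_s * 1000.0 * num_vcpus)


# Hypothetical sample: 1,000 ms of ready time in a 20 s interval.
single_vcpu = cpu_ready_percent(1000)
# Hypothetical sample: 4,000 ms summed across a 4-vCPU VM.
four_vcpu = cpu_ready_percent(4000, num_vcpus=4)
print(single_vcpu, four_vcpu)
```

Both samples work out to 5% ready time per vCPU. What counts as "high" depends on the workload, so no threshold is hard-coded here.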

High 4CPU Rdy

Once our control system understands the resource requirements of the virtual machines and where the contention lies, VMTurbo uses a combination of sizing, placement, and capacity decisions simultaneously to drive the infrastructure to a healthy state.  In the customer’s case above, VMTurbo could size and place virtual servers across the 10 hosts in such a way that the customer did not need to procure more hardware.

If we consider our ability to migrate virtual machines, it is imperative that we aren’t just moving the problem around the cluster.  In other words, if we move the Bully VM to another host, will it cause contention on the destination server?  The best solution is to consider placement decisions holistically across the entire environment.  VMTurbo determines which workloads to place together so that the overall demand for resources matches the hosts’ supply.  In fact, VMTurbo will multiplex the peaks and troughs of CPU consumption to decrease the strain on the host.  Rather than co-locating workloads whose demands peak during the same timeframe, VMTurbo places together workloads whose peaks line up with each other’s troughs, so the net peak CPU demand on the host is drastically lower.
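The effect of matching peaks with troughs is easy to see with a small example. The sketch below is an assumption-laden illustration, not VMTurbo's placement logic: the four demand profiles, their MHz values, and the helper `peak_of_sum` are all hypothetical.

```python
from typing import List, Sequence


def peak_of_sum(profiles: Sequence[List[float]]) -> float:
    """Peak of the combined demand when the given workloads share a host."""
    return max(sum(samples) for samples in zip(*profiles))


# Hypothetical CPU demand (MHz) sampled at four points in a day.
day_heavy_a = [800, 3000, 3200, 900]
day_heavy_b = [700, 2800, 3100, 800]
night_batch_a = [3000, 600, 500, 2900]
night_batch_b = [2900, 700, 600, 3100]

# Placement 1: workloads that peak together share a host.
same_peak = max(
    peak_of_sum([day_heavy_a, day_heavy_b]),
    peak_of_sum([night_batch_a, night_batch_b]),
)

# Placement 2: each day-heavy workload is paired with a night batch job.
complementary = max(
    peak_of_sum([day_heavy_a, night_batch_a]),
    peak_of_sum([day_heavy_b, night_batch_b]),
)

print(same_peak, complementary)
```

With these made-up numbers, the worst host peak falls from 6,300 MHz to 3,900 MHz even though the total work is unchanged. A real placement engine also has to weigh memory, storage, and network at the same time.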

Bully VM - motion

In conjunction with the vMotions that VMTurbo suggested to the customer above, it provided numerous sizing recommendations.  Two kinds of sizing recommendations appeared on the dashboard of their appliance.  First, we suggested reducing vCPU reservations on over-allocated VMs to help alleviate the CPU congestion on the host:

Bully VM - resize CPU

Second, VMTurbo identified VMs whose allocated vCPU capacity was much greater than the MHz of processing power they needed.  Because a VM with many vCPUs needs a corresponding number of physical cores to be free before it can run, its ready time increases whenever those cores are not available.  VMTurbo identified key VMs to downsize in order to reduce ready queuing on the host:
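A rough way to spot such over-allocated VMs is to compare the vCPUs a VM has with the vCPUs its observed demand would need. The sketch below is a simplified assumption, not the calculation VMTurbo performs: `recommended_vcpus`, the 70% target utilization, and the sample figures are hypothetical.

```python
import math


def recommended_vcpus(peak_demand_mhz: float, core_mhz: float,
                      target_utilization: float = 0.7) -> int:
    """Smallest vCPU count that serves the peak demand at the target utilization.

    target_utilization is an assumed headroom factor, not a vendor default.
    """
    usable_mhz_per_vcpu = core_mhz * target_utilization
    return max(1, math.ceil(peak_demand_mhz / usable_mhz_per_vcpu))


# Hypothetical VM: 8 vCPUs allocated, peak demand of 2,100 MHz, 2,600 MHz cores.
allocated = 8
needed = recommended_vcpus(peak_demand_mhz=2100, core_mhz=2600)
if needed < allocated:
    print(f"Consider resizing from {allocated} to {needed} vCPUs")
```

In this made-up case the VM needs 2 vCPUs rather than 8. Freeing the other 6 means the scheduler has to find far fewer idle cores each time the VM needs to run.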

Bully VM - decrease vCPU

Overall, VMTurbo’s solution used placement and sizing decisions to assure that all workload demands were met.  Keep in mind that this post focused primarily on CPU constraints; just imagine the complexity if we extend this method to all the other resources, such as RAM, IO, network, disk space, and IOPS.  With a thorough understanding of the relationships between big and small virtual servers, the underlying hosts, the datastores, and the hosted applications, VMTurbo can determine the best VMs to move and size so that ALL resource demands along the supply chain are met.

Bully VM - SC

Let VMTurbo take the heavy lifting, guesswork, and troubleshooting out of the equation, and leverage real-time decision automation to guarantee that all service entities get the resources they need without being inefficient.  Stop prioritizing applications and resources so that only business-critical applications and bully VMs are granted access to compute and storage resources.  VMTurbo can make all virtual servers happy and, more importantly, keep your end-users happy.

