Nearly every IT Operations group leverages the ability to move VMs among their compute nodes multiple times in a day. Despite the differences in the naming convention (e.g. vMotion, live migration, XenMotion, etc.) we leverage this simple technique to fix performance problems or inefficiencies within our virtual infrastructure. Or at least we attempt to…
Image source: VMware vSphere vMotion
We often oversimplify migration decisions because they are quick and have a low impact on the virtual environment, or at least that’s what we believe. Despite such ease, sysadmins unknowingly make numerous tradeoffs in the process. If you move a VM, which server do you move it to? How will that VM play with its new neighbors? What happens to the levels of resource usage on the host? Are you still being efficient with your resources?
I was recently working with a customer who described how complex it had become to move VMs around his environment. His trouble was that no matter how he moved his VMs around his production cluster, he couldn’t fix his performance problems. His main constraint was ready queue on the hosts, and when he manually moved a VM he would eliminate ready queue on one host but simply create it on the destination server. Or he would move the highest consumer of CPU from one host to another and create dangerous levels of ballooning on the destination node. Essentially, the customer could not solve his problem; he would just move it around. His production cluster had 19 physical servers with roughly 200 VMs.
The image above depicts the customer’s environment before taking the prescribed actions from VMTurbo. Several hosts are experiencing ready queue.
I told the customer to think about this mathematically for a second. With 200 VMs and 19 hosts, there are well over 100,000,000 different possible combinations of VMs across his physical servers. In fact, if any VM can run on any host, the count is 19^200, a number with 256 digits. Even if he knew the one perfect combination of VMs (the desired state where application performance is guaranteed while you are utilizing the environment as efficiently as possible), what happens 3 hours from now when workload demand has shifted?
That desired state has already shifted. It is impossible for a human to keep track of all this data and understand how to move their VMs without making tradeoffs.
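To put a number on the combinatorics, here is a quick back-of-the-envelope sketch in Python. It assumes every VM can be placed on every host and ignores capacity and affinity rules, so it is an upper bound on the search space rather than a count of valid placements. The variable names are ours, purely for illustration.

```python
# Each of the 200 VMs can independently land on any of the 19 hosts,
# so the number of possible placements is 19 multiplied by itself 200 times.
num_vms = 200
num_hosts = 19

placements = num_hosts ** num_vms   # Python integers have arbitrary precision
digits = len(str(placements))

print(f"{num_hosts}^{num_vms} has {digits} digits")
# prints "19^200 has 256 digits"
```

Even after discarding every placement that violates a capacity limit, the remaining space is far too large to search by hand, and it has to be searched again every time demand shifts.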
Within 30 minutes of deployment, VMTurbo provided 24 migration decisions within this cluster to drive it to its desired state.
At the foundation of VMTurbo lies a common data model that captures the complex relationships between supply and demand. Our system provides executable decisions with an understanding of how everything in a virtual environment is connected: when you migrate a VM, it has a ripple effect throughout the entire environment. With this model, VMTurbo determines which VMs to move while navigating the tradeoffs between ready queue and RAM usage, between host IO and network traffic, and between CPU usage and ballooning, without sacrificing either performance or efficiency.
The customer began to take our actions through the VMTurbo system and, to his amazement, watched ready queue decrease across the entire cluster. More importantly, VMTurbo’s decisions did not cause resource congestion on any of the destination servers. VMTurbo solved the performance problems and did not create other bottlenecks in the process.
A week later the customer was leveraging our system in automation for VM placement decisions. He was pleased to note that VMTurbo was able to keep the compute environment healthy even during peak workload hours. He had almost no ready queue within the cluster and had actually seen a noticeable increase in density across two of his servers without seeing performance problems.
The image above depicts the customer environment with VMTurbo in automation mode. Ready queue issues have been eliminated and the same amount of workload can be supported with two fewer hosts.
VMTurbo’s placement engine understands the application demand for compute resources within a cluster and the supply of resources available for consumption. By matching the demands of your VMs with the underlying supply of resources, our engine can drastically reduce the competition for resources and manage the tradeoffs typically made when solving for performance constraints.
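To make the idea of matching demand to supply across more than one resource concrete, here is a minimal Python sketch. It is not VMTurbo's engine or algorithm. It is a toy illustration under our own assumptions: two resources only (CPU and memory), made-up capacity figures, and hypothetical names such as `Host`, `VM`, and `pick_destination`.

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    cpu_capacity: float  # GHz (illustrative units)
    mem_capacity: float  # GB
    cpu_used: float
    mem_used: float

@dataclass
class VM:
    name: str
    cpu_demand: float
    mem_demand: float

def worst_utilization_after_move(host, vm):
    """Highest utilization across CPU and memory if the VM lands on this host."""
    cpu = (host.cpu_used + vm.cpu_demand) / host.cpu_capacity
    mem = (host.mem_used + vm.mem_demand) / host.mem_capacity
    return max(cpu, mem)

def pick_destination(vm, candidates):
    """Pick the host whose most-constrained resource stays lowest after the move.

    Hosts that would exceed capacity on any resource are rejected outright.
    Returns None if no candidate can take the VM.
    """
    scored = [(worst_utilization_after_move(h, vm), h) for h in candidates]
    feasible = [(u, h) for u, h in scored if u <= 1.0]
    if not feasible:
        return None
    return min(feasible, key=lambda pair: pair[0])[1]

hosts = [
    # Plenty of idle CPU, but memory is nearly full.
    Host("host-a", 40.0, 256.0, cpu_used=10.0, mem_used=240.0),
    # Busier CPU, but room on both resources.
    Host("host-b", 40.0, 256.0, cpu_used=24.0, mem_used=128.0),
]
vm = VM("vm-1", cpu_demand=8.0, mem_demand=32.0)

best = pick_destination(vm, hosts)
print(best.name)
```

A rule that looked at CPU alone would choose host-a because it has the most idle CPU, and the move would overcommit memory there. Scoring every resource together is what avoids trading one bottleneck for another, which is the kind of tradeoff the customer kept running into by hand.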
By making decisions based on these relationships and not on thresholds, VMTurbo can drive the environment to a state where performance failures are far less likely. After all, how do you know how hot you can run each of your resources without seeing an impact on performance? Or will you wait for an arbitrary threshold to be crossed and react to an alert?