Swap State: how fast can you isolate the problem and fix it?

August 29th, 2014 by

It is well known that if your virtual machines are swapping memory pages to disk, their applications will experience heavy delays. Even worse, swap state is one of the most challenging issues to solve in a virtual infrastructure. Identifying the source of the problem is extremely difficult, and deciding on the best course of action is even harder. The denser your hosts and datastores become with added workloads, the harder it is to prevent performance issues.

Last week I was looking at a legal practice’s main production cluster that had critical resource management issues. The specific constraint was memory allocation: RAM contention was causing excessive ballooning and swapping. Adding to the complexity, the environment had two production clusters with six hosts each, one in Toronto and the other in the UK. Each cluster acted as the DR site for the other, so their HA policies forced very low RAM and CPU utilization thresholds. As a result, they were running DRS fully automated and aggressively tuned to shed load and keep memory and CPU just below 40% and 10% respectively. All of their attention was focused on simply balancing their compute clusters, which distracted them from their NFS storage environment and virtual machines. In fact, the company openly admitted that they did not look at the VM layer for swapping because it was too much data to analyze across their expansive set of monitoring tools.

A couple of ESX hosts were swapping memory pages out to disk with no knowledge of which guest physical pages were in use. The balloon driver in a guest OS can only determine which guest physical pages can be pinned and released to the hypervisor; it cannot tell the hypervisor which of the remaining pages are active. As a result, active memory pages were swapped to and from disk, forcing a handful of virtual machines either to wait until the hypervisor swapped the pages back into physical memory or to read directly from disk. Both scenarios caused unacceptable levels of swap latency, and end users suffered.
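To make the "too much data" problem concrete, here is a minimal sketch of how per-VM memory samples could be narrowed down to the machines that are actually being swapped. The record type, field names, VM names, and numbers are all hypothetical; in practice they would be filled from whatever the monitoring tools export.

```python
from dataclasses import dataclass

@dataclass
class VmMemorySample:
    # Hypothetical per-VM record; field names are illustrative only.
    name: str
    ballooned_mb: float   # memory currently reclaimed by the balloon driver
    swapped_mb: float     # memory currently swapped out by the hypervisor
    swap_in_kbps: float   # rate at which pages are being read back from swap

def rank_swapping_vms(samples):
    """Return VMs with any hypervisor swap activity, worst first."""
    affected = [s for s in samples if s.swapped_mb > 0 or s.swap_in_kbps > 0]
    # Active swap-in hurts most, so sort on it first, then on swapped size.
    return sorted(affected,
                  key=lambda s: (s.swap_in_kbps, s.swapped_mb),
                  reverse=True)

# Made-up sample data for illustration.
samples = [
    VmMemorySample("doc-mgmt-01", ballooned_mb=512, swapped_mb=0, swap_in_kbps=0),
    VmMemorySample("sql-01", ballooned_mb=2048, swapped_mb=1024, swap_in_kbps=350),
    VmMemorySample("file-02", ballooned_mb=0, swapped_mb=256, swap_in_kbps=0),
    VmMemorySample("web-03", ballooned_mb=0, swapped_mb=0, swap_in_kbps=0),
]

ranked = rank_swapping_vms(samples)
for vm in ranked:
    print(vm.name, vm.swapped_mb, vm.swap_in_kbps)
```

A VM that is only ballooned is left out of the list on purpose: ballooning is a warning sign, while hypervisor swapping is what produces the latency described above.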


In this particular case, VMTurbo showed only one recommendation: a vMotion between two hosts due to memory. The client was intrigued as to why VMTurbo suggested only one vMotion when the environment was so stressed for memory. The simple answer is that VMTurbo is a decision engine able to leverage every control knob at each layer of the virtual stack. Placement decisions alone cannot solve excessive ballooning and swapping; sizing and capacity decisions are necessary to fully restore health to the entire environment.

Sure enough, the customer’s VMTurbo instance had numerous executable recommendations to downsize vMem reservations and capacity in order to eliminate swapping, deflate the balloons, and alleviate RAM contention. Not only did we save them time by removing the troubleshooting and decision-making process, but we were also able to create a change restriction window for the following weekend, directly through the instance, for all sizing decisions in the cluster. VMTurbo’s restriction windows allow users to specify a date and timeframe in which to take resize actions, manually or automatically, in line with their business’s change control windows or during a period when production workloads will not be impacted by a guest shutdown.
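The idea behind a restriction window can be sketched in a few lines. This is not VMTurbo's implementation or API; the window dates, VM names, sizes, and function names below are hypothetical and serve only to show how resize actions can be held until an agreed timeframe opens.

```python
from datetime import datetime

# Hypothetical change window agreed with the business (illustrative dates).
WINDOW_START = datetime(2014, 9, 6, 22, 0)
WINDOW_END = datetime(2014, 9, 7, 4, 0)

# Hypothetical pending resize actions: (VM name, new memory size in GB).
pending_resizes = [("sql-01", 8), ("file-02", 4)]

def in_change_window(now, start=WINDOW_START, end=WINDOW_END):
    """True when `now` falls inside the half-open window [start, end)."""
    return start <= now < end

def actions_to_run(now, pending):
    """Release queued resize actions only while the window is open."""
    return list(pending) if in_change_window(now) else []

# Outside the window nothing is released; inside it, everything queued is.
print(actions_to_run(datetime(2014, 9, 5, 14, 0), pending_resizes))
print(actions_to_run(datetime(2014, 9, 6, 23, 30), pending_resizes))
```

Because a memory downsize may require a guest shutdown, holding these actions in a queue is what lets them line up with change control instead of interrupting production.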

[Figure: memory congestion alerts]

More importantly, now that the customer will be leveraging placement decisions alongside sizing actions for certain virtual machines, VMTurbo can prevent swap state from recurring through preventive actions. The moment pressure and contention begin to build for any resource within the infrastructure, VMTurbo will take action to ensure that potential threats to QoS do not evolve into performance issues. VMTurbo’s ability to place the right workloads together and multiplex across peaks in memory, CPU, IO, network IO, etc. prevents contention on your physical servers. All resources and data points are trended over time and analyzed alongside the environment’s real-time data to intelligently move and size workloads and assure performance and application uptime.
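A small worked example shows why multiplexing across peaks matters. The two demand series below are invented, and the helper functions are an illustrative sketch rather than VMTurbo's algorithm: when workloads peak at different times, the host needs far less memory than the sum of their individual peaks.

```python
def sum_of_peaks(*series):
    """Capacity needed if every workload is sized for its own peak."""
    return sum(max(s) for s in series)

def peak_of_sum(*series):
    """Capacity needed when the workloads share one host over time."""
    return max(sum(values) for values in zip(*series))

# Hypothetical memory demand in GB over six equal time slots.
office_gb = [7, 8, 6, 2, 2, 3]   # busy during working hours
batch_gb = [2, 2, 3, 8, 8, 3]    # busy overnight

print(sum_of_peaks(office_gb, batch_gb))  # sizing each for its own peak
print(peak_of_sum(office_gb, batch_gb))   # actual combined peak on one host
```

With these numbers the separate peaks add up to 16 GB, while the combined demand never exceeds 10 GB. Placing two workloads that peak together on the same host would instead push the combined peak toward the full 16 GB, which is where contention starts.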

