Waiting on the dreaded CPU ready queue

October 13th, 2014

There I was… just me, my oversized monitor, and the dreaded queue. This wasn’t the type of queue that one may typically encounter on an early Monday morning at their local Starbucks, or even the type of queue that I experienced last weekend while waiting to board the Superman ride at Six Flags New England. No, this type of queue was far worse.


The “queue” that I am referring to, of course, is CPU Ready, or as VMware so eloquently puts it: “the time a virtual machine must wait in a ready-to-run state before it can be scheduled on a CPU.”
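
If you want to put a number on that wait, the conversion from the raw counter to the percentage you see in esxtop is straightforward. Below is a minimal sketch in Python; it assumes you have already pulled a cpu.ready.summation sample (reported in milliseconds) from vCenter’s real-time performance charts, which use a 20-second collection interval.

```python
# Minimal sketch: convert a raw cpu.ready.summation sample (milliseconds of
# ready time accumulated over one sampling interval) into a %RDY percentage.
# The 20-second window is vCenter's real-time collection interval; pass the
# appropriate interval when working with rolled-up historical data.

def cpu_ready_percent(ready_ms: float, interval_s: int = 20) -> float:
    """Share of the sampling interval the VM spent ready to run but unscheduled."""
    return (ready_ms / (interval_s * 1000.0)) * 100.0


if __name__ == "__main__":
    # 1,600 ms of ready time in a 20 s window works out to 8 %RDY, well past
    # the ~5% per-vCPU level many administrators treat as a warning sign.
    print(f"{cpu_ready_percent(1600):.1f} %RDY")
```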

While recently working with a customer suffering from abnormal levels of CPU Ready in a few of their clusters, I found myself in the midst of a conversation that we are all too familiar with in this industry. You know the drill: application owner X/developer Y requests a virtual machine with specifications of astronomical proportions. See, they’ve been down this whole virtualization route before at previous organizations. They know that the easiest path to optimal performance on their given platform is to request (as one of my colleagues describes it) “the biggest, baddest virtual machine.”

A seasoned VMware engineer himself, our customer shared that he’s tried explaining to his team numerous times that although on the surface it may seem logical that a larger virtual machine will lead to superior performance, it’s not always that simple. In requesting virtual machines with ever-larger specifications to assure the performance of their workloads, they might be indirectly degrading the performance of those very same virtual machines! C’est la vie. And so, the rightsizing conundrum continued…


But it’s not as simple as just rightsizing, is it? I’d venture to guess that the majority of individuals who manage virtual infrastructures know that the most obvious solution to eliminating excessive %RDY is to properly size their virtual machines, but sometimes the most obvious solution isn’t necessarily the easiest.
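
To make “properly size” a bit more concrete, here is a toy check in Python. It is purely illustrative: the per-core clock speed and headroom factor are assumptions I’ve picked for the example, not recommendations, and it only looks at average CPU demand.

```python
# Toy rightsizing check, illustrative only: flag a VM as oversized when its
# configured vCPU count exceeds, by a generous headroom factor, the number of
# cores its average observed demand actually needs.

def is_oversized(configured_vcpus: int,
                 avg_demand_mhz: float,
                 per_core_mhz: float = 2600.0,    # assumed host core speed
                 headroom: float = 2.0) -> bool:  # assumed tolerance factor
    """Return True when observed demand would comfortably fit in far fewer vCPUs."""
    cores_needed = max(avg_demand_mhz / per_core_mhz, 1.0)
    return configured_vcpus > headroom * cores_needed


if __name__ == "__main__":
    # A 16-vCPU VM averaging 6,000 MHz of demand on 2.6 GHz cores needs roughly
    # 2.3 cores, so it gets flagged; an 8-vCPU VM at 12,000 MHz does not.
    print(is_oversized(16, 6000), is_oversized(8, 12000))
```

A static check like this ignores peaks, co-scheduling, and where the VM actually lands, which is exactly why sizing on its own rarely settles the argument.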

Perhaps a more logical way to approach this challenge would go something like this: begin with the characteristics and demands of the virtual machines, assess the underlying supply from the infrastructure, and determine where these virtual machines should be placed to get access to the resources they require. Of course, we’d also have to assess how these virtual machines should be sized, but only in the context of placement and overall capacity. It makes sense, but it doesn’t sound exactly easy.

Think about it for a minute. Even if I were to dive into one single virtual machine and look at one possible resource decision, it becomes complex. If I decide to move a virtual machine, where do I move it? When do I move it? And when I do decide to move it, what impact will that have on the destination ESX host? What impact will that have on all of the virtual machines already fighting for resources on that destination ESX host? It’s easy to see how the complexity behind this one single decision can quickly spiral out of control, and we haven’t even tackled overall capacity or the sizing of the virtual machines yet. That’s because in a shared-resource environment (i.e. virtualization) it’s not humanly possible to weigh every trade-off by hand.
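
To give a feel for how quickly that spirals, here is a small sketch of just the first question: what would one proposed move do to the destination host? The data structures and numbers are hypothetical, and this is nothing like a real scheduler; it exists only to show that even the simplest projection has to account for everything already running on the target.

```python
# Hypothetical model, not a real scheduler: project the CPU pressure a
# destination ESX host would be under if one more VM were moved onto it.

from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VM:
    name: str
    vcpus: int
    cpu_demand_mhz: float        # average observed demand, not configured size


@dataclass
class Host:
    name: str
    pcpus: int
    cpu_capacity_mhz: float
    vms: List[VM] = field(default_factory=list)


def projected_pressure(dest: Host, incoming: VM) -> Dict[str, float]:
    """Estimate contention on the destination host if the move goes ahead."""
    total_vcpus = sum(v.vcpus for v in dest.vms) + incoming.vcpus
    total_demand = sum(v.cpu_demand_mhz for v in dest.vms) + incoming.cpu_demand_mhz
    return {
        "vcpu_to_pcpu_ratio": total_vcpus / dest.pcpus,
        "cpu_utilization": total_demand / dest.cpu_capacity_mhz,
    }


if __name__ == "__main__":
    host = Host("esx-07", pcpus=16, cpu_capacity_mhz=41600.0,
                vms=[VM("db01", 8, 12000.0), VM("app03", 4, 6000.0)])
    print(projected_pressure(host, VM("web09", 4, 5000.0)))
    # Roughly 1:1 vCPU:pCPU and ~55% CPU utilization; and that is one move,
    # one host, one resource, and one point in time.
```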

But what if there were a way to understand every implication of any resource decision we are about to execute, before we execute it? Imagine a workload management system for the virtual environment that interprets the demand of all the virtual machines, decides how each of them should be placed and sized to match their needs with the underlying infrastructure supply, and recommends exactly when to acquire additional capacity should the equilibrium of supply and demand not be achievable. That sounds like true control.


I would imagine that this type of control system would help our customer from earlier alleviate their pesky CPU Ready challenges. I also don’t think it’s out of the question to assume that the increased performance resulting from this control would help generate trust in operations’ (or the control system’s) recommendations around resource requirements.

Now take this a step further and imagine a scenario where application owners are granted access to see their specific virtual machines and exactly what they are consuming from a resource standpoint.


By allowing members of other teams to gain an understanding of how their resources are being utilized, we can effectively bridge the gap between development and operations while still leveraging our control system on the back end to maintain the environment in a desired state: one in which performance is assured and the infrastructure is utilized as efficiently as possible.


This article is about performance. Read more like it in the Performance, Efficiency, Agility series.

See Operations Manager In Action
