What are alerts for?

August 29th, 2014 by

A prospect recently asked us if we could help them troubleshoot faster. Their estate of ~70 hosts and ~500 VMs was throwing hundreds of alerts in vCenter and Foglight. The team spent significant time troubleshooting to decide what remediation to take.

Their list of requests/questions included:

  • Track disk latency
  • Track kernel command latency (KAVG)
  • Track CPU Ready Time
  • Track vMotions
  • Compare utilization over a time period
  • Etc.

What data do you track that will help us troubleshoot faster?

Why do you need more data about your environment?

We’ve learned over time which alerts to pay attention to. But those alerts take time to investigate. If VMTurbo could give us smarter alerts, or give us better data, then we could restore service faster.

What if we could control your environment and prevent the degradations from occurring in the first place?

The prospect was surprised by our answer. But this is what VMTurbo does. It provides specific actions that (1) remediate existing issues, (2) preventatively reduce risk, and (3) safely drive higher resource utilization. When automated, these actions prevent performance issues.

We hate alerts too. The rhetorical question is, “Do you care about performance?” (Everyone cares about performance.) So why are you waiting for alerts to tell you when you’re not performant?

Not all alerts are useful, and they can often distract you from solving the real problem. Furthermore, with all of the moving pieces, dependencies and relationships in today’s datacenter, how do you know which alerts to ignore and which ones to pay attention to?


Let’s look at an example that came up in our discussion with the same prospect. They had been getting alerts for one of their SQL VMs: its average vMem utilization was ~27%, but its peak vMem utilization over the last 10 days was ~89%. This database was backing a key application for their business, and they had set up a threshold alert in vSphere that triggered whenever vMem utilization crossed 80%. “What should we do about these alerts?” Nothing.

VMTurbo’s Operations Manager discovers the complete VMware topology, VMs, hosts and associated datastores. It recommends (and, with full automation, will automatically execute) actions to assure workload performance while utilizing the environment as efficiently as possible.

VMTurbo’s market-based approach models VMs as buyers of virtual resources (compute, CPU, ready queue, latency, IOPS, network, etc.). Hosts and datastores sell these resources to the VMs. The price of each resource depends on how heavily it is utilized: the scarcer the resource, the more expensive it is. The VMs shop for the best deal, and will move to another host or datastore if they can get their resources more cheaply there.

As the SQL server consumes more vMem, the host increases the price of vMem for all VMs running on it. The SQL VM (or any of its neighbors) may shop for other hosts. But at the same time, they are also consuming CPU, storage, and network bandwidth, and they must get a good deal for the total basket. Don’t do anything about the alerts, because the VM will move if it needs to.
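To make the "total basket" idea concrete, here is a minimal sketch in Python. It is not VMTurbo's actual pricing model: the `price` curve, the `Host` class, the `cheapest_host` helper, and all capacity numbers are hypothetical, chosen only to show how a VM might tolerate expensive memory until the whole basket becomes cheaper elsewhere.

```python
from dataclasses import dataclass


def price(utilization: float) -> float:
    """Hypothetical pricing curve: cost rises steeply as a resource nears saturation."""
    u = min(max(utilization, 0.0), 0.99)  # clamp so the price stays finite
    return 1.0 / (1.0 - u) ** 2


@dataclass
class Host:
    name: str
    capacity: dict  # resource -> total capacity
    used: dict      # resource -> amount consumed by the other VMs on this host

    def quote(self, demand: dict) -> float:
        """Price of the VM's whole basket if it ran on this host."""
        total = 0.0
        for resource, amount in demand.items():
            utilization = (self.used[resource] + amount) / self.capacity[resource]
            total += price(utilization)
        return total


def cheapest_host(demand: dict, hosts: list) -> Host:
    """The VM 'shops' by picking the host with the lowest basket price."""
    return min(hosts, key=lambda h: h.quote(demand))


# Illustrative numbers only: host-a is tight on memory, host-b is tight on CPU.
hosts = [
    Host("host-a", capacity={"mem": 100, "cpu": 100}, used={"mem": 70, "cpu": 20}),
    Host("host-b", capacity={"mem": 100, "cpu": 100}, used={"mem": 30, "cpu": 85}),
]

# Memory on host-a is pricey, but the basket is still cheaper there than on host-b.
before = cheapest_host({"mem": 20, "cpu": 10}, hosts)

# Once the VM's memory demand grows, the basket becomes cheaper on host-b.
after = cheapest_host({"mem": 28, "cpu": 10}, hosts)

print(before.name, after.name)
```

With these made-up numbers, the VM stays on host-a despite high memory utilization, because CPU on host-b would cost even more. It only moves once its growing memory demand makes the overall basket cheaper on host-b, which is the behavior a static per-metric alert cannot capture.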

Sometimes the best thing to do is to ignore these alerts (or maybe just turn them off) and go worry about more important questions.


This article is about performance. Read more like it at the [Performance, Efficiency, Agility] series.
