Managing Virtualization: Why Thresholds Are Always Bad [Part 1]

January 27th, 2015

Since the advent of SNMP and automated data collection, mankind has been comparing the last polled value to a threshold.  And while this behavior is understandable, and even laudable (compared with doing nothing), this approach is bad from both a practical and a philosophical perspective.

Let’s start with the practical: when we set about the task of monitoring our environment, we set our thresholds to some value which, when crossed, will cause an alert to be generated.  We will look at that alert and then (perhaps) take some action.  What we are hoping for is to receive an alert every time something happens for which we should take action, and that every alert tells us something we would want to act upon.  The trick is in setting the value of the threshold: there are really only three approaches to figuring out what that value should be, and they are all bad.
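To make the pattern concrete, here is a minimal sketch (in Python, with hypothetical function names, not code from any particular monitoring tool) of exactly what the paragraph above describes: poll a value, compare the latest sample to a fixed threshold, and send an alert when it is crossed.

```python
# Minimal sketch of classic threshold-based alerting (hypothetical names).
# Poll a metric, compare the latest value to a fixed threshold, alert on breach.

THRESHOLD = 0.85  # e.g. 85% disk utilization

def check_threshold(latest_value: float, threshold: float = THRESHOLD) -> bool:
    """Return True if the last polled value crosses the threshold."""
    return latest_value >= threshold

def poll_and_alert(poll, send_alert, threshold: float = THRESHOLD) -> None:
    """poll() returns the latest sample; send_alert() delivers the notification."""
    value = poll()
    if check_threshold(value, threshold):
        send_alert(f"Utilization at {value:.0%} exceeds {threshold:.0%}")

# Example usage with stand-in callables:
if __name__ == "__main__":
    poll_and_alert(poll=lambda: 0.91, send_alert=print)
```

Everything that follows is about the one number in that sketch: where do you set THRESHOLD?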

[Image: Managing Virtualization Without Thresholds]

The first approach is to set the thresholds LOW, somewhere below the level of urgency.  The idea here is to get advance warning of our problem.  But when we do this, we get a ton of alerts.  Like the time I was doing an assessment of the network management systems for a large federal agency.  I was talking with the network engineers, and they told me how their Exchange administrator got upset with them one day because one of his disks filled up, causing Exchange to crash.  “Why didn’t I get an alert before this happened?!” he wanted to know.  More importantly, he never wanted one of his disks to fill up again without his being notified.  So, they agreed upon a threshold of 85% utilization.

Bad move.

As the Exchange administrator’s inbox began to fill up (every day) with alerts about every Exchange server disk that sat at some fixed level of utilization above 85%, peaked above 85% for a single poll and then dropped back down, or was growing at 0.5%/year and had just crept past 85%, our intrepid hero, being an EMAIL EXPERT, wrote an(other) inbox rule to put this flood of useless information into a folder where it could be “reviewed later” (read: systematically ignored).
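To see why the 85% rule floods the inbox, consider a sketch with assumed sample data (illustrative only, not measurements from the post): three very different disks all trip the same check. One spikes above 85% for a single poll, one sits just above 85% and is barely growing, and one is genuinely filling up. A per-poll threshold cannot tell them apart.

```python
# Illustrative only: three disks with very different behavior all breach 85%.
THRESHOLD = 0.85

samples = {
    "transient-spike": [0.70, 0.72, 0.88, 0.71, 0.70],       # one-poll blip
    "slow-growth":     [0.858, 0.858, 0.859, 0.859, 0.860],  # creeping along
    "filling-fast":    [0.60, 0.70, 0.80, 0.90, 0.97],       # the real emergency
}

for disk, series in samples.items():
    alerts = sum(1 for v in series if v >= THRESHOLD)
    print(f"{disk}: {alerts} alert(s) out of {len(series)} polls")

# All three disks generate alerts, but only one needs action right now;
# the per-poll threshold gives the administrator no way to tell which.
```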

Needless to say, Exchange crashed again, but this time, he had the alert.

Setting thresholds low always results in a sea of noise.  Picking out the important alerts is time-consuming, and so most people don’t do it because they’re busy picking the Exchange server up off the floor (or whatever other really urgent fire needs putting out right now).

So then our hero gets wise and sets the thresholds HIGH.  This is the second approach.  The idea here is that we are simply not going to produce the noise.  Only ACTIONABLE alerts will reach the console.  The problem now is that we have no advance warning.  It is like being the captain of the Titanic:

“Captain!  Iceberg dead ahead!  We’ll hit it in 5 minutes”

“Hard to port, helmsman!”

“But Captain, it will take us 20 minutes to turn enough to avoid hitting!!!!”

“Well, at least we know what the problem is…”

Friends, if you’re the captain of the Titanic with your thresholds set high, you’re going down.  That’s not hyperbole.  That’s the plain fact, and you know it.  You may not suffer total loss, but the high-threshold approach guarantees downtime.
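The Titanic problem is simple arithmetic: a high threshold only helps if the warning arrives earlier than the time you need to react. A rough sketch with illustrative numbers (the threshold, growth rate, and remediation time below are assumptions, not figures from the post):

```python
# Illustrative arithmetic: does a high threshold leave enough lead time to act?
ALERT_THRESHOLD = 0.95     # alert fires at 95% utilization (assumed)
GROWTH_PER_HOUR = 0.02     # disk filling at 2 percentage points per hour (assumed)
TIME_TO_REMEDIATE = 4.0    # hours needed to add capacity or move the workload (assumed)

hours_of_warning = (1.0 - ALERT_THRESHOLD) / GROWTH_PER_HOUR  # 2.5 hours here

if hours_of_warning < TIME_TO_REMEDIATE:
    print("Iceberg dead ahead: the alert arrives too late to avoid downtime.")
else:
    print("Enough lead time to act.")
```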

It is, unfortunately, the approach most network managers take, because the sea-of-noise scenario is actually worse: in all that noise you can miss the really important issues.  Those who take this approach typically invest heavily in tools which help them get back up and running quickly.  But those tools cannot keep critical infrastructure from failing in the first place.

Which brings us to the third approach: set every threshold correctly!  This sounds like an obvious solution, but nobody does it, for two reasons.  The first is that it is very, very time-consuming (and tedious) to think through every threshold for every critical measured value in your environment.  If anyone suggests this project, and I have seen it suggested, it lasts for about a month.  By that time, everyone is back to putting out urgent fires.  The second reason nobody does this is that even if you did it, the ground is continually changing under your feet: assets get re-purposed, workloads change, priorities shift.  As soon as you’re done, you have to do it all over again (and now you’re behind).  That is why nobody takes this approach.  I’m not sure why, but this kind of reminds me of the ‘evil bit’ approach to solving the problem of malicious network traffic (documented in RFC-3514).
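A quick back-of-envelope count shows why “set every threshold correctly” collapses under its own weight. The numbers below are assumptions chosen only to show the scale of the exercise:

```python
# Illustrative scale estimate: how many thresholds would have to be set "correctly"?
vms = 1000                   # virtual machines in a modest environment (assumed)
metrics_per_vm = 5           # e.g. CPU, memory, disk, network, ready time (assumed)
minutes_per_threshold = 10   # time to reason about one "correct" value (assumed)

thresholds = vms * metrics_per_vm
hours_of_work = thresholds * minutes_per_threshold / 60
print(f"{thresholds} thresholds, roughly {hours_of_work:.0f} hours of tuning, "
      "to be redone every time the environment changes")
```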

[Image: managing virtualization - correct thresholds]

From a practical perspective, it is extremely hard to set thresholds in any but the most simple and static of environments.  As a result, network and infrastructure managers spend inordinate amounts of time focused on putting out fires and recovering from catastrophes.

Which leads me to the philosophical problem of the threshold approach to IT and managing virtualization.

A threshold represents a division between two states: acceptable and unacceptable.  Dividing our environment in this way leads us naturally into binary thinking: when we are in the acceptable state, we do nothing, because the current state is acceptable.  And when we are in the unacceptable state, we take action to bring us back into the acceptable state.  But things aren’t just good or bad.  They can be very good, they can be very bad, they can be a little bad, they can be marginally OK.  But we don’t see any of this if we are thinking about the world in terms of acceptable/unacceptable.
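The point can be seen in a single comparison: a threshold collapses a whole range of states into two. The sketch below uses a hypothetical grading (the bands are mine, purely for illustration) just to show the information the binary check throws away:

```python
# A threshold reduces a continuum to a binary; the gradations below are lost.
THRESHOLD = 0.85

def binary_view(utilization: float) -> str:
    return "unacceptable" if utilization >= THRESHOLD else "acceptable"

def graded_view(utilization: float) -> str:
    # Hypothetical grading, only to show what a binary check discards.
    if utilization < 0.50: return "very good"
    if utilization < 0.75: return "ok"
    if utilization < 0.85: return "marginal"
    if utilization < 0.95: return "a little bad"
    return "very bad"

for u in (0.40, 0.70, 0.84, 0.86, 0.97):
    print(f"{u:.0%}: binary={binary_view(u)}, graded={graded_view(u)}")
```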

So what’s the solution?

In my next post I’ll describe why VMTurbo’s Demand-Driven Control is the only approach to managing virtualization that won’t ever ask you to set a threshold.

Or try it for free now.
