IT Management: Are We Chasing Yesterday’s Problem? – Part 1

January 24th, 2012

PART ONE: Alert Suppression is NOT Root Cause Analysis

Root Cause Analysis (RCA) in IT means accurately inferring the specific problem that is causing a set of observable symptoms. It is similar to what doctors call a “medical differential diagnosis”, where the doctor infers a patient’s disease from the observed symptoms and a battery of tests. This is an exponentially hard problem because the number of symptom combinations that a given disease may cause grows exponentially with the number of possible symptoms.

To conquer this Root Cause Analysis challenge, you need two things: a proper representation of the causality relation between all possible problems in the environment and the symptoms each of them may cause, and an algorithm that, based on that representation, can infer the problem causing an observed set of symptoms. There are medical journals devoted to this specific exercise for every practice of medicine. And that’s just for a single problem (disease).

In the context of IT, and especially of virtualized IT, the Root Cause Analysis challenge is even more complex, because the causality relation changes continuously as the environment changes (e.g., workload motion) and because multiple problems are often occurring at the same time. We are very familiar with the complexities of this challenge: in the mid-1990s at SMARTS (currently owned by EMC) the team here attacked (and solved) this challenge using the Codebook algorithm. We applied it to distributed network infrastructures and you can read all about our approach here.
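To make the two ingredients concrete, here is a minimal sketch of a causality representation and a lookup that infers a problem from observed symptoms. It is a toy nearest-signature match, not the SMARTS Codebook implementation, and every problem name, symptom name and function in it is a hypothetical placeholder chosen for illustration.

```python
# Hypothetical causality model: each problem maps to the symptoms it may cause.
# All problem and symptom names are invented for this example.
CAUSALITY = {
    "datastore_latency": {"vm_io_wait_high", "app_response_slow", "iops_low"},
    "host_cpu_contention": {"vm_cpu_ready_high", "app_response_slow"},
    "memory_ballooning": {"vm_swap_high", "app_response_slow", "vm_io_wait_high"},
}


def infer_problems(observed, causality=CAUSALITY):
    """Return the problem(s) whose symptom signature best matches `observed`."""

    def mismatch(signature):
        # Size of the symmetric difference: expected symptoms that are missing
        # plus observed symptoms that the problem does not explain.
        return len(signature ^ observed)

    best = min(mismatch(sig) for sig in causality.values())
    return sorted(p for p, sig in causality.items() if mismatch(sig) == best)


# An exact match, and a match that tolerates one missing symptom.
exact = infer_problems({"vm_cpu_ready_high", "app_response_slow"})
noisy = infer_problems({"vm_io_wait_high", "iops_low"})
```

Choosing the closest signature rather than demanding an exact match lets the lookup tolerate a lost or spurious symptom. In a virtualized environment the hard part is keeping the table itself current as workloads move, which this sketch does not attempt.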

Almost two decades later, a flood of virtualization monitoring tools is popping up, all claiming to do Root Cause Analysis. That is quite a stretch from reality. In essence, these products are more accurately categorized as event filtering and/or alarm suppression tools. They do a good job of collecting hundreds of different metrics and generating an alert when a metric crosses a threshold…

“Doctor, I have a fever.”

Some tools do more and are able to trend metrics to generate a predictive alert…

“Doctor, I am going to have a fever.”
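The two kinds of alert described so far can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the equally spaced samples and the straight-line extrapolation are choices made for the example, not how any particular product works.

```python
def threshold_alert(value, limit):
    """'I have a fever': the metric is over the limit right now."""
    return value > limit


def predictive_alert(samples, limit, horizon):
    """'I am going to have a fever': fit a least-squares line to equally
    spaced samples and check the value forecast `horizon` steps ahead."""
    n = len(samples)
    if n < 2:
        return False  # not enough history to estimate a trend
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    slope = num / den
    intercept = mean_y - slope * mean_x
    forecast = intercept + slope * (n - 1 + horizon)
    return forecast > limit


cpu_utilization = [60, 65, 70, 75]  # hypothetical percentages, one per interval
now = threshold_alert(cpu_utilization[-1], 90)
soon = predictive_alert(cpu_utilization, 90, horizon=4)
```

Either way, the output is still a statement about one metric. Nothing here says what is causing the trend.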

The more sophisticated tools use a variety of pattern matching methods to be able to predict alerts more specifically…

“Doctor, every morning at 8 AM I will have a fever and you can ignore it.”

The emerging sophisticated analytics tools do even more: they monitor and analyze dozens or even hundreds of metrics per virtual machine (VM). When several of the metrics are out of norm, these tools aggregate them into a single message about that VM. For example, instead of alerting separately that memory utilization is above its threshold, CPU utilization is above its threshold and IOPS are below a threshold, they alert that the VM has a problem. It is as though you went to the doctor with a fever, stomach pain, headaches and high blood pressure, and the doctor told you that you are sick (um, yeah…).
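A rough sketch of that aggregation step shows how little it actually infers. The alert tuples, VM names and message wording below are hypothetical, assumed only for this example.

```python
def aggregate_alerts(alerts):
    """Collapse per-metric alerts, given as (vm, metric) pairs, into one message per VM."""
    by_vm = {}
    for vm, metric in alerts:
        by_vm.setdefault(vm, []).append(metric)
    # The summary counts the out-of-norm metrics; it names no cause.
    return {
        vm: f"{vm} has a problem (out-of-norm metrics: {len(metrics)})"
        for vm, metrics in by_vm.items()
    }


raw_alerts = [
    ("vm-01", "memory_utilization_high"),
    ("vm-01", "cpu_utilization_high"),
    ("vm-01", "iops_low"),
    ("vm-02", "cpu_utilization_high"),
]
summary = aggregate_alerts(raw_alerts)
```

Four alerts become two messages, which is less to read, but the grouping key is simply the VM. No causality model is consulted anywhere.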

Alert suppression (alarm filtering, suppression and aggregation) has value, but it is NOT Root Cause Analysis. In effect, all it does is reduce the amount of information you have to deal with when reacting to issues in your virtual environment. It does not really get us any closer to HOW we should best operate our virtualized environments. We can do better! We have to do better! And the real question we need to answer is: to troubleshoot or not to troubleshoot?

I’ll address that in my next blog post. Stay tuned…
