Is firefighting your idea of datacenter management?

October 29th, 2014 by

Would you rather your datacenter be healthy or broken? It seems like a stupid question, right? Who in their right mind would want a broken datacenter? Nobody, of course. Or maybe there are some people who don’t mind having a broken datacenter.

Maybe not always, but what if you can fix it fast? Just having everything down for an hour or two would be okay, right? After all, going into “firefighting mode” is fun and exciting. I hope at this point you think I’m crazy, but this is what I see EVERY day: people who approach datacenter management by being okay with things breaking as long as they can fix them quickly. The thinking goes like this: if I buy the best virtualization management tool on earth and find PROBLEMS early, then I can fix things faster.

And that, my friends, is the fundamental flaw with every monitoring approach. We’re looking to solve the problem faster and reduce the time to resolution. But what is the end goal? Are you really looking to reduce how long it takes to solve a problem? Or maybe your real goal isn’t to fix problems faster at all. Maybe the goal is to extend the time between failures to deliver a service and to minimize performance degradation. In other words, reduce the amount of time that your environment is broken in the first place. Now doesn’t that seem better?

Think about it for a minute. You don’t drive your car until it overheats and then think about checking the oil. Instead, you change the oil regularly to prevent your car from breaking down in the first place. Now apply this same logic to datacenter management. Instead of waiting for something to break and then trying to fix it, try conducting preventative maintenance so that you can go longer without fixing anything. That’s the difference between two measures. Mean time to resolution is about waiting for something to break and then fixing it as fast as possible. Mean time between failures to deliver a service (or between performance degradations) is about extending, as much as possible, the time between one thing going wrong and the next.
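To make the two measures concrete, here is a minimal sketch that computes both from an incident log. The function name `mttr_and_mtbf` and the sample incidents are made up for illustration, and it assumes the incidents are sorted by start time and do not overlap.

```python
from datetime import datetime, timedelta

def mttr_and_mtbf(incidents):
    """Return (mean time to resolution, mean time between failures).

    `incidents` is a list of (start, end) datetime pairs, assumed to be
    sorted by start time and non-overlapping (hypothetical log format).
    """
    # Time to resolution: how long each incident lasted.
    repair_times = [end - start for start, end in incidents]
    mttr = sum(repair_times, timedelta()) / len(repair_times)

    # Time between failures: healthy time from the end of one incident
    # to the start of the next.
    gaps = [incidents[i + 1][0] - incidents[i][1]
            for i in range(len(incidents) - 1)]
    mtbf = sum(gaps, timedelta()) / len(gaps) if gaps else None
    return mttr, mtbf

# Illustrative data only: three outages, one week apart.
incidents = [
    (datetime(2014, 10, 1, 10, 0), datetime(2014, 10, 1, 12, 0)),
    (datetime(2014, 10, 8, 10, 0), datetime(2014, 10, 8, 11, 0)),
    (datetime(2014, 10, 15, 9, 0), datetime(2014, 10, 15, 10, 30)),
]
mttr, mtbf = mttr_and_mtbf(incidents)
print("MTTR:", mttr)  # average outage length
print("MTBF:", mtbf)  # average healthy stretch between outages
```

Faster firefighting shrinks the first number. Preventative maintenance is aimed at growing the second.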

As much as you may try to do this today, it’s physically impossible under the prevailing ideology. I frequently talk with people who have written scripts that say: if memory utilization goes above 90%, take action X, and if CPU goes above 85%, take action Y. Or you could knock those thresholds down to 80% for memory and 75% for CPU and call it proactive. But either way, this is a patch, not a solution. It fixes the problems faster instead of preventing them from happening in the first place.
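The kind of script described above usually boils down to something like the following sketch. The function name `reactive_actions` is hypothetical, and "action X" and "action Y" are the same placeholders used in the text, not real remediation steps.

```python
def reactive_actions(memory_pct, cpu_pct, memory_limit=90.0, cpu_limit=85.0):
    """Return the placeholder actions a threshold script would trigger.

    Nothing happens until a limit has already been crossed, which is
    what makes this approach reactive.
    """
    actions = []
    if memory_pct > memory_limit:
        actions.append("action X")
    if cpu_pct > cpu_limit:
        actions.append("action Y")
    return actions

# Original thresholds: 90% memory, 85% CPU.
print(reactive_actions(92, 70))
# Same host at 82% memory and 78% CPU: nothing fires yet.
print(reactive_actions(82, 78))
# "Proactive" thresholds: 80% memory, 75% CPU. Same logic, earlier trigger.
print(reactive_actions(82, 78, memory_limit=80, cpu_limit=75))
```

Lowering the limits only moves the tripwire. The script still waits for the environment to reach a place you don’t want it to be before doing anything.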

To truly extend the time between failures to deliver a service and minimize performance degradation requires a different way of thinking. Instead of defining where you don’t want to be and then waiting to do something until you get there, define where you want to be and take actions to keep you there. That is the only way to keep things continuously running smoothly in your datacenter.

A global medical devices manufacturer recently came to me with this request: “I would like to get out of firefighting mode.” It seems like a simple request. It’s something that multiple vendors had promised they could help with. The company spent hundreds of thousands on tools to help identify problems “better and faster,” but none of them lived up to the dream of getting out of “firefighting mode.”

Then the company implemented VMTurbo. After taking the initial set of 200+ preventative recommendations, they realized that the environment was now in a healthy state. The hundreds of alarms from SCOM stopped. By automating decisions within VMTurbo, they were able to take over 50 preventative actions a day across three global datacenters, without human interaction. The team now spends time planning for and implementing their next-generation cloud architecture as opposed to firefighting and attempting to manage Hyper-V. They still leverage SCOM for serious infractions and as a centralized point of management, all while VMTurbo runs in the background and keeps everything in a desired state.

This is what VMTurbo does. Instead of finding the place that you don’t want to be, we define the place you want to be. We take the time to understand all the workloads and interdependencies within the environment. Then, we make the decisions for you, in real time, to keep you right where you want to be. As your workloads change VMTurbo continues to make the decisions needed to keep you in a desired state. A state where all your workloads get all the resources they need. No alerts, no angry end users.

This article is about performance. Read more like it at the [Performance, Efficiency, Agility] series.
