New approach to IT operations management
100 years ago Albert Einstein came up with the theory of general relativity. The theory was confirmed during the solar eclipse of May 29, 1919, exactly 96 years ago today.
One of Einstein’s most famous quotes is “The definition of insanity is doing the same thing over and over and expecting different results” (which, I’ve read, isn’t originally his). Pretty obvious, yet sometimes we need someone to point it out to us. Specifically, for the last few years we have been managing our IT operations the same way over and over again, and yet… we suffer.
How do we operate IT?
We want to know that everything is under control. So we monitor. EVERYTHING. There are literally thousands of monitoring tools out there. Want to monitor your OS? Here is a list of 82 tools just for Linux (don’t worry, there are more for other operating systems). Want to monitor your cloud servers? Here is a list of 47 tools built specifically for monitoring servers in the cloud. Want to monitor your applications? Here are 40 application monitoring tools.
The lists above are nothing but a drop in the ocean of different monitoring solutions up and down the IT stack. You collect all this data so when there is a deviation from normal behavior or when a threshold has been exceeded, you get an alert.
Then your engineers examine the alert and try to fix the problem. Of course, they do that WHILE the environment is suffering. Every day there is a new monitoring tool that offers more and better metrics. However, going back to Einstein, as long as we keep feeding this monitor/break/fix loop, we will never prevent problems in the environment.
Moreover, fighting a fire once it has already started is a problem that software cannot solve. Once you reach that point, the root-cause analysis is done by humans. That can never change. You might be able to “automate” some aspects here (e.g., if host A is suffering from too much CPU, move a VM somewhere else). However, you will never get to a complete solution that way.
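To make that kind of “automation” concrete, here is a minimal Python sketch of a reactive rule. Everything in it is an assumption made for illustration: the Host class, the CPU_THRESHOLD value and the react_to_cpu_alert function are hypothetical names, not part of any real monitoring product. Notice that the rule only fires after a host has already crossed the threshold.

```python
from dataclasses import dataclass

CPU_THRESHOLD = 0.90  # hypothetical alert threshold (fraction of host CPU)


@dataclass
class Host:
    name: str
    vm_cpu: dict  # VM name -> share of this host's CPU the VM uses (0.0 to 1.0)

    @property
    def cpu(self) -> float:
        # Host utilization is modeled as the sum of its VMs' shares.
        return sum(self.vm_cpu.values())


def react_to_cpu_alert(hosts):
    """Reactive rule: when a host is already over the threshold,
    move its busiest VM to the least-loaded other host."""
    moves = []
    for host in hosts:
        if host.cpu <= CPU_THRESHOLD or not host.vm_cpu:
            continue  # no alert for this host, nothing happens
        others = [h for h in hosts if h is not host]
        if not others:
            continue  # nowhere to move the VM
        target = min(others, key=lambda h: h.cpu)
        vm = max(host.vm_cpu, key=host.vm_cpu.get)
        target.vm_cpu[vm] = host.vm_cpu.pop(vm)
        moves.append((vm, host.name, target.name))
    return moves


hosts = [
    Host("A", {"vm1": 0.50, "vm2": 0.45}),  # 95% busy: the alert fires
    Host("B", {"vm3": 0.20}),
]
moves = react_to_cpu_alert(hosts)
print(moves)
```

The rule handles exactly one symptom, and only once host A is already suffering. Every other kind of problem needs its own rule, which is why this path never adds up to a complete solution.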
In conclusion: If we want to keep our IT operations continuously in a healthy state and assure application performance, we must completely change the way we operate. Instead of starting from the alerts at the bottom, we need to start at the TOP. Figure out what “healthy” means for our environment. How do you get there? Then you can automate actions to get there and, just as importantly, STAY there. Not only to solve performance issues, but to prevent them from happening in the first place.
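As a rough illustration of starting at the top, the Python sketch below first writes down what “healthy” means and then plans the actions needed to get there. It is only a sketch under stated assumptions: HealthyState, its 70% CPU ceiling and plan_actions are hypothetical names and numbers chosen for this example, and a real environment would have to define health across far more than CPU.

```python
from dataclasses import dataclass


@dataclass
class HealthyState:
    # Hypothetical definition of "healthy": no host above 70% CPU.
    max_cpu: float = 0.70


def plan_actions(hosts, healthy):
    """hosts maps host name -> {VM name -> CPU share}.
    Returns (vm, source, destination) moves that bring hosts back
    within healthy.max_cpu wherever a valid destination exists."""
    load = {name: dict(vms) for name, vms in hosts.items()}  # work on a copy
    moves = []

    def cpu(name):
        return sum(load[name].values())

    progress = True
    while progress:
        progress = False
        for src in sorted(load, key=cpu, reverse=True):
            if cpu(src) <= healthy.max_cpu:
                continue  # this host is already healthy
            # Try the smallest VM first to disturb as little as possible.
            for vm, share in sorted(load[src].items(), key=lambda kv: kv[1]):
                fits = [d for d in load
                        if d != src and cpu(d) + share <= healthy.max_cpu]
                if fits:
                    dst = min(fits, key=cpu)
                    load[dst][vm] = load[src].pop(vm)
                    moves.append((vm, src, dst))
                    progress = True
                    break
            if progress:
                break  # re-evaluate the whole environment after each move
    return moves


hosts = {
    "A": {"vm1": 0.40, "vm2": 0.35},  # 75% busy: unhealthy, but no alert yet
    "B": {"vm3": 0.20},
}
plan = plan_actions(hosts, HealthyState())
print(plan)
```

Run on a schedule, a loop like this acts while host A is still below any typical alert threshold, so the goal is staying healthy rather than recovering from a failure. A destination is accepted only if it stays healthy after the move, so fixing one host never breaks another.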
Any other approach to IT operations management is insane. Just ask Einstein.