Why can’t we be friends?
When I think about the relationship between most application owners and IT operations folks, a certain song comes to mind: Why can’t we be friends, and why can’t we? Though our focus is on different ends of the stack, our ultimate goal is the same: delivering reliable service that meets or exceeds the expectations of the end user.
The real question is, if the goal is evident, and it should be, why is there such a misalignment in the way that infrastructure folks and application owners think about delivering it?
The call comes in – The website is down…
A call comes in. The website is not responding well. Customers are experiencing lag when loading pages, and the shopping cart is freezing up when they try to check out. The business is losing money, and the shareholders are losing faith because this is not the first time this has happened.
What happens next is something that happens all too often and something that could be avoided but unfortunately is not…
The application owner blames the operations guy. “If you just gave me the resources I asked for, this would never be happening,” he says.
“Are you kidding me? I gave you way more resources than you needed, and that is why your application is running slow. I have all this CPU ready queue dragging on my VMs’ ankles, weighing them down and impacting the performance of my entire cluster,” the IT operations person fires back. “And that memory you claim to need is just getting cached by your database. You’re not even using it!”
But before anyone can figure out exactly what happened, the usage drops, response times recover, and nothing is resolved. Both parties walk away feeling resentment towards the other.
I just want to write code!
If I am an application owner, all I care about is having a “little black box” that allows me to do what I need to do. I should be able to write, enhance, and maintain code. I should be able to build and deploy. With the exception of having a say in the size of my VM, I don’t really care about what is happening underneath because it doesn’t affect me, or shouldn’t… The problem is, it oftentimes does.
I am forced to care and monitor performance of my application because I think it’s being affected negatively from underneath. I’m not entirely sure, but I’m pretty sure (not knowing forces me to point fingers).
Meanwhile in IT operations…
In operations, performance is managed by thresholds that I have put a whole lot of time, effort, and thought into so I can get the jump on problems. I quickly fix those problems before the application owner or end user finds out. Sometimes. When usage goes up I get a lot of alerts, but many of them are false positives, so the actual failure is never pinpointed. As a consequence, I always get blamed. ALWAYS.
If nothing is changed it’s only going to get worse…
The problem is that the complexity of the infrastructure is exploding, driven by the demand for applications and the rate at which they are now expected to be built and deployed (think continuous integration and agile DLCs) to keep up with customer demands and expectations.
This leads application owners, who shouldn’t have to care about or understand virtualization and the lower layers of the stack, to begin monitoring performance themselves and working against operations instead of with them.
What does that look like?
If I am an IT operations guy, and I am looking at memory, typically the most constrained resource in the datacenter, the metrics provided to me come from the hypervisor: ‘active guest memory’ and ‘consumed host memory’.
However, if I am an application owner or a DBA, I am looking at different memory metrics altogether: ‘heap’ size and DBMem.
That being said, it makes sense that there are so many battles over how to properly size a VM. We want to make sure it has enough to promote the best performance, but not so much that we are being wasteful and, in the case of CPU ready queue, ballooning, and swapping, actually hurting the performance of the guest. How can we agree if we are unable to relate what I am seeing from an infrastructure perspective to what you are seeing from an application perspective? We can’t. And until we have the ability to do so, the argument will continue…
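To make that idea a little more concrete, here is a rough sketch of what putting both views side by side could look like. It is only an illustration under my own assumptions: the `MemoryView` fields, the `sizing_summary` function, the 25% headroom, and the sample numbers are all hypothetical, and treating heap plus database memory as "application demand" is a simplification, not a sizing rule from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class MemoryView:
    """One VM seen from both sides of the stack (all values in MB, all hypothetical)."""
    configured_mb: int     # memory allocated to the VM
    active_guest_mb: int   # operations view: 'active guest memory' from the hypervisor
    consumed_host_mb: int  # operations view: 'consumed host memory' from the hypervisor
    heap_mb: int           # application owner view: heap size
    db_mem_mb: int         # DBA view: memory held by the database


def sizing_summary(view: MemoryView, headroom: float = 0.25) -> dict:
    """Relate the two views and give a rough sizing verdict.

    Assumption: demand is the larger of what the application says it holds
    (heap + database memory) and what the hypervisor sees as active.
    """
    app_demand_mb = view.heap_mb + view.db_mem_mb
    floor_mb = max(app_demand_mb, view.active_guest_mb)
    suggested_mb = int(floor_mb * (1 + headroom))

    if view.configured_mb < floor_mb:
        verdict = "undersized"
    elif view.configured_mb > suggested_mb:
        verdict = "oversized"
    else:
        verdict = "ok"

    return {
        "app_demand_mb": app_demand_mb,
        # Memory the host is backing that the guest is not actively touching,
        # e.g. cache the database is holding on to.
        "held_but_not_active_mb": max(view.consumed_host_mb - view.active_guest_mb, 0),
        "suggested_mb": suggested_mb,
        "verdict": verdict,
    }


# Made-up numbers for the kind of VM both sides were arguing about earlier.
example = MemoryView(
    configured_mb=16384,
    active_guest_mb=3072,
    consumed_host_mb=15000,
    heap_mb=2048,
    db_mem_mb=4096,
)
print(sizing_summary(example))
```

The point is not the arithmetic. It is that both of us are looking at the same record, so the conversation can be about the numbers instead of about who is to blame.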
It’s not just you…
I think it’s important to point out that this fire drill is not specific to any one organization or vertical. It is, however, specific to organizations that have invested heavily in virtualization and cloud computing.
The complexity of our virtual estates is hard to believe. Our role as IT professionals has never had more of an impact on the financial well-being of the business, and that impact is only growing.
Let’s work on it!
Because of the rapidly growing complexity of our virtual estates, it is paramount that both development and operations start seeing the whole stack and understanding how the size of a VM affects the performance of an application. How do we take what I am seeing from the VM and what you are seeing from the application and use that information not to combat one another, but to maximize our outputs?