In my previous post I covered what makes managing and controlling cloud environments so difficult. The overwhelming number of configuration options, together with constant fluctuations in demand, makes it extremely challenging to optimize cost and performance.
What does a successful digital transformation look like?
Our customers migrate to the cloud for many reasons, the most common being Agility, Cost and Elasticity, yet they rarely achieve all three. Truly ACE-ing the cloud and unlocking the value that drove them there is more challenging than they originally anticipated.
Businesses today, and IT departments as a part of them, are constantly asked to do more with less. In the majority of datacenters today it takes 6-8 weeks to provision a new VM. IT has slowly become the bottleneck for innovation, significantly increasing time to market. At the pace the industry is evolving, it has become vital that organizations operate fast, innovate constantly and stay as lean as possible while doing it. Digital transformation efforts have been put in place for exactly that reason.
Agility empowers end users to be independent and self-sufficient: they are free to operate at the speed of creation and are accountable for their own resources. Elasticity ensures that only the resources currently needed are paid for, eliminating over-provisioning and dynamically adjusting to changes in demand, whether seasonal or unexpected. With these two in check, our customers believe the cost of IT will drop significantly.
The reality, though, is that achieving these is extremely challenging and beyond human scale. Most organizations simply take the same practices with them to the cloud, hoping they will work better there. It’s like de-cluttering your garage but, instead of throwing the unused and unnecessary things away, moving them to a mini storage facility where you pay monthly for the space they take up.
One of our customers, a large oil and gas company, migrated 400 VMs to the cloud with exactly that mentality. Shortly thereafter they realized that they were paying about $4M per year for those applications – 6x what the equivalent applications cost on-prem.
The Shame Game
What makes agility and accountability so hard to accomplish? With the introduction of agile methodologies and on-demand compute we empowered business units to be self-sufficient and operate faster, but at the end of the day, those application owners are not experts in application lifecycle management or optimization.
When it is easier to provision a new resource than to search for an existing one, and extremely hard to understand what is running in the estate, the sprawl of ghost infrastructure is massive – around 30% annually. In addition, application owners have operated for decades in an over-provisioned estate: because virtualization allows resource sharing and hardware is overcommitted, the pain of over-provisioning was limited. In the cloud, each allocated resource is paid for by the second. Every second. This mode of operation quickly becomes painfully expensive. Moreover, the app owners are now accountable for these costs.
What did we do to try to fix the problem? We invented shame-back. We threw cloud bills at app owners, hoping to shame them into a change of mentality. The result was a shift of focus: they now spend anywhere from 10% to 50% of their time staring at cost and resource utilization dashboards and cross-referencing catalogs, trying to make sense of the data and drive cost down. Not only did that fail to solve the problem, it shifted their focus from the core of their job to the chore of optimizing the environment, slowing them down again. That is counterproductive to the original goal of increasing the speed of innovation.
Out of Human Scale
What makes elasticity so hard to accomplish? Decoupling the workload from the hardware, together with the arrival of public cloud providers, made it possible for organizations to expand and contract their estates based on resource demand, the seasonality of the business and the different business cycles of each application. In on-premises datacenters we had to purchase hardware for the peak demand of resources. No matter how optimized and dense we managed to keep the environment, the overall peak demand had to be served. Public clouds changed that and allowed us to pay only for the resources we use, when we use them. The reality, though, is that CSPs charge based on allocation. Unless you constantly change the allocation to match the demand, nothing has really changed: you still pay for peak demand, only now you do it per workload.
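To make the allocation-versus-demand point concrete, here is a minimal sketch comparing a workload that stays allocated at its peak with one that is resized every hour. The demand profile, the price and the function names are all hypothetical, and prices are kept in whole cents to avoid rounding noise.

```python
import math

def static_cost(hourly_demand, cents_per_unit_hour):
    """Cost when the allocation is fixed at peak demand for every hour."""
    peak = math.ceil(max(hourly_demand))
    return peak * len(hourly_demand) * cents_per_unit_hour

def elastic_cost(hourly_demand, cents_per_unit_hour):
    """Cost when the allocation is resized each hour to match demand,
    rounded up to whole units."""
    return sum(math.ceil(d) for d in hourly_demand) * cents_per_unit_hour

# Hypothetical day: 8 quiet hours, 4 peak hours, 12 moderate hours (in vCPUs).
demand = [2] * 8 + [8] * 4 + [4] * 12
price = 5  # hypothetical price in cents per vCPU-hour

print("static :", static_cost(demand, price))
print("elastic:", elastic_cost(demand, price))
```

With this made-up profile the statically allocated workload costs twice as much as the elastic one, even though both serve exactly the same demand.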
There are 1.7 million configuration options for EC2 instances alone, 90 additional services available in AWS, and similar numbers in Azure. Add an ever-changing cost landscape, where 70% of those options change within the course of a year, and it is no surprise that most organizations consume public cloud almost statically. Elasticity means the estate expands when demand increases, so applications are continuously guaranteed the resources they need, and contracts when demand decreases, so you do not pay for idle or over-provisioned resources. To achieve that you must consider every available configuration option in real time, and with over a million of them that is beyond human scale.
With the above not being accomplished, it is almost expected that 80% of organizations see costs of more than 300% of what they expected. Continuous optimization, deleting idle resources and rightsizing across all resources can generate incredible savings, in some cases reaching 75% or more. None of this is simple to accomplish, and without intelligent software and automation I would argue it is impossible.
Extensive monitoring and a multitude of charts have never helped any organization truly optimize. Making sense of the data is such a lengthy process that, by the time you reach a conclusion, the data has changed. See more about this in this post.
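As a rough illustration of what "considering each configuration option" means, the sketch below picks the cheapest option that still satisfies a workload's current demand. The four-entry catalog, its names and its prices are invented for the example. A real catalog has orders of magnitude more entries and more dimensions (storage, network, region, billing model), which is exactly why this has to run continuously in software.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gib: int
    cents_per_hour: int

# Hypothetical catalog; names and prices are illustrative only.
CATALOG = [
    InstanceType("small", 2, 4, 5),
    InstanceType("medium", 4, 8, 10),
    InstanceType("large", 8, 16, 20),
    InstanceType("mem-large", 4, 32, 18),
]

def cheapest_fit(catalog, vcpus_needed, memory_needed_gib):
    """Return the cheapest option that covers the demand, or None if none fits."""
    candidates = [
        t for t in catalog
        if t.vcpus >= vcpus_needed and t.memory_gib >= memory_needed_gib
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda t: t.cents_per_hour)

# Re-evaluate whenever demand changes: the best option moves with it.
print(cheapest_fit(CATALOG, 3, 6))
print(cheapest_fit(CATALOG, 2, 20))
```

Even this toy version has to be re-run every time demand or prices move, for every workload in the estate.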
What Do Customers Do Today?
The easiest way to save money in the cloud is to purchase Reserved Instances (RIs), so the first thing most organizations do is analyze their existing environment, warts and all, and purchase RIs for it. By doing that they commit, for the long term, to the over-provisioned and un-optimized state the estate is currently in. On top of that, they make it static. These RI purchases, when made incorrectly, lock the apps they are purchased for into a single configuration, preventing the organization from leveraging their dynamic and seasonal behavior to drive elasticity. The immediate savings take the pressure off optimization for a short while. But when further optimization becomes necessary and they attempt to optimize based on utilization, the RIs they purchased no longer match the configuration of the instances.
This is exactly what happened to one of our customers. After purchasing RIs for about 30% of their workloads they experienced some savings. When they then tried to optimize further by resizing the instances based on utilization, the bills became larger – 15% higher than the original bills, prior to any optimization!
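The mechanism can be sketched with a few lines of arithmetic. The model below is deliberately simplified and rests on a loud assumption: a reservation is billed for every hour whether or not it is used, and it only covers a running instance of exactly the same type. Real providers offer varying degrees of flexibility, so treat this as an illustration of the trap, not of any provider's billing rules. All prices and type names are hypothetical, and the numbers are not the customer's.

```python
HOURS_PER_MONTH = 730

def monthly_bill(running, reserved, on_demand_cents, reserved_cents):
    """Monthly bill in cents.

    running / reserved map instance type -> count.
    ASSUMPTION: reservations are paid in full even when unused and only
    cover running instances of the exact same type.
    """
    total = 0
    for itype, count in reserved.items():
        total += count * reserved_cents[itype] * HOURS_PER_MONTH
    for itype, count in running.items():
        uncovered = max(0, count - reserved.get(itype, 0))
        total += uncovered * on_demand_cents[itype] * HOURS_PER_MONTH
    return total

on_demand = {"large": 20, "medium": 10}  # hypothetical cents per hour
reserved_rate = {"large": 12}            # hypothetical discounted rate

original = monthly_bill({"large": 3}, {}, on_demand, reserved_rate)
with_ris = monthly_bill({"large": 3}, {"large": 3}, on_demand, reserved_rate)
# Rightsize to "medium" while the "large" reservations keep billing.
resized = monthly_bill({"medium": 3}, {"large": 3}, on_demand, reserved_rate)

print(original, with_ris, resized)
```

In this toy case the reservation first lowers the bill, and the later resize pushes it above where it started, because the unused reservations are still paid for on top of the new on-demand instances.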
The Desired State
Turbonomic understands the full stack, all the way from the application to the instance types, SKUs, existing RI portfolio, and the different billing models and offerings across IaaS and PaaS for compute, storage and network. With that understanding it identifies, in real time, the best configuration for each application component, or determines that a resource is completely idle and can be removed. Since Turbonomic considers every resource an application requires in every decision, it is guaranteed that these actions won’t negatively affect performance, but only increase it. Turbonomic does that in real time to ensure each application gets exactly the amount of resources it needs to deliver on its SLA. Turbonomic can also automate these actions, so application owners focus on the core while we take care of their chores. By doing that, not only do we empower them to optimize their environment and reduce their costs, we also increase their productivity and cut their cloud operating costs.
By implementing these at the oil and gas company I referred to earlier, we were able to bring their cloud bills down by 50%, with an additional 10% in savings identified so far, all without purchasing any RIs – yet!
To read more about the challenges of optimizing cloud environments see my previous post here, and for how the current approaches used by the industry are failing to solve the problem see this series of blogs: The Manual Approach, Rule Based Engines and Batch Analytics.