Containers are the standard for modern applications, enabling rapid, iterative development and portability across any infrastructure. Container platforms like Kubernetes have risen out of the need to manage ever more distributed and dynamic cloud-native applications. Problem solved, right? Wrong.
As we’ve said before, with containers the artifacts change, but the challenge stays the same: How do you continuously assure performance while maximizing efficiency and maintaining compliance with business and IT policies?
Kubernetes is a platform for building platforms. It’s a better place to start; not the endgame.
— Kelsey Hightower (@kelseyhightower) November 27, 2017
Kubernetes is just the start—albeit the start of great things. There’s a reason Kubernetes has an ecosystem surrounding it. There are a lot of people and organizations, including Turbonomic, working to make it even better. Today we’ll zoom in on a specific example: Kubernetes pod rescheduling.
A Pod’s Life…without Turbo
A Kubernetes pod is a group of one or more containers that share the same network namespace (IP address, port space, etc.) and can communicate with each other without special configuration. The life of a pod consists of being created, assigned a unique ID (UID), and scheduled to a node, where it stays until termination or deletion, generally executed through a controller or manually by a person. Life is simple, sweet, and the pod is none the wiser for how healthy it really is. But the end user is…
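To make that concrete, here is a minimal, hypothetical pod manifest (the names and images are illustrative, not from any real deployment): two containers packaged together, sharing one IP address and port space, so the sidecar can reach the web server over localhost.

```yaml
# Hypothetical two-container pod: both containers share the pod's
# network namespace, so they communicate over localhost with no
# special configuration.
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar        # illustrative name
spec:
  containers:
  - name: web
    image: nginx:1.25
    ports:
    - containerPort: 80
  - name: log-shipper           # sidecar polling the web container
    image: busybox:1.36
    command: ["sh", "-c",
      "while true; do wget -qO- http://localhost:80 >/dev/null; sleep 30; done"]
```

Note that once the scheduler binds this pod to a node, Kubernetes will not move it: the binding lasts until the pod terminates or is deleted.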
Slow Deaths of a Pod
Application performance is a full-stack challenge and only as good as its weakest link. With that, resource contention can take many forms, affecting pods and nodes (and, of course, end-users). We’ll go into a deeper discussion in a separate post of why full-stack control—from applications through the infrastructure—is an absolute necessity. For this post, however, we’ll focus on the pod and node layers.
When it comes to applications, slow is the new down. Since we’re having a bit of fun with analogies today (like most days, admittedly), let’s agree that slow, poorly performing pods might as well be dead. So, left to their own devices, pods can die a number of slow deaths (or never come to life).
- Death by “Noisy Neighbor”—Pods on the same node peak together and cause resource contention.
- Death by CPU Starvation—Nodes are unable to provide pods with the CPU they need to perform.
- Long Pending Pod—never mind death, think pre-life purgatory. This pod never even sees the light of day because it can’t be scheduled due to resource fragmentation.
If you think this is morbid, think about the customer that’s about to take their business elsewhere because your application isn’t cutting it.
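The “long pending pod” case is easy to reproduce. Suppose each node in a cluster has 2 CPUs free, for 4 CPUs free in aggregate; a pod requesting 3 CPUs can never be scheduled, because no single node can satisfy the request. A hypothetical sketch:

```yaml
# Hypothetical pod that stays Pending on a fragmented cluster:
# 4 CPUs are free cluster-wide, but only 2 CPUs are free on any
# one node, so the 3-CPU request can never be placed.
apiVersion: v1
kind: Pod
metadata:
  name: big-batch-job           # illustrative name
spec:
  containers:
  - name: worker
    image: busybox:1.36
    command: ["sleep", "3600"]
    resources:
      requests:
        cpu: "3"                # exceeds free capacity of every single node
```

Defragmenting the cluster, by rescheduling existing pods to consolidate free capacity, is what lets pods like this one finally be placed.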
Pod Reincarnation aka Rescheduling
On to happier subjects… Let’s talk about pod reincarnation with Turbonomic. How do you avoid performance degradation in these scenarios? Reschedule the pod. Without Turbonomic, poorly performing pods affect end users because once a pod is scheduled, it can’t be “moved” to another node. Performance degradation occurs, that pod “dies,” and then a new pod is spun up to service that demand on whatever node is available. It’s reactive, and it impacts the end-user experience. But what if pods weren’t bound to nodes for life? What if a pod could start a second life on a node better suited to meeting service levels, before performance degrades?
That’s exactly what Turbonomic does. The platform continuously analyzes the changing resource demands of all the pods, along with the available capacity and constraints of the nodes. It determines which node each pod should reside on, ensuring pods always get exactly the amount of resources they need, no more, no less, while maintaining compliance with placement policies (label and selector, affinity/anti-affinity, taint and toleration, etc.). When Turbonomic reschedules a pod, it does so to prevent performance degradation, spinning up the new pod before terminating the old one, so service is never disrupted.
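Those placement policies live in the pod spec itself, which is what any rescheduling decision must respect. A hypothetical pod combining all three constraint types (names, labels, and images are made up for illustration) looks roughly like this:

```yaml
# Hypothetical pod spec showing the three constraint types a
# rescheduler must honor when choosing a destination node.
apiVersion: v1
kind: Pod
metadata:
  name: constrained-app
  labels:
    app: payments
spec:
  nodeSelector:                 # label & selector: only SSD-labeled nodes
    disktype: ssd
  affinity:
    podAntiAffinity:            # anti-affinity: spread replicas across nodes
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: payments
        topologyKey: kubernetes.io/hostname
  tolerations:                  # taint & toleration: allowed on dedicated nodes
  - key: "dedicated"
    operator: "Equal"
    value: "payments"
    effect: "NoSchedule"
  containers:
  - name: app
    image: payments:1.0         # illustrative image
```

Any node chosen for the “second life” of this pod must match the selector, avoid nodes already running an `app: payments` replica, and tolerate the node’s taints.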
Note: Turbonomic delivers full-stack automation, which includes node placement, sizing and provisioning, but we’ll save that discussion for another post.
Check out the Demo
So, to recap, Turbonomic reschedules pods to avoid the following performance risks:
- Noisy Neighbor—if pods start to peak together at the risk of resource contention, Turbonomic will determine which pods should be copied to a node without resource contention and terminate the old pod.
- CPU Starvation—if a node is at risk of not getting enough CPU to serve the needs of pods, Turbonomic will reschedule pods to a node with enough resources. Alternatively, it can resize the node—but that’s a subject for—you guessed it—another post: Intelligent Cluster Scaling with Turbonomic.
- Long Pending Pods—Turbonomic puts an end to this pod purgatory because it will continuously reschedule existing pods to avoid resource fragmentation and ensure that new pods can always be scheduled.
As more organizations re-architect their monolithic applications, containers no longer run just short-lived tasks. The resource consumption of these container workloads fluctuates, and adjustments must be made in real time for applications to stay performant and truly elastic.
Production: Where It Gets Real
Cloud native is that recent college grad with a lot of potential, and they know it. Yet they still have so much to learn when they graduate and go off to…Production. For most organizations, containers aren’t yet running in production at scale, so a little over-provisioning at the pod, node, or cluster level doesn’t hurt; performance degradation certainly hurts more. But Production is where it gets real: where IT must constantly navigate the tradeoffs between performance, compliance, and efficiency. Having software (not people) continuously decide what to do and when to do it, and then execute it, is imperative to achieving the full potential of container platforms.