How Chronosphere engineering reduced compute infra costs with automated K8s rightsizing

Rightsizing can be a time-consuming and tricky process. Read how the Cloud Infrastructure team saved cost and time through automation.

Mila Udot, Member of Technical Staff, Cloud Infrastructure | Chronosphere

Mila Udot is a Member of Technical Staff on the Cloud Infrastructure team at Chronosphere. She has over a decade of experience developing and optimizing scalable and distributed systems for enterprise-level software applications. Prior to Chronosphere, she worked at AWS. She currently resides in Seattle and enjoys hiking, yoga, and exploring the beautiful outdoors.

In Chronosphere’s earlier days, the Cloud Infrastructure team was tasked with simply spinning up infrastructure to run our customers’ workloads. Over time, we realized that without processes or a financial framework in place, provisioning infrastructure became much more costly, and we were paying for CPU and memory resources that ended up being unused. 

This started a rightsizing project, which included running analysis tools to determine appropriately sized resource requests for services with low utilization. Our environment consists of Google Cloud, with Kubernetes workloads in Google Kubernetes Engine. In terms of scale, our two production clusters run dozens of tenants with tens of thousands of pods (as of December 2023). 

We wanted to add a mechanism to control what we request so we don’t use more resources than necessary. Here’s a look at the overall project, the processes involved, its impact for our team, and future use cases.

Project goals

Previously, we estimated the resource requirements for containers based on historical data or initially set values. We decided to review the resource requests and adjust them using actual usage data.

Our overall project goal was to reduce compute spending by 3% and to reduce the time spent manually matching the amount of CPU and memory needed to effectively run customer deployments; that process often took a few days to set parameters, apply them, and deploy to all clusters/tenants.

Before we started, we noticed that for some of our workloads, max memory utilization during the previous month was less than 30% across all clusters and tenants.

With this information in mind, we decided on two main goals for this project:

- Update underutilized workload requests with new right-sized values.

- Automate the rightsizing process.

Getting set for success

Then, we realized we needed to select an autoscaler to help manage memory and networking resources as they fluctuated; the options included a vertical pod autoscaler (VPA) or an internal solution of our own. We chose our own internal solution because:

- VPA doesn’t work when we use a horizontal pod autoscaler (HPA).

- We could use cAdvisor metrics stored on our side.

- We could use our own formulas to provide recommended values instead of relying on the VPA’s built-in formula.

Next, it was time to create a formula to help us calculate how much infrastructure to provision for our tenants. It combined cost, time, and utilization. We broke down each component as follows: 

Cost

We wanted to stay on the safe side and not affect service performance by limiting resources too much, so we decided to request from Google Cloud at least the maximum CPU/memory usage observed over some past period of time.

Time period

To figure out the formula, we first considered how much previous usage history would give us enough confidence to base our assumptions on. We decided that one month was long enough to capture recent changes in load.

Usage

UtilizationFlagValue depends on the workload and resource type. The standard value for CPU is 1: CPU is considered a “compressible” resource, so if your app starts hitting its CPU limits, Kubernetes throttles the container, but the app stays alive. The standard value for memory is 0.8: memory is an incompressible resource, meaning that if you run out of memory and need to allocate more for a new or existing process, you either have to kill a process that is taking up memory or the requesting process will crash.

This helped us create our final equation to figure out how to effectively request infrastructure without overprovisioning: 

recommendedRequest = containerMax30Days/UtilizationFlagValue

Here, UtilizationFlagValue ≤ 1, so the recommended request never falls below the peak usage observed over the previous period. 
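
To make the arithmetic concrete, here is a minimal Go sketch of the calculation, assuming the flag values described above; the peak numbers and function name are illustrative, not our tool’s actual code:

package main

import "fmt"

// recommendedRequest applies the formula above: peak usage over the lookback
// window divided by the utilization flag. With a flag of at most 1, the
// recommendation never drops below the observed peak.
func recommendedRequest(containerMax30Days, utilizationFlagValue float64) float64 {
    return containerMax30Days / utilizationFlagValue
}

func main() {
    // Hypothetical peaks observed over the past 30 days.
    peakCPUCores := 1.6  // cores
    peakMemoryGiB := 3.2 // GiB

    // CPU is compressible, so its flag is 1.0; memory is not, so 0.8.
    fmt.Printf("CPU request:    %.2f cores\n", recommendedRequest(peakCPUCores, 1.0))
    fmt.Printf("memory request: %.2f GiB\n", recommendedRequest(peakMemoryGiB, 0.8))
}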

Implementation

With our goals and formulas in place, it was time to set things in motion. To gather the basic information – and set a baseline – we used the following PromQL queries to get peak usages of memory and CPU over the past 30 days:

max_over_time(max(container_memory_usage_bytes{cluster="${cluster}", name!="", namespace="${tenant}", pod=~"${pod_prefix}.*", container="${container}"}/1024/1024/1024)[30d:])

max_over_time(max(rate(container_cpu_usage_seconds_total{cluster="${cluster}",  name!="", namespace="${tenant}", pod=~"${pod_prefix}.*", container="${container}"}[5m]))[30d:])
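
To give a sense of how such queries can be run programmatically, here is a minimal Go sketch that issues one of them as an instant query against the standard Prometheus HTTP API; the endpoint URL and the cluster, tenant, pod prefix, and container values are placeholders, not our actual setup:

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
)

func main() {
    // Placeholder Prometheus-compatible endpoint; substitute your own.
    base := "http://prometheus.example.com/api/v1/query"

    // Peak memory usage (GiB) for one container over the past 30 days;
    // the label values below are examples only.
    query := `max_over_time(max(container_memory_usage_bytes{cluster="prod-1", name!="", namespace="tenant-a", pod=~"ingester.*", container="ingester"}/1024/1024/1024)[30d:])`

    resp, err := http.Get(base + "?" + url.Values{"query": {query}}.Encode())
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body)) // JSON response containing the peak value
}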

As part of the project, the Cloud Infrastructure team created a command-line interface (CLI) reporting tool that runs the queries above against all of a cluster’s tenants and generates suggestions.

The tool provides CPU recommendations equal to the max CPU usage over the past 30 days and uses a MemoryUtilization flag (default = 0.8) for memory recommendations. Its formula for memory recommendations is: 

recommendedMemoryRequest = containerMaxAvg30Days/memoryUtilizationFlagValue
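
As a rough sketch of what the reporting loop could look like, here is a small Go example that stubs out the query step and prints memory suggestions for a couple of hypothetical tenants and containers; it is not the actual tool:

package main

import "fmt"

// workload identifies one container within a tenant.
type workload struct{ tenant, container string }

// fetchPeakMemoryGiB would run the memory query shown earlier for one
// workload; it is stubbed with fixed values for this sketch.
func fetchPeakMemoryGiB(w workload) float64 {
    return map[string]float64{"ingester": 3.2, "querier": 1.1}[w.container]
}

func main() {
    const memoryUtilization = 0.8 // the flag's default value
    workloads := []workload{{"tenant-a", "ingester"}, {"tenant-a", "querier"}}

    fmt.Println("tenant     container   recommended memory (GiB)")
    for _, w := range workloads {
        rec := fetchPeakMemoryGiB(w) / memoryUtilization
        fmt.Printf("%-10s %-11s %.2f\n", w.tenant, w.container, rec)
    }
}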

We manually applied the values generated by the tool to our most underutilized workloads to start saving money as soon as we could; this alone saved 5% of our compute costs.

Then we started to work on process automation. We had already automated our service releases, so we decided to make rightsizing a part of the service release process. This enables periodic adjustments to resource requirements based on changes in the application or workload.
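
Purely as an illustration, a per-workload opt-in might be modeled with something like the following Go struct; the field names are hypothetical and not our pipeline’s actual configuration format:

package main

import "fmt"

// rightsizingConfig is a hypothetical shape for the per-workload opt-in;
// the actual release pipeline's configuration may differ.
type rightsizingConfig struct {
    Enabled           bool
    CPUUtilization    float64 // 1.0 when unset
    MemoryUtilization float64 // 0.8 when unset
}

func main() {
    // A workload opting in with a custom memory utilization target; the
    // release tooling would recompute its requests on each release.
    cfg := rightsizingConfig{Enabled: true, CPUUtilization: 1.0, MemoryUtilization: 0.7}
    fmt.Printf("%+v\n", cfg)
}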

Project outcomes and future plans

After we automated rightsizing, we saw two main outcomes: cost and time savings. 

After we manually right-sized our most underutilized workloads, we were able to initially save 5% of our compute costs. 

In addition to our initial cost savings, rightsizing is now a part of our standard release process. We can opt in any workload to be right-sized with custom utilization flags. Automation eliminates the need for manual intervention, which reduces the time and effort required to perform rightsizing activities. 

Automated rightsizing adapts dynamically to changing workload demands, resource utilization patterns, and infrastructure conditions. This enables us to maximize the value of our Kubernetes infrastructure, improve resource utilization, and achieve better overall cost-efficiency.
