Stop thinking in burn rates


Learn why fixating on burn rate thresholds when setting up an SLO is unnecessary, and how a robust observability platform lets you focus instead on collecting the right metrics to measure your users’ experience.

 

Marcus Hill | Member of Technical Staff

Marcus Hill, a member of technical staff at Chronosphere, has delivered features like Derived Telemetry, Lens and SLOs. His experience at IBM and Microsoft instilled in him an appreciation for large-scale operations in both business and technology. Marcus has also worked at smaller software shops, which broadened his perspective on how different organizations use technology to solve problems. Marcus lives in Philadelphia and enjoys exploring all things food and wine.


Integrating SLOs into workflows

Across the industry, there’s great diversity in the way people integrate service-level objectives (SLOs) into their workflows. Some companies use the same SLOs to alert engineers of system problems and report the health of the system to executives. Others tailor their SLOs for the different use cases.

We’ve even met some organizations that only use SLOs for their engineering organization and hand-roll higher-level reporting, while others use SLOs with no alerting enabled at all. In all cases, they’ve taken the simple concept of SLOs and used it to drive outsized value.

How Chronosphere uses SLOs

Our team at Chronosphere has been dedicated to SLOs for the past year. We’ve lived and breathed the practice, not just in building the feature but in working with our customers and prospects to understand the challenges they face implementing and depending on them. Through this process, we’ve gained insight into not just the theory of SLOs but how organizations operationalize them day to day.

As an individual contributor, there’s a crucial lesson that I’ve learned: Burn rates are a horrible thing to obsess over when configuring an SLO. The burn rate or error burn rate is the rate at which a service is consuming its error budget. This is useful for estimating how quickly a service will breach its SLO, and is essential for SLO alerting to work well.

However, as an end user, fixating on the burn rate threshold when setting up an SLO is unnecessary. A robust observability platform should enable you to focus on collecting the right metrics to measure your users’ experience without getting lost in minutiae.

Doing the math: Calculating burn rates

To understand why burn rates can be complex, let’s review the core components of an SLO. An example SLO would be that some service must meet a target of 99.9% of requests being successful over a 30-day period. In that SLO, the objective would be 99.9%, the service-level indicator (SLI) would be the percentage of requests that were successful and the time window would be 30 days.

Every SLO also has an error budget, which you can think of as the acceptable number of errors or downtime before a service fails to meet its objective. The error budget is calculated as `1 - objective` (1 - 0.999 = 0.001, or 0.1%). Depending on how the SLO is defined, it may be easier to think of the error budget in terms of minutes. To calculate this, you can use the formula below:

error budget (minutes) = (1 - objective) * minutes in time window
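As a quick sketch of that arithmetic in Python (using the 30-day, 99.9% example SLO from above):

```python
def error_budget_minutes(objective: float, window_days: int) -> float:
    """Convert an SLO objective into an error budget expressed in minutes."""
    minutes_in_window = window_days * 24 * 60
    return (1 - objective) * minutes_in_window

# The example SLO: 99.9% of requests successful over 30 days
# gives you 43.2 minutes of error budget.
budget = error_budget_minutes(0.999, 30)
```

So a three-nines service can be fully down for a little over 43 minutes a month and still meet its objective.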

Of course, the error rate that an SLO sees can be calculated as `bad events / total events`, or, if measuring in time units, `bad minutes / total minutes`.

A burn rate is a unitless number that estimates how quickly the error budget would be consumed at the current error rate. It is defined as a simple ratio:

burn rate over some time period X = (error rate over X) / (1 - objective)

or, equivalently:

burn rate over some time period X = (error rate over X) / error budget

A burn rate of 1 indicates that you will consume your entire budget exactly at the end of your SLO time window. A burn rate greater than 1 indicates that, at the current error rate, you will have an SLO miss before the end of the SLO window. For example, a burn rate of 2 indicates that you will exhaust your error budget exactly halfway through your time window.
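Putting the two formulas together, here is a small Python sketch (function names are mine, for illustration only) showing how a burn rate maps to time-to-exhaustion:

```python
def burn_rate(error_rate: float, objective: float) -> float:
    """Unitless ratio: observed error rate over the allowed error rate."""
    return error_rate / (1 - objective)

def days_to_exhaust_budget(rate: float, window_days: int) -> float:
    """At a constant burn rate, when does the error budget run out?"""
    return window_days / rate

objective = 0.999
# A 0.2% error rate against a 0.1% budget is a burn rate of roughly 2 ...
rate = burn_rate(0.002, objective)
# ... which exhausts a 30-day budget in about 15 days.
days = days_to_exhaust_budget(rate, 30)
```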


Burn rate considerations

One thing that people often find confusing about burn rates is that they do not consider how the service has performed historically over the entire SLO time window; instead, they focus on how the service is doing in the more recent past. This lets burn rate alerts behave consistently regardless of any previous outage. Alerting when the remaining error budget in a time window approaches 0 can be handled with separate monitoring.
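That separate remaining-budget check is also simple arithmetic. A sketch, assuming event-based counting over the full SLO window (the function name is mine):

```python
def budget_remaining_fraction(bad_events: int, total_events: int,
                              objective: float) -> float:
    """Fraction of the error budget still unspent over the SLO window."""
    allowed_bad = (1 - objective) * total_events
    return 1 - (bad_events / allowed_bad)

# 1,000,000 requests this window, 400 of them bad, against a 99.9% SLO:
# the budget allows 1,000 bad events, so about 60% of the budget remains.
remaining = budget_remaining_fraction(400, 1_000_000, 0.999)
```

An alert on this value approaching 0 complements burn rate alerts: the former tracks cumulative damage over the window, the latter reacts to the current rate of damage.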

All of this is critical for the underlying alerting (detailed in Chapter Five of the SRE [site reliability engineering] Workbook). However, a developer-focused observability platform should hide these details and let you focus on more direct measurements, like the availability or error rate of your system over time.

Why burn rates are the wrong answer

SLOs are, at their core, a technical tool for a business problem. The questions you should be asking yourself when you’re creating a new SLO are centered on how your system serves your business.

These are things like:

  1. What is the right thing to measure to capture the impact on my users or upstream consumers?
  2. What do I need to change about my system to measure this?
  3. How much degradation is acceptable over the time window?
  4. How critical is this service? What amount of impact is worth paging someone for? Is the answer different during the day vs. the middle of the night?

All of that is hard enough that many of our competitors don’t center alerting in their SLO configuration, preferring to offer it as a secondary step. Creating and managing SLOs is already difficult for all of these nontechnical reasons. Adding burn rates and doing math in a language like PromQL, with its own gotchas, only makes it harder. And no one needs that headache.

A more user-friendly approach focuses on asking practical questions:

  1. What percentage of the budget should trigger an alert?
  2. And over what time window?

This allows engineers to focus on their operational and business needs, not interpreting magic numbers. From there, we calculate the burn rates on your behalf in the exact way that you would. You can keep the SRE book defaults, or you can tweak them as necessary.
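Under the hood, that translation from "percentage of budget over a window" to a burn rate threshold is just arithmetic. A sketch of the mapping, following the SRE Workbook's convention (the numbers below are its well-known fast-burn default, not Chronosphere-specific values):

```python
def burn_rate_for_alert(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: int) -> float:
    """Burn rate threshold equivalent to consuming `budget_fraction` of the
    error budget within `alert_window_hours` of a `slo_window_days` SLO."""
    slo_window_hours = slo_window_days * 24
    return (budget_fraction * slo_window_hours) / alert_window_hours

# The SRE Workbook's fast-burn default: 2% of a 30-day budget in 1 hour,
# which works out to a burn rate threshold of roughly 14.4.
threshold = burn_rate_for_alert(0.02, 1, 30)
```

Notice that the engineer-facing inputs (budget percentage, window) carry obvious meaning, while the output (14.4) is exactly the kind of magic number no one should have to memorize.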

However, we recommend that you don’t fixate on the burn rate. It’s there; you can see it. But when monitoring your SLOs, it’s better to focus on error rates and counts. They’re more grokkable and your less SLO-familiar teammates will thank you!

