How Chronosphere generates and monitors SLOs 

on April 28th 2022

Service level objectives (SLOs) are a foundational element of DevOps practice and of a site reliability engineer’s (SRE’s) responsibilities. They represent internal goals for the essential metrics of a service (e.g. latency, availability, and correctness). Along with service level agreements (SLAs) and service level indicators (SLIs), SLOs help define and monitor the level of service and reliability that a software solution provides to end users or customers.

As stated in the Google SRE handbook, “These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service. Ultimately, choosing appropriate metrics helps to drive the right action if something goes wrong, and also gives an SRE team confidence that a service is healthy.” 

Terminology

Before diving into how Chronosphere generates and monitors SLOs, let’s define some key terms around these levels of service:

  • Service – A capability that is provided to an end user or customer. This is not the same as a single service in a microservice environment; completing one service request typically requires many microservices.
  • Service Level Indicator (SLI) – An indicator or metric for the level of service you are providing. For example, gRPC error ratio is one indicator for service availability.
  • Service Level Objective (SLO) – The target value for a service level that is measured by an SLI over a period of time. An example is 99.95% availability for metric ingestion over the last 30 days. 
  • Service Level Agreement (SLA) – A contractual agreement with customers. If broken, customers receive a partial refund of their payments. Typically an external SLA is a looser form of an internal SLO (99.9% vs 99.95%).
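To make the relationship between these terms concrete, here is a minimal sketch in Python. The request counts are hypothetical, and the function name `availability_sli` is an illustration only, not a Chronosphere API. It computes an availability SLI from an error ratio and compares it against the internal SLO (99.95%) and the looser external SLA (99.9%) used as examples above.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Availability SLI: the fraction of requests that succeeded."""
    if total_requests == 0:
        # Assumption for this sketch: no traffic counts as fully available.
        return 1.0
    return 1.0 - failed_requests / total_requests


SLO_TARGET = 0.9995  # internal objective (99.95%)
SLA_TARGET = 0.999   # looser external agreement (99.9%)

# Hypothetical request counts over the last 30 days.
sli = availability_sli(total_requests=2_000_000, failed_requests=1_500)

meets_slo = sli >= SLO_TARGET
meets_sla = sli >= SLA_TARGET

print(f"SLI={sli:.4%} meets_slo={meets_slo} meets_sla={meets_sla}")
```

With these numbers the service misses its internal SLO while still honoring the external SLA, which is the gap the looser SLA is meant to provide.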

More thorough definitions can be found in the Google SRE handbook.

SLOs at Chronosphere 

At Chronosphere, we have objectives around maintaining a highly reliable service for customers. And in order to help us track and meet these objectives, we’ve published internal SLOs around three categories – availability, performance, and correctness – across our core product functionality for all data types (e.g. metrics and traces). This ensures we’re answering important questions for our customers like:

  • “Can the service be used?” for availability
  • “Is the service fast enough?” for performance
  • “Does the service return correct results?” for correctness

We are currently focused on defining SLOs around the core functions (e.g. metric ingestion, aggregation, querying) as they directly impact our customers. And by focusing on the highest priority use cases, we are building the foundations for a culture of SLOs at Chronosphere.

SLIs to inform SLOs

In order to measure the success of our SLOs, we have several SLI metrics to determine the guardrails of each objective (i.e. what’s in and out of scope). Defining corresponding SLIs for SLOs enables our engineering team to more quickly quantify levels of risk and/or to assess the urgency of an outage. 

Our SLIs help engineers answer the “where”, “what”, and “how” for our SLOs. For example:  

  • Where along the service do you measure?
  • What is a good indicator for the service level? What dimensions should you measure?
  • How do you measure success?

SLO Dimensions 

SLOs measure an aggregate view of the SLI metrics, which means it’s important to select an aggregation level that does not accidentally mask problems for a subset of the traffic. For example, think about monitoring the error rate across all API endpoints for a server. An infrequently used endpoint may fail 100% of the time, yet its failures can get lost in the aggregate error rate, because the overall 99.9% objective is still met.
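The masking effect is easy to reproduce with a few numbers. The sketch below uses made-up endpoint names and request counts (all hypothetical) to show an aggregate success ratio that meets a 99.9% objective while one low-traffic endpoint fails every request. Evaluating the same objective per endpoint, that is, adding a dimension, surfaces the problem.

```python
OBJECTIVE = 0.999  # 99.9% success objective

# Hypothetical traffic per endpoint: (total requests, failed requests).
traffic = {
    "/api/query": (1_000_000, 200),
    "/api/ingest": (500_000, 100),
    "/api/rare-export": (300, 300),  # fails 100% of the time
}


def success_ratio(total: int, errors: int) -> float:
    """Fraction of requests that succeeded."""
    return 1.0 - errors / total


# Aggregate view: all endpoints summed together.
total = sum(t for t, _ in traffic.values())
errors = sum(e for _, e in traffic.values())
aggregate = success_ratio(total, errors)

# Per-endpoint view: the same objective applied to each dimension value.
violations = [
    endpoint
    for endpoint, (t, e) in traffic.items()
    if success_ratio(t, e) < OBJECTIVE
]

print(f"aggregate={aggregate:.4%} meets objective: {aggregate >= OBJECTIVE}")
print(f"endpoints violating the objective: {violations}")
```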

All Chronosphere SLOs have dimensions for cluster and customer, and additional dimensions are added as needed. For example, if a branch of the code significantly changes how the traffic is handled, it needs a dimension.

Monitoring SLOs

As described in the Google SRE handbook, we monitor our SLOs by the error budget burn rate rather than by an individual SLI (i.e. recent error rates). This helps ensure that our alerts are actionable while also reducing the noise of events that are not significantly impacting the error budget.

What do we mean by an error budget burn rate? 

Error budgets are a tool for balancing service reliability with the pace of innovation. Changes are a major source of instability, and development work for features competes with development work for stability. The error budget forms a control mechanism for diverting attention to stability as needed. We calculate the error budget as 1 minus the SLO of the service. For example, a service with a 99.9% SLO has a 0.1% error budget.
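As a quick worked example of that arithmetic, the sketch below computes the error budget for a given SLO and, assuming a time-based availability SLO over a 30-day period (an assumption made here for illustration), converts it into the number of minutes the service may be out of objective. The function names are hypothetical.

```python
def error_budget(slo_target: float) -> float:
    """Error budget as a fraction: 1 minus the SLO target."""
    return 1.0 - slo_target


def allowed_bad_minutes(slo_target: float, period_days: int = 30) -> float:
    """Minutes of failure the budget permits over the SLO period."""
    minutes_in_period = period_days * 24 * 60
    return error_budget(slo_target) * minutes_in_period


budget = error_budget(0.999)
minutes = allowed_bad_minutes(0.999)

# A 99.9% SLO leaves a 0.1% budget, about 43.2 minutes over 30 days.
print(f"budget={budget:.3%} allowed_bad_minutes={minutes:.1f}")
```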

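When monitoring SLOs, we look at the burn rate of an error budget, which describes how fast the budget is being consumed relative to the SLO period. The burn rate is essentially the product of two terms: the ratio of the SLO period to the alerting time window, and the fraction of the error budget consumed in that window. A burn rate of 1 means the budget would be exactly used up at the end of the SLO period.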
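The sketch below works through that calculation. It follows the formulation commonly used in Google's SRE material (burn rate = budget consumed × SLO period ÷ time window), which is an assumption about the exact formula; the function names and sample values are hypothetical.

```python
def burn_rate(budget_consumed: float, slo_period_hours: float,
              window_hours: float) -> float:
    """Burn rate from the fraction of budget consumed in a time window."""
    return budget_consumed * (slo_period_hours / window_hours)


def burn_rate_from_error_rate(error_rate: float, slo_target: float) -> float:
    """Equivalent view: observed error rate divided by the error budget."""
    return error_rate / (1.0 - slo_target)


SLO_PERIOD_HOURS = 30 * 24  # 30-day SLO period

# 2% of the 30-day budget consumed within a single hour.
rate = burn_rate(budget_consumed=0.02,
                 slo_period_hours=SLO_PERIOD_HOURS,
                 window_hours=1)

# At a constant burn rate, the whole budget lasts period / burn rate.
hours_to_exhaustion = SLO_PERIOD_HOURS / rate

print(f"burn rate: {rate:.1f}")
print(f"hours until the budget is exhausted: {hours_to_exhaustion:.0f}")
```

Consuming 2% of a 30-day budget in one hour corresponds to a burn rate of 14.4, which would exhaust the budget in roughly 50 hours if sustained. For a 99.9% SLO, the same burn rate corresponds to an error rate of about 1.44% over that hour.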

If a budget is ever exhausted, there are policies in place to halt development, hold a post-mortem, and discuss potential adjustments to the SLO.

Measuring correctness of SLOs

To ensure correctness for overall metric ingestion, we have probers in each customer environment that continuously run “tests” and verify the output. For example, if a service successfully returns responses but the responses are incorrect, the service is still not reliable, even though it appears available. So in order to better ensure correctness, we use black box probing to control the inputs and verify the outputs. This process is similar to unit testing in production.

While probes help cover any gaps in the monitoring of live user traffic, they are not a replacement for live traffic SLIs, which measure at least some portion of an actual user experience. At Chronosphere, we operate under the recommendation that an SLO uses as many live traffic SLIs as possible and uses probe SLIs to cover any holes in the monitoring process.
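The black box probing idea described above can be sketched roughly as follows. This is not Chronosphere's prober: `InMemoryBackend`, `run_probe`, and the metric name are hypothetical stand-ins, and a real prober would talk to the service over its public write and query APIs. The point is only that the probe controls the input, reads it back, and counts a response as good only when the value is correct, not merely when the call succeeds.

```python
from dataclasses import dataclass


class InMemoryBackend:
    """Hypothetical stand-in for the service under test."""

    def __init__(self, corrupt: bool = False):
        self._data = {}
        self._corrupt = corrupt

    def write(self, name: str, value: float) -> None:
        self._data[name] = value

    def read(self, name: str) -> float:
        value = self._data[name]
        # When corrupt, the call succeeds but returns a wrong value.
        return value + 1 if self._corrupt else value


@dataclass
class ProbeResult:
    ok: bool
    expected: float
    actual: float


def run_probe(backend, metric_name: str = "probe.canary",
              value: float = 42.0) -> ProbeResult:
    """Write a known input, read it back, and verify the output."""
    backend.write(metric_name, value)
    actual = backend.read(metric_name)
    return ProbeResult(ok=(actual == value), expected=value, actual=actual)


def correctness_sli(results) -> float:
    """Fraction of probe runs that returned the correct value."""
    return sum(r.ok for r in results) / len(results)


healthy = InMemoryBackend()
broken = InMemoryBackend(corrupt=True)
results = [run_probe(healthy), run_probe(healthy),
           run_probe(broken), run_probe(healthy)]

print(f"correctness SLI from probes: {correctness_sli(results):.2%}")
```

Note that the corrupt backend never raises an error, so an availability SLI based on response status alone would not catch it; only the output check does.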

Generating SLOs with Sloth 

If you’re using a Prometheus-compatible solution and want to manage SLOs in a consistent manner, then Sloth, an open source tool used to generate SLOs and SLIs based on Prometheus metrics, is a good solution. The tool works by generating a set of recording rules, Grafana dashboards, and alerts based on a set of time series that a user inputs. Our customer success team has defined some best practices around Sloth for our customers. If you’re interested in learning more, feel free to reach out to contact@chronosphere.io.

What’s next for SLOs at Chronosphere? 

As we continue to build an engineering culture around SLOs at Chronosphere, the engineering and product teams will review SLOs and SLIs quarterly and make adjustments as needed. There are also plans to extend the prober with additional checks across all core product functionality.

Chronosphere is the only observability platform that puts you back in control by taming rampant data growth and cloud-native complexity, delivering increased business confidence. Built from the ground up for cloud-native scale and complexity, we provide our customers with industry-leading reliability and SLAs. If you’re interested in learning more, visit chronosphere.io or request a product demo.
