Service level objectives (SLOs) are a foundational element of DevOps practice and of a Site Reliability Engineer's (SRE's) responsibilities. They represent internal goals around the essential metrics of a service (e.g. latency, availability, and correctness). Along with service level agreements (SLAs) and service level indicators (SLIs), SLOs help to define and monitor the level of service and reliability that a software solution delivers to end users or customers.
As stated in the Google SRE handbook, “These measurements describe basic properties of metrics that matter, what values we want those metrics to have, and how we’ll react if we can’t provide the expected service. Ultimately, choosing appropriate metrics helps to drive the right action if something goes wrong, and also gives an SRE team confidence that a service is healthy.”
Before diving into how Chronosphere generates and monitors SLOs, let's define some key terms around these levels of service:

Service level indicator (SLI): a quantitative measure of some aspect of the level of service provided, such as request latency, error rate, or availability.

Service level objective (SLO): a target value or range of values for an SLI, measured over a period of time (e.g. 99.9% availability over 30 days).

Service level agreement (SLA): an explicit or implicit contract with customers that includes consequences if the SLOs it contains are missed.
Additional information can be found in the Google SRE handbook.
At Chronosphere, we have objectives around maintaining a highly reliable service for customers. To help us track and meet these objectives, we've published internal SLOs around three categories (availability, performance, and correctness) across our core product functionality for all data types (e.g. metrics and traces). This ensures we're answering important questions for our customers like:
We are currently focused on defining SLOs around the core functions (e.g. metric ingestion, aggregation, querying) as they directly impact our customers. And by focusing on the highest priority use cases, we are building the foundations for a culture of SLOs at Chronosphere.
In order to measure the success of our SLOs, we have several SLI metrics to determine the guardrails of each objective (i.e. what’s in and out of scope). Defining corresponding SLIs for SLOs enables our engineering team to more quickly quantify levels of risk and/or to assess the urgency of an outage.
Our SLIs help engineers answer the “where”, “what”, and “how” for our SLOs. For example:
SLOs measure some aggregate view of the SLI metrics, which means it's important to select an aggregation level that does not accidentally mask problems for a subset of the traffic. For example, think about monitoring the error rate across all API endpoints for a server. An infrequently used endpoint may fail 100% of the time, yet still get lost in the aggregate error rate as long as the overall objective (say, 99.9%) is met.
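To make this concrete, here is a toy calculation (the endpoints and traffic numbers are hypothetical, not real Chronosphere data) showing how a rarely used endpoint that fails every request can disappear inside an aggregate error rate that still meets a 99.9% objective:

```python
# Toy illustration with hypothetical numbers: a low-traffic endpoint failing
# 100% of the time is hidden by the aggregate error rate, which still
# satisfies a 99.9% objective.

requests = {                       # endpoint -> (total requests, failed requests)
    "/query":  (1_000_000, 500),   # 0.05% errors on the busy endpoint
    "/export": (200, 200),         # 100% errors on a rarely used endpoint
}

total = sum(t for t, _ in requests.values())
failed = sum(f for _, f in requests.values())

print(f"aggregate error rate: {failed / total:.4%}")   # ~0.07%, within a 0.1% budget
for endpoint, (t, f) in requests.items():
    print(f"{endpoint}: {f / t:.2%} errors")           # /export fails every request
```

Splitting the SLI along an additional dimension (here, the endpoint) surfaces the failure immediately instead of letting the busy endpoint drown it out.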
All Chronosphere SLOs have dimensions for cluster and customer, and additional dimensions are added as needed. For example, if a code path significantly changes how traffic is handled, it gets its own dimension.
As described in the Google SRE handbook, we monitor our SLOs by error budget burn rate rather than by individual SLIs (i.e. recent error rates). This helps ensure that our alerts are actionable while also reducing the noise from events that are not significantly impacting the error budget.
Error budgets are a tool for balancing service reliability against the pace of innovation. Changes are a major source of instability, and development work on features competes with development work on stability. The error budget forms a control mechanism for diverting attention to stability as needed. We calculate the error budget as 1 minus the SLO of the service; for example, a 99.9% SLO service has a 0.1% error budget.
When monitoring SLOs, we look at the burn rate of an error budget. The burn rate is essentially the fraction of the error budget consumed in a given time window, scaled by the ratio of the SLO period to that window (see the sketch below).
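Here is a minimal sketch of that arithmetic. The 99.9% objective, the 30-day SLO period, and the example numbers are illustrative assumptions, not our production configuration:

```python
# Minimal sketch of the error-budget and burn-rate arithmetic described above.
# The 99.9% objective and 30-day SLO period are illustrative assumptions.

SLO = 0.999                       # 99.9% objective
SLO_PERIOD_HOURS = 30 * 24        # assumed 30-day SLO period

ERROR_BUDGET = 1.0 - SLO          # 0.1% of events may fail over the period


def burn_rate(bad_events: float, total_events: float, window_hours: float) -> float:
    """Burn rate over an alerting window: the fraction of the error budget
    consumed in the window, scaled by the ratio of the SLO period to the
    window. A burn rate of 1 exhausts the budget exactly at the end of the
    SLO period; 14.4 exhausts a 30-day budget in roughly two days."""
    window_error_rate = bad_events / total_events
    budget_consumed = (window_error_rate / ERROR_BUDGET) * (window_hours / SLO_PERIOD_HOURS)
    return budget_consumed * (SLO_PERIOD_HOURS / window_hours)


# Example: a 1.44% error rate over a one-hour window burns the budget
# 14.4x faster than the SLO allows.
print(burn_rate(bad_events=144, total_events=10_000, window_hours=1))  # ~14.4
```

Alerting on high burn rates (rather than on raw error rates) is what keeps the alerts tied to real budget impact.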
If a budget is ever exhausted, there are policies in place to halt development, run a post-mortem, and discuss potential adjustments to the SLO.
To ensure correctness for overall metric ingestion, we have probers in each customer environment that continuously run "tests" and verify the correct output. For example, if a service successfully returns responses but those responses are incorrect, the service is still not healthy. So to better ensure correctness, we use black-box probing to control the inputs and verify the outputs. This process is similar to running unit tests in production.
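For illustration only, a black-box probe for ingest correctness might look something like the sketch below. `write_metric` and `query_metric` are hypothetical stand-ins for a service's public ingest and query APIs; this is not our actual prober code.

```python
# Illustrative black-box probe sketch, not Chronosphere's actual prober.
# `write_metric` and `query_metric` are hypothetical stand-ins for the
# service's public ingest and query APIs.
import time
import uuid


def probe_ingest_correctness(write_metric, query_metric) -> bool:
    """Push a known sample through the public ingest path, read it back
    through the public query path, and verify the output matches the input."""
    probe_series = f"prober_canary_{uuid.uuid4().hex}"  # unique series per run
    expected_value = 42.0

    write_metric(name=probe_series, value=expected_value, timestamp=time.time())
    time.sleep(5)  # give the sample time to flow through ingestion

    observed = query_metric(name=probe_series)
    return observed is not None and abs(observed - expected_value) < 1e-9
```

Run continuously, the pass/fail results of probes like this form the probe SLIs discussed below.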
While probes help cover gaps in the monitoring of live user traffic, they are not a replacement for live-traffic SLIs, which measure at least some portion of the actual user experience. At Chronosphere, we operate under the recommendation that an SLO use as many live-traffic SLIs as possible, with probe SLIs covering any holes in the monitoring process.
If you're using a Prometheus-compatible solution and want to manage SLOs in a consistent manner, Sloth, an open source tool for generating SLOs and SLIs based on Prometheus metrics, is a good option. It works by generating a set of recording rules, Grafana dashboards, and alerts based on the time series a user specifies.
As we continue to build an engineering culture around SLOs at Chronosphere, the engineering and product teams will review SLOs and SLIs quarterly and make adjustments as needed. There are also plans to extend the prober to add additional checks across all core product functionalities.
Chronosphere is the only observability platform that puts you back in control by taming rampant data growth and cloud-native complexity, delivering increased business confidence. Built from the ground up for cloud-native scale and complexity, we provide our customers with industry-leading reliability and SLAs.
If you are interested in learning more about service level objectives or service level indicators, request a product demo today.