Service Level Key Terms
- Service Level Objective (SLO): A target level of service for a particular capability over a period of time, set to reflect the user’s experience.
- Error Budget: The complement of an SLO, reflecting the amount of unreliability users will tolerate.
- Service Level Indicator (SLI): The numerical measurement, or metric, used to reflect the quality of the user experience.
- Service Level Agreement (SLA): The legally binding contract between a company and its customers that outlines what service is offered, commits to a level of service, and often spells out consequences for failing to meet that commitment.
Reliability Is The Most Important Feature
Online services and apps play an integral role in our everyday lives; as our reliance on technology increases, so do our expectations for digital experiences. If we go to use a service and it isn’t available, fast enough, or error-free, we become confused, angry, and frustrated, and we lose a bit of trust in that service. Occasional service disruptions will be forgiven, but only up to a point. Think of user trust as a form of currency: good experiences earn more trust and bad experiences deplete it.
For example, I have had my Gmail account for over a decade and have only one hazy memory of an outage that affected me many years back. That’s pretty impressive! Google gave a lot of thought to how to balance the need to innovate and deliver new functionality with delivering high availability and reliability at scale. A new role was created, Site Reliability Engineer (SRE), which took a different approach to operations by applying software engineering skills and practices to maintaining, scaling, and monitoring systems.
From SRE came the practice of setting service level objectives and error budgets, based on the notion that availability is a prerequisite to success. Service level objectives set specific reliability targets that allow an organization to measure how well it is maintaining user trust, conventionally recorded as some percentage under 100%. An error budget is what is left over, calculated as 1 – SLO, and represents the amount of service degradation or disruption that users will tolerate. When the error budget is fully spent, that is an urgent signal to direct efforts toward stabilizing the service and increasing reliability. When there is room in the error budget, that is a sign for teams to invest in innovation, experimentation, and delivering new features without jeopardizing user trust.
If you’ve experienced conference-driven development or been told “we already promised the market this feature would be out in April,” you understand the pressure engineers are under. Organizations are reluctant to shift engineering priorities away from innovation toward reliability unless there’s strong evidence to do so. SLOs offer a way to discuss and agree up front which aspects of the service are most critical to users and what minimum level of reliability is acceptable. Meanwhile, error budgets make it clear what risks an organization is accepting if it does not prioritize stability work.
Now we’ll take a closer look at SLOs, error budgets, SLIs, and SLAs.
Wait, What’s a Service?
If you are, or have been, a developer, you may hear “service” and instantly think of an individual microservice or monolith that you write code for, build, deploy, and operate. But when it comes to service level agreements, objectives, and indicators, “service” is often used in the economic sense as in “goods and services”.
Let’s think about the service that a company like Netflix provides: the ability to stream content on demand.
From the user’s point of view, if there is an issue streaming, they only care about how the issue affects them. They do not care if 99.9% of other users were able to successfully stream or that the issue was a brief blip that only lasted one minute. They care about their specific and individual experience of the service and that they couldn’t stream the latest season of Selling Sunset. From a technical point of view, we know that streaming as a service requires the successful coordination and interaction of dozens or hundreds of components up and down the stack.
The better you are able to define, understand, and explain your system in terms of the user experience, the better you will be able to effectively communicate with your colleagues in PM, Sales, Marketing, etc. and work together to set realistic and meaningful reliability targets.
Service Level Agreements: Satisfaction Guaranteed or Your Money Back!
SLAs aren’t pinky promises; they are legally binding agreements between your company and its customers or users that outline what level of service the company has committed to providing and what remedies exist if that level of service falls short. Remedies include providing service credits, issuing partial or full refunds, or allowing customers to terminate their contract early, among others. Breaching the terms of a service level agreement can have a far-reaching impact beyond the relationship with any single customer: severe or repeated breaches can tarnish the company’s credibility, erode the existing customer base, and hurt its ability to attract new prospective customers.
The level of service promised in an SLA is what your company is confident it can commit to delivering, and it covers only the essential workflows and use cases. It is often expressed as availability — whether a system can fulfill its intended function at a specific point in time. Other SLA metrics include uptime, latency, throughput, and customer support response time. If you aren’t familiar with what your company provides as an SLA, ask someone for a sample contract or review publicly available terms like Chronosphere’s Availability Commitment.
Service Level Indicators: Show Me The Numbers!
An SLI is a quantitative measure of service performance that is used to determine whether SLOs and SLAs are met. These indicators should be measured directly from, or as close as possible to, the user’s perspective; in most cases, the further or more indirect the measurement is from the user experience, the less useful it is as an indicator. A meaningful SLI has a predictable relationship with customer happiness, answers the question “Is the service working as the user expects?”, and is aggregated over a reasonably long period of time to smooth out noise.
Note: this is when the term “service” gets overloaded and can cause confusion. SLI measurements can be gathered directly from an individual application or infrastructure component — a service in the technical sense — but need to indicate the quality of service in the economic sense.
The Google SRE books advocate measuring SLIs as a uniform percentage: the ratio of good events to valid events, multiplied by 100. Another option is tracking SLIs with percentiles, which is common for latency. How you define what makes an event good or valid depends on the type of technical service being provided and what metrics you can gather.
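As a rough illustration, here is a minimal Python sketch of the two SLI styles described above; the sample data, the “good event” definition, and the percentile choice are all made-up assumptions:

```python
# Minimal sketch of the two common SLI styles: ratio-based and percentile-based.
# The sample data and thresholds below are made up for illustration.

def ratio_sli(good_events: int, valid_events: int) -> float:
    """SLI as a percentage: good events / valid events * 100."""
    if valid_events == 0:
        return 100.0  # no valid traffic in the window; treat as perfect
    return good_events / valid_events * 100

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Availability-style SLI: non-5xx responses count as "good".
responses = [200, 200, 500, 200, 404, 200, 200, 503, 200, 200]
good = sum(1 for code in responses if code < 500)
print(f"Availability SLI: {ratio_sli(good, len(responses)):.1f}%")  # 80.0%

# Latency-style SLI: p95 of observed request latencies in milliseconds.
latencies_ms = [120, 95, 180, 210, 150, 130, 90, 400, 160, 140]
print(f"p95 latency SLI: {percentile(latencies_ms, 95)} ms")
```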
Here is a table with common SLIs for user facing services like an ecommerce storefront:
| Question | Indicator | Example SLI |
|---|---|---|
| Could we respond to the request? | Availability | Login page response code != 500 |
| How long did it take to respond to the request? | Latency | p95 login page latency < 200ms |
| How many requests could be handled? | Throughput | Total requests to login page processed per minute |
In addition to availability, latency, and throughput, there are a number of other indicators that are useful for data pipeline or storage services. These include durability, freshness, correctness, quality, and coverage.
What’s the difference between an SLI and an SLO?
When it comes to latency, an SLI answers “How fast was the service for this request?” and the SLO answers “Was the service fast enough for this request?”
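To make that distinction concrete, here is a hypothetical Python sketch: the SLI measures how fast requests actually were, and the SLO judges whether enough of them were fast enough over the window (the 200 ms threshold and 95% target are made-up numbers):

```python
# Hypothetical latency SLI/SLO check; thresholds are illustrative only.
latencies_ms = [120, 95, 180, 210, 150, 130, 90, 400, 160, 140]

THRESHOLD_MS = 200   # what "fast enough" means for a single request
SLO_TARGET = 95.0    # percent of requests that must be fast enough

# SLI: "How fast was the service?" -> share of requests under the threshold.
fast_requests = sum(1 for ms in latencies_ms if ms < THRESHOLD_MS)
sli = fast_requests / len(latencies_ms) * 100

# SLO: "Was the service fast enough?" -> compare the SLI to the target.
print(f"SLI: {sli:.1f}% of requests under {THRESHOLD_MS} ms")
print("SLO met" if sli >= SLO_TARGET else "SLO missed")
```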
What is an SLO?
A Service Level Objective is a specific, measurable target for the performance or reliability of a service, often used to ensure it meets defined user expectations.
Reliability, much like security, is not the sole purview of any individual or team. It is a shared responsibility that requires collaboration and coordination from everyone involved in delivering the service. It’s sort of like in elementary school when the teacher promised there would be a pizza party if every student was on their best behavior for the day. It only takes a single classmate acting up to ruin it for everyone.
In a technical sense, it does matter which “classmate” or component is acting up, because that determines which team or engineers will be responsible for investigating and mitigating. Today’s complex systems have many parts to keep track of: CDNs; mobile, web, and desktop front ends; backend services like API gateways, load balancers, and datastores; all the way down to underlying infrastructure services like networking and container orchestration/Kubernetes. In a business sense, however, it only matters that the overall user experience was negatively affected — this can surface as lower conversion rates and negative brand perception in the short term, and as higher customer churn, lost revenue, and lost opportunities in the longer term.
Here’s an example: Your company promises 99.9% availability on a monthly basis to customers in the SLA. Internally, your target is more ambitious, with an SLO targeting four nines, or 99.99%, availability for the same time period. If the reporting metrics measure actual availability at 99.96%, the error budget has run out and the SLO was violated, but you are still meeting your SLA. This signals to the organization that users will not tolerate additional service disruptions for the time being. Continuing as-is without investing in stability work risks breaching the SLA.
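A minimal sketch of that check in Python, using the numbers from the example above:

```python
# Internal SLO is stricter than the external SLA, so the SLO trips first.
SLA_TARGET = 99.9              # promised to customers, contractually binding
SLO_TARGET = 99.99             # internal, more ambitious target
measured_availability = 99.96  # what monitoring reported for the month

slo_met = measured_availability >= SLO_TARGET
sla_met = measured_availability >= SLA_TARGET

print(f"SLO ({SLO_TARGET}%): {'met' if slo_met else 'missed -- error budget exhausted'}")
print(f"SLA ({SLA_TARGET}%): {'met' if sla_met else 'BREACHED'}")
# Here the SLO is missed while the SLA is still met -- the signal to
# prioritize stability work before the SLA itself is at risk.
```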
What’s the difference between SLO vs SLA?
An SLA is an agreement between you and your customers with consequences that can be enforced. An SLO is an internal goal set by the company. While error budgets offer an accountability mechanism for staying within an SLO, they are not legally binding. Targets in SLOs are set higher and stricter than what is externally promised in an SLA. This means that when the error budget runs low, the organization has wiggle room to adjust priorities and focus on reliability efforts while still meeting its SLA.
What’s an Error Budget?
There is no blank check when it comes to user trust, and an error budget tracks how much service disruption users will tolerate. When monitoring SLOs, it is the error budget burn rate that is used to set up paging alerts, not an individual SLI. This helps ensure alerts are actionable and reduces the noise from events that do not significantly affect the user experience.
To calculate an error budget, take 1 minus the SLO of a service: a 99.9% SLO has a 0.1% error budget. The burn rate measures how quickly that budget is being consumed over a given time window relative to the SLO period: a burn rate of 1 means the budget would be spent exactly by the end of the period, while higher burn rates mean the budget will be exhausted sooner.
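Here is a minimal Python sketch of those two calculations; the 99.9% SLO, the 30-day period, and the sample request counts are illustrative assumptions:

```python
# Illustrative error budget and burn rate math for a 99.9% SLO.
SLO = 99.9
ERROR_BUDGET = (100 - SLO) / 100          # 0.001 -> 0.1% of requests may fail

# Error budget expressed as "bad minutes" for a 30-day SLO period.
period_minutes = 30 * 24 * 60             # 43,200 minutes
budget_minutes = period_minutes * ERROR_BUDGET
print(f"Allowed bad minutes per period: {budget_minutes:.1f}")  # 43.2

# Burn rate over a recent window: observed error rate divided by the budgeted
# error rate. A burn rate of 1 spends the budget exactly over the full period;
# a burn rate of 10 would exhaust it ten times faster.
window_requests = 100_000
window_errors = 1_000
observed_error_rate = window_errors / window_requests   # 0.01 = 1%
burn_rate = observed_error_rate / ERROR_BUDGET
print(f"Burn rate: {burn_rate:.1f}")      # 10.0 -> time to page someone
```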
Real World SLA vs SLO Explained
Let’s take Google Cloud as a real world example of SLAs and SLOs in action. Google Cloud offers a fully managed Kubernetes experience — GKE (Google Kubernetes Engine) Autopilot — for companies interested in the benefits of cloud native container orchestration without the burden of operational administration. When you lease a GKE cluster, what exactly is the service being provided, and what level of service can you expect? What happens if service is not provided as agreed? All of these questions are answered in the Service Level Agreement for GKE, which is an excellent example of an easy to understand agreement.
The Service
As a customer you might think about Autopilot GKE as a singular product. However, in the SLA, Google breaks it down into two separate services — one for the control plane and one for the pods:
- Autopilot Cluster (control plane) – the Kubernetes API provided by the customer’s cluster as long as the minor version of GKE deployed in the cluster is currently offered in Stable/Regular channels.
- Autopilot Pods in Multiple Zones – compute capacity provided by GKE Autopilot to schedule user pods, when pods are scheduled across 2+ zones in the same region.
To simplify our example, let’s focus only on one service, “Autopilot Cluster (control plane)”, or the brains of Kubernetes.
The SLO – Monthly Uptime Percentage
What level of service can you expect? Google commits to delivering a 99.95% Monthly Uptime Percentage, which is defined as the “total number of minutes in a month, minus the number of Downtime minutes suffered from all Downtime Periods in a month, divided by the total number of minutes in a month.”
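As a rough sketch of that definition in Python (assuming a 30-day month and a hypothetical downtime figure):

```python
# Monthly Uptime Percentage per the definition above, for a 30-day month.
# The 12 minutes of downtime is a made-up figure for illustration.
minutes_in_month = 30 * 24 * 60        # 43,200
downtime_minutes = 12                  # sum of all qualifying Downtime Periods

uptime_pct = (minutes_in_month - downtime_minutes) / minutes_in_month * 100
print(f"Monthly Uptime Percentage: {uptime_pct:.3f}%")   # 99.972%

SLO_TARGET = 99.95
print("SLO met" if uptime_pct >= SLO_TARGET else "SLO missed")
```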
Cautionary note: Just because you cannot connect to your cluster’s Kubernetes API, doesn’t automatically mean that counts as Downtime! According to the SLA, Downtime is specifically defined as the “loss of connectivity or Kubernetes API access to all applicable running clusters with the inability to launch replacement clusters in any Zone” and Downtime Period is five or more consecutive minutes of Downtime.
The clause “…with the inability to launch replacement clusters in any Zone” is interesting because while you may think of the “Autopilot Cluster (control plane)” service in terms of a specific cluster like “gke-production-us”, that is not what you actually agreed to! To Google, the service is the ability to connect to the Kubernetes API for Autopilot Clusters across their infrastructure Zones. So if you happen to lose connectivity with cluster “gke-production-us” that on its own does not qualify as Downtime; it only counts if you lose connectivity and cannot launch and connect to a new Autopilot Cluster successfully for at least 5 consecutive minutes.
Google also excludes several things from counting as Downtime: Scheduled Downtime; loss of connectivity or other issues related to the underlying Compute Engine instances, which are covered by other SLAs; and failures of Kubernetes nodes or the pods running on those nodes.
The Remedy
What happens if Google does not meet the SLO of 99.95% Monthly Uptime Percentage for a given month? You as a customer are entitled to receive Financial Credits, which are proportional to the amount of downtime. The bigger the SLO miss, the bigger the credit.
Reading the SLA carefully pays off — being entitled to credits doesn’t mean you’ll automatically get them in the event of an SLO miss. The agreement states that it is your responsibility to track connectivity status, notify Google of the SLO miss, and provide evidence from server logs or monitoring reports within 30 days.
An Example
For the month of September, with 30 days, there are 43,200 minutes. To meet the 99.95% Monthly Uptime Percentage SLO for Autopilot Cluster (control plane), there can be only 0.05% Downtime, which equates to just 21.6 minutes. In the event of an SLO miss, depending on how big the miss is, here are the credits:
| Monthly Uptime Percentage (Autopilot Cluster control plane) | Amount of Downtime in September | Percentage of Monthly Bill Credited to Future Monthly Bill |
|---|---|---|
| 99.0% to < 99.95% | 21.6 – 432 minutes | 10% |
| 95.0% to < 99.0% | 432 – 2,160 minutes | 25% |
| < 95.0% | 2,160+ minutes | 50% |
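A minimal sketch of how downtime maps to the credit tiers in the table above (again assuming a 30-day month; the tier boundaries come from the table, while the downtime figure is hypothetical):

```python
# Map a month's downtime to the financial credit tiers in the table above.
minutes_in_month = 30 * 24 * 60   # 43,200 for September
downtime_minutes = 500            # hypothetical downtime for the month

uptime_pct = (minutes_in_month - downtime_minutes) / minutes_in_month * 100

if uptime_pct >= 99.95:
    credit = 0      # SLO met, no credit owed
elif uptime_pct >= 99.0:
    credit = 10
elif uptime_pct >= 95.0:
    credit = 25
else:
    credit = 50

print(f"Uptime: {uptime_pct:.2f}% -> {credit}% of the monthly bill credited")
```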
You can see that Google has taken the time to think through and separate GKE Autopilot clusters into multiple services and provide detailed definitions for what qualifies as downtime.
Please note: This walkthrough was just an example and does not constitute legal advice; please refer to the actual SLA your company signed for precise details. By being crystal clear about what is and isn’t covered by a service, what level of service Google can commit to providing, and the course of action when service falls below the expected target, Google gives you as a customer peace of mind: you know what uptime you can depend on and that there is clear recourse in case of issues.
Wrap Up
The goal of monitoring has always been to know whether or not a system is healthy and functional enough — which is more challenging to determine when managing a complex distributed system of many dynamic and interdependent entities.
Somewhere on the spectrum between 0% — where a service never works — and the impossible state of 100% — where a service always works forever — there is a Goldilocks zone of “just enough” reliability. SLOs and error budgeting offer organizations an early warning when reliability is trending in the wrong direction. While setting SLOs will not magically make your system more reliable, they will help your operators, developers, and product managers gauge when and where to invest their efforts. There is going to be a learning curve as you and your colleagues adopt these practices. As you progress on your journey, keep these wise words from Alex Hidalgo in mind:
“I’d tell people that SLOs aren’t a ‘thing you do’. It’s a different approach to thinking about the reliability of your services and it’s a new dataset you can use to have conversations and drive decisions. Don’t worry if you have trouble finding the appropriate targets at first and don’t worry if you can’t immediately use error budgets either. That’s not the point. The point is to stop relying on the telemetry you currently have and develop more meaningful numbers in terms of user experience.” – interview on SLOs.
To dive deeper into the world of monitoring SLOs and error budgets, we recommend checking out Alex’s book Implementing Service Level Objectives, the SRE Workbook from Google, and the 2023 SLOConf playlist The Business of SLOs.
When it comes to observability, platforms that struggle with reliability and go down when you need them provide zero value. That’s why Chronosphere offers a three nines (99.9%) uptime SLA and strives to overachieve. Last year, we were proud to deliver greater than four nines (99.99%) uptime to all of our customers every month. That’s less than 1 hour of downtime for the entire year! Read more about how our system was designed with reliability in mind and our rigorous approach to defining uptime in our SLA in Availability Is Our Highest Priority.