What is site reliability engineering (SRE) and what do the acronyms that come along with it – SLI, SLO, and SLA – stand for?
On: Dec 1, 2022
When you move business-critical applications or infrastructure to the cloud, site reliability engineering (SRE) emerges as an extremely important enterprise function. But what is SRE, and what do the acronyms that come along with it – SLI, SLO, and SLA – stand for?
SRE is what you get when you apply software engineering skills and practices to operating, scaling, and maintaining the system from the edge to the infrastructure. The goal: reliability, or keeping systems running and available for your users. The other acronyms are all ways to quantify your commitments to system uptime and measure how successful your SRE team is at meeting them.
Here’s an example. Your internal goal for system availability is ambitious: four nines, or 99.99%. That’s your SLO. However, you want to give yourself a little wiggle room with users (all the people who depend on your systems, such as employees, customers, or even partners), so you promise to keep systems up only 99.9% of the time. That’s your SLA. Finally, when you track the actual uptime and response rates, you find you are achieving 99.96% availability. That’s your SLI, and it means you meet your SLA but not your SLO. The result: Users are happy, but there’s room for your SRE team to improve.
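The comparison in this example can be sketched in a few lines of Python. This is a minimal illustration only; the function name `evaluate` and its return shape are hypothetical, and the numbers are the ones used in the example above (a 99.99% SLO, a 99.9% SLA, and a measured 99.96%).

```python
def evaluate(sli: float, slo: float, sla: float) -> dict:
    """Compare a measured availability (SLI) with the internal
    target (SLO) and the promise made to users (SLA).
    All values are percentages."""
    return {
        "meets_slo": sli >= slo,  # internal goal reached?
        "meets_sla": sli >= sla,  # external commitment kept?
    }


# Measured 99.96% against a 99.99% SLO and a 99.9% SLA.
result = evaluate(sli=99.96, slo=99.99, sla=99.9)
print(result)  # {'meets_slo': False, 'meets_sla': True}
```

Keeping the SLA looser than the SLO is what creates the buffer: the team sees a missed SLO before users see a missed SLA.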
Now we’ll take a closer look at SLIs, SLOs, and SLAs, and recommend some best practices for each.
SLIs are the quantifiable measures of service reliability, such as throughput, latency, and correctness. They are directly measurable and observable. In short, SLIs are how you measure the items you’ve established as important in your SLOs and SLAs to determine whether you’ve met them.
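As a rough sketch of how an SLI might be computed from raw measurements, the Python below derives an availability SLI and a latency SLI from a list of request records. The `Request` type, its fields, and the 300 ms threshold are assumptions made for illustration; a real system would read these values from its monitoring data.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool            # did the request succeed?
    latency_ms: float   # how long it took to serve


def availability_sli(requests: list) -> float:
    """Percentage of requests that succeeded."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    good = sum(1 for r in requests if r.ok)
    return 100.0 * good / len(requests)


def latency_sli(requests: list, threshold_ms: float) -> float:
    """Percentage of requests served at or under the threshold."""
    if not requests:
        raise ValueError("no requests in the measurement window")
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return 100.0 * fast / len(requests)


sample = [
    Request(ok=True, latency_ms=120),
    Request(ok=True, latency_ms=340),
    Request(ok=False, latency_ms=900),
    Request(ok=True, latency_ms=80),
]
print(availability_sli(sample))   # 75.0
print(latency_sli(sample, 300))   # 50.0
```

Both indicators are expressed as "good events divided by total events", which makes them easy to compare against a percentage target.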
SLI best practices include:
SLAs are the promises you make to users of a system that guarantee a specified and measurable level of availability and performance to them. Typically, penalties are triggered if you don’t meet SLAs. These agreements can be legally binding, and they can be with internal users — such as business departments or employees — or with external parties like customers or partners.
In short, you can think of SLAs as SLOs, but with consequences. For example, you might offer customers an SLA of four nines (99.99%) for a system, which allows roughly 52.6 minutes of downtime per year, and reduce their payments by an agreed-upon proportion if you fail to meet that uptime.
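The downtime that an availability figure allows is simple arithmetic, sketched below in Python. The helper name `allowed_downtime` is hypothetical, and the calculation assumes a 365-day year; using a different period (a month, a quarter, a leap year) changes the result.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float,
                     period: timedelta = timedelta(days=365)) -> timedelta:
    """Downtime permitted over `period` at the given availability."""
    unavailable_fraction = (100.0 - availability_pct) / 100.0
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime(target).total_seconds() / 60
    print(f"{target}% -> {minutes:.1f} minutes per year")
```

Under these assumptions, 99.9% leaves about 8.76 hours per year and 99.99% leaves about 52.6 minutes, which shows how quickly the allowance shrinks with each added nine.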
SLAs are important because they clearly set expectations between you and your users — expectations that are quantified, measured, and enforced with consequences if missed.
Here are some best practices for establishing SLAs.
SLOs are internal targets for keeping services available and performing as users need. The purpose of an SLO is to measure the customer experience, protect the company from SLA violations (where an SLA applies), and create a shared understanding of reliability across product, engineering, and business leadership.
SLOs are complemented by error budgets: the rate of failure a system is allowed before it breaches its SLO. For example, if your SLO is 99%, you have a 1% error budget.
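To make the error budget concrete, here is a minimal Python sketch that turns an SLO into a count of tolerated failures and reports how much of that budget is left. The function names and the request counts are illustrative assumptions. The SLO is passed as a string and handled with `Decimal` so that values such as 99.9 do not pick up binary floating-point rounding error.

```python
from decimal import Decimal


def error_budget(slo_pct: str, total_events: int) -> int:
    """Number of failed events the SLO tolerates in the window."""
    budget_fraction = (Decimal(100) - Decimal(slo_pct)) / Decimal(100)
    return int(budget_fraction * total_events)


def budget_remaining(slo_pct: str, total_events: int,
                     failed_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = error_budget(slo_pct, total_events)
    if allowed == 0:
        return 1.0 if failed_events == 0 else 0.0
    return 1.0 - failed_events / allowed


# A 99% SLO over one million requests tolerates 10,000 failures.
print(error_budget("99", 1_000_000))            # 10000
# After 2,500 failures, three quarters of the budget is left.
print(budget_remaining("99", 1_000_000, 2_500))  # 0.75
```

A team could use the remaining fraction as a signal: plenty of budget left suggests room for riskier releases, while a nearly spent budget suggests prioritizing stability.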
SLOs and error budgets are important because they give engineering teams permission to innovate and take risks without affecting operations. Good SLOs offer developers sufficient space to try new things or improve existing systems, which can cause downtime, without making users unhappy.
In other words, SLOs offer a way to surface the risks to the user experience and reliability of a product or service. While setting SLOs will not magically make your system more reliable, they will help your operators, developers and product managers gauge when and where to invest their efforts. Reliability is often measured in percentages using a stated number of nines. For example, SLOs could promise:
Each “nine” requires more money and effort from your software engineering team. So as much as you’d like to offer 99.99% (or more!) reliability, you need to be realistic. But SLOs can be expressed in other ways too: the time it takes for your SRE team to respond to an issue, for example, or the application performance index (Apdex), an open standard that scores user satisfaction by comparing measured response times against a target threshold.
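For readers unfamiliar with Apdex, the sketch below shows the standard calculation in Python: responses at or under a target time T count as satisfied, responses between T and 4T count as tolerating (at half weight), and anything slower counts as frustrated. The sample response times and the 0.5-second threshold are made-up values for illustration.

```python
def apdex(response_times_s: list, threshold_s: float) -> float:
    """Apdex score in the range 0.0 (all frustrated) to 1.0 (all satisfied)."""
    if not response_times_s:
        raise ValueError("no samples to score")
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    tolerating = sum(
        1 for t in response_times_s
        if threshold_s < t <= 4 * threshold_s
    )
    # Frustrated samples (slower than 4T) contribute nothing to the score.
    return (satisfied + tolerating / 2) / len(response_times_s)


samples = [0.2, 0.4, 0.9, 1.5, 3.0]   # seconds
print(apdex(samples, threshold_s=0.5))  # 0.6
```

An SLO built on this score might read "Apdex of at least 0.9 over a rolling 30-day window", though the right target depends on the service.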
Here are some best practices to help you achieve the goals established by your SLOs:
SLIs, SLOs, and SLAs are key to measuring the customer experience of software-based businesses. They represent goals around the essential metrics of a service. These metrics help to define and monitor the level of service and reliability a system delivers to its users, whether internal, external, or both.
At Chronosphere, we have objectives for maintaining a highly reliable service for our customers. To help us track and meet these objectives, we’ve published internal SLOs around three categories (availability, performance, and correctness) across our core product functionality for all data types (metrics and traces). This ensures we’re answering important questions for our users including:
We are currently focused on defining SLOs around the core functions (for example, metric ingestion, aggregation, and querying) as they directly impact our customers. And by focusing on the highest-priority use cases, we are building the foundations for a culture of SLOs, SLAs, and SLIs.
As organizations move to the cloud and adopt microservices-based architectures, SLOs provide a way for SRE teams to set specific, measurable availability goals and track them (SLIs) to make sure users are receiving agreed-upon service levels (SLAs) within today’s highly complex cloud native environments.