When you move business-critical applications or infrastructure to the cloud, site reliability engineering (SRE) emerges as an extremely important enterprise function. But what is SRE, and what do the acronyms that come along with it – SLI, SLO, and SLA – stand for?
SRE is what you get when you apply software engineering skills and practices to operating, scaling, and maintaining the system from the edge to the infrastructure. The goal: reliability, or keeping systems running and available for your users. The other acronyms are all ways to quantify your commitments to system uptime and measure how successful your SRE team is at meeting them.
- SLI (service-level indicator) – The actual measurements of a system’s health.
- SLO (service-level objective) – Your organization’s internal goals for keeping systems available and performing up to standard.
- SLA (service-level agreement) – Your commitments (often legal) to your customers about system availability, response time in case of issues and the consequences if you don’t meet those commitments. (Your SLA will promise reliability that is at most equal to, but frequently less than, your internal SLO goal.)
Here’s an example. Your internal goal for system availability is ambitious: four nines, or 99.99%. That’s your SLO. However, you want to give yourself a little wiggle room with users (all the people who depend on your systems, such as employees, customers, or even partners), so you commit to keeping systems up only 99.9% of the time. That’s your SLA. Finally, when you track actual uptime and response rates, you find you are achieving 99.96% availability. That’s your SLI, and it means you meet your SLA but not your SLO. The result: Users are happy, but there’s room for your SRE team to improve.
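The comparison in this example can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation; the function name `compare_to_targets` and its parameters are hypothetical, and the numbers are the ones from the example above.

```python
def compare_to_targets(sli: float, slo: float, sla: float) -> dict:
    """Report whether a measured SLI meets the internal SLO and the external SLA.

    All three values are availability percentages (for example, 99.96).
    """
    return {"meets_slo": sli >= slo, "meets_sla": sli >= sla}


# Numbers from the example: SLO 99.99%, SLA 99.9%, measured SLI 99.96%.
result = compare_to_targets(sli=99.96, slo=99.99, sla=99.9)
print(result)  # {'meets_slo': False, 'meets_sla': True}
```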
Now we’ll take a closer look at SLIs, SLOs, and SLAs, and recommend some best practices for each.
A deeper dive into SLIs
SLIs are quantifiable, directly observable measures of service reliability, such as throughput, latency, and correctness. In short, SLIs are how you measure the items you’ve established as important in your SLOs and SLAs to determine whether you’ve met them.
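As a rough illustration, an SLI is often expressed as the percentage of "good" events out of all events. The sketch below assumes a hypothetical `Request` record and a hypothetical 300 ms latency threshold; real systems would pull these values from their own monitoring data.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool  # True if the response was successful and correct


def availability_sli(requests: list[Request]) -> float:
    """Percentage of requests that succeeded."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    good = sum(1 for r in requests if r.ok)
    return 100.0 * good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Percentage of requests served within the latency threshold."""
    if not requests:
        raise ValueError("need at least one request to compute an SLI")
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return 100.0 * fast / len(requests)


# Made-up sample data for illustration only.
sample = [
    Request(latency_ms=120, ok=True),
    Request(latency_ms=80, ok=True),
    Request(latency_ms=950, ok=True),
    Request(latency_ms=60, ok=False),
]
print(availability_sli(sample))               # 75.0
print(latency_sli(sample, threshold_ms=300))  # 75.0
```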
SLI best practices include:
- Agree on the processes and methodologies used to generate SLIs. This eliminates misunderstandings about the numbers and how they were measured.
- Keep it simple. You have the option of monitoring numerous items as SLIs. Avoid over-measuring to keep costs, effort, and confusion to a minimum.
- Find out what your users expect from your service. Use it to determine which indicators to collect to deliver what they want.
A deeper dive into SLAs
SLAs are the promises you make to users of a system that guarantee a specified and measurable level of availability and performance to them. Typically, penalties are triggered if you don’t meet SLAs. These agreements can be legally binding, and they can be with internal users — such as business departments or employees — or with external parties like customers or partners.
In short, you can think of SLAs as SLOs, but with consequences. For example, you might offer customers an SLA of four nines (99.99%) for a system, which allows roughly 52.6 minutes of downtime per year, and reduce their payments by an agreed-upon proportion if you fail to meet that uptime.
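To make the "SLO with consequences" idea concrete, here is a small sketch that turns a year's recorded downtime into an availability figure and checks it against the SLA. The flat 10% credit is purely a hypothetical value for illustration; actual penalty terms are whatever your agreement specifies. The year is assumed to have 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def measured_availability(downtime_minutes: float) -> float:
    """Availability percentage over one year, given total downtime."""
    return 100.0 * (1 - downtime_minutes / MINUTES_PER_YEAR)


def service_credit(availability: float,
                   sla_target: float = 99.99,
                   credit_rate: float = 0.10) -> float:
    """Fraction of the bill to credit back (hypothetical flat rate)."""
    return credit_rate if availability < sla_target else 0.0


# Two hours of downtime in a year misses a four-nines SLA.
availability = measured_availability(downtime_minutes=120)
print(f"{availability:.4f}%")        # 99.9772%
print(service_credit(availability))  # 0.1
```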
SLAs are important because they clearly set expectations between you and your users — expectations that are quantified, measured, and enforced with consequences if missed.
SLA best practices
Here are some best practices for establishing SLAs.
- Specify metrics that drive each party to do the right thing. Motivating both your team and your users to act appropriately is critical. Then everyone will do their part to ensure that the SLAs are met.
- Make sure you are measuring items within your control. Users may act in ways that make it impossible to meet your SLAs. Prevent this from happening by choosing measurements that truly reflect how you are managing the system.
- Choose metrics that are easy to get. In the perfect world, SLA metrics can be captured automatically with little effort or overhead.
- Less is more. Citing too many metrics as part of an SLA will force you to collect too much data and increase cost and effort.
- Be reasonable. SLAs must be reachable or they won’t be useful. Continually revisit and revise SLAs based on experience.
- Document everything. Thoroughly document everything agreed to between yourself and your users, including what SLIs are used and how frequently they will be checked.
A deeper dive into SLOs
SLOs are internal targets for keeping services available and performing as users need. The purpose of an SLO is to measure the customer experience, protect the company from SLA violations (where an SLA applies), and create a shared understanding of reliability across product, engineering, and business leadership.
SLOs are complemented by error budgets: the failure rate a system is allowed before it breaches its SLO. For example, if your SLO is 99%, you have a 1% error budget.
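The arithmetic behind an error budget is simple enough to sketch directly. The example below assumes a request-based SLO; the function names and the traffic numbers are hypothetical.

```python
def error_budget(slo_percent: float, total_requests: int) -> float:
    """Number of requests allowed to fail without breaching the SLO."""
    return total_requests * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_percent, total_requests)
    if budget == 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return 1 - failed_requests / budget


# A 99% SLO over one million requests allows 10,000 failures.
print(error_budget(99, 1_000_000))             # 10000.0
print(budget_remaining(99, 1_000_000, 2_500))  # 0.75
```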
SLOs and error budgets are important because they give engineering teams permission to innovate and take risks without affecting operations. Good SLOs offer developers sufficient space to try new things or improve existing systems, which can cause downtime, without making users unhappy.
In other words, SLOs offer a way to surface the risks to the user experience and reliability of a product or service. While setting SLOs will not magically make your system more reliable, they will help your operators, developers and product managers gauge when and where to invest their efforts. Reliability is often measured in percentages using a stated number of nines. For example, SLOs could promise:
- One nine – 90%
- Two nines – 99%
- Three nines – 99.9%
- Four nines – 99.99%
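To see what each nine costs in practice, the sketch below converts a count of nines into the yearly downtime it permits. It assumes a 365-day year, and the helper name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def downtime_minutes_per_year(nines: int) -> float:
    """Yearly downtime allowed by an availability target with this many nines."""
    return MINUTES_PER_YEAR * 10 ** -nines


for nines in range(1, 5):
    percent = 100 * (1 - 10 ** -nines)
    decimals = max(nines - 2, 0)  # 90, 99, 99.9, 99.99
    minutes = downtime_minutes_per_year(nines)
    print(f"{percent:.{decimals}f}% allows about {minutes:,.1f} minutes of downtime per year")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why the cost of reaching it rises so steeply.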
Each “nine” requires more money and effort from your software engineering team. So as much as you’d like to offer 99.99% (or more!) for reliability, you need to be realistic. But SLOs can be expressed in other ways too: the time it takes for your SRE team to respond to an issue, for example, or the application performance index (Apdex), an open standard that scores user satisfaction by comparing response times against a target threshold.
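As a sketch of the Apdex calculation: responses at or under a target time T count as satisfied, responses between T and 4T count as tolerating (worth half), and anything slower counts as frustrated. The sample response times and the 0.5-second target below are made up for illustration.

```python
def apdex(response_times_s: list[float], target_s: float) -> float:
    """Apdex score between 0 (all users frustrated) and 1 (all users satisfied)."""
    if not response_times_s:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for t in response_times_s if t <= target_s)
    tolerating = sum(1 for t in response_times_s
                     if target_s < t <= 4 * target_s)
    # Frustrated samples (slower than 4 * target) contribute nothing.
    return (satisfied + tolerating / 2) / len(response_times_s)


samples = [0.2, 0.3, 0.4, 0.5, 1.1, 1.9, 2.5, 3.0]
# 4 satisfied, 2 tolerating, 2 frustrated -> (4 + 1) / 8
print(apdex(samples, target_s=0.5))  # 0.625
```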
SLO best practices
Here are some best practices to help you achieve the goals established by your SLOs:
- Keep SLOs simple and realistic. Avoid the unachievable, but don’t make them too easy either. Establish SLOs that will genuinely make your customers happy.
- Start small. SLOs can seem like a daunting initiative to take on and can be difficult to staff. Try setting an SLO for one critical user journey in your system. Look back at historical data for the past month or quarter — would your service have met that SLO? Or, you can set up a synthetic monitor to regularly check the experience and see what you discover after a month or two of experimenting.
- Get SREs and the business users to collaborate. Make sure both the engineers who have to deliver on SLOs and the people who need the systems up to do their jobs agree on what those SLOs should be. The stakeholders involved can include product managers, engineering managers, software engineers, and SREs.
- Be flexible. SLOs aren’t written in stone, so embrace a practice of iterating. As your system architecture, product experience and other factors change, so should your SLOs.
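The "start small" advice above, looking back at historical data to see whether a candidate SLO would have been met, can be sketched as follows. The daily counts are invented for illustration, and the `(good, total)` tuple layout is an assumption about how the history might be stored.

```python
def would_have_met_slo(daily_counts: list[tuple[int, int]],
                       slo_percent: float) -> tuple[float, bool]:
    """Check a candidate SLO against historical (good, total) request counts."""
    good = sum(g for g, _ in daily_counts)
    total = sum(t for _, t in daily_counts)
    if total == 0:
        raise ValueError("no traffic recorded in the look-back window")
    achieved = 100.0 * good / total
    return achieved, achieved >= slo_percent


# Hypothetical three-day history for one critical user journey.
history = [(9_990, 10_000), (10_000, 10_000), (9_950, 10_000)]
achieved, met = would_have_met_slo(history, slo_percent=99.9)
print(f"{achieved:.2f}% achieved, SLO met: {met}")  # 99.80% achieved, SLO met: False
```

Running the same check with a few candidate targets is a cheap way to find one that is ambitious but achievable before committing to it.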
How Chronosphere works with SLIs, SLOs, and SLAs
SLIs, SLOs, and SLAs are key to measuring the customer experience of software-based businesses. They represent internal goals around the essential metrics of a service. These metrics help to define and monitor the level of service and reliability a system provides to its users, whether internal, external, or both.
At Chronosphere, we have objectives for maintaining a highly reliable service for our customers. To help us track and meet these objectives, we’ve published internal SLOs around three categories (availability, performance, and correctness) across our core product functionality for all data types (metrics and traces). This ensures we’re answering important questions for our users, including:
- “Can the service be used and trusted?” for availability
- “Is the service fast enough?” for performance
- “Does the service return accurate results?” for correctness
We are currently focused on defining SLOs around the core functions (for example, metric ingestion, aggregation, and querying) because they directly impact our customers. By focusing on the highest-priority use cases, we are building the foundations for a culture of SLOs, SLAs, and SLIs.
As organizations move to the cloud and adopt microservices-based architectures, SLOs provide a way for SRE teams to set specific, measurable availability goals and track them (SLIs) to make sure users are receiving agreed-upon service levels (SLAs) within today’s highly complex cloud native environments.