In today’s world, software and systems availability, performance and capacity are essential for delivering customer experience and generating revenue. Studies show that just one bad customer experience can result in churn. As a result, organizations are moving quickly to hire site reliability engineers (SREs) to support operations and boost SRE processes.
Anyone can read the canonical SRE book from Google to understand it in theory, but what does a high-performing SRE team look like in practice? There are really two aspects to effective site reliability engineering: a commitment to people and the practice, explained Matt Schallert, senior software engineer at Chronosphere and former SRE at Uber and Tumblr, during a recent Techstrong devops.com roundtable titled, “The State of SRE.”
“You need people that think about systems holistically, such as how to maintain uptime and measure service-level objectives (SLOs), especially in increasingly complex environments,” he said. “You also need the company to buy in.” This means admitting that the organization can do a better job of managing risk and uptime goals, and even being willing to halt the rollout of new features. A tall order.
Schallert joined Techstrong Group moderator Michael Vizard, and Uma Mukkara, head of chaos engineering at Harness; John Turner, manager of customer engineering at strongDM; and Nung Bedell, SRE/customer reliability engineer at Fairwinds, for a lively discussion of how site reliability engineering has changed from buzzword to coveted position and the role observability plays in the day-to-day life of an SRE.
The Evolution of SRE
Google manager Ben Treynor Sloss is widely credited with establishing the first SRE team in 2003 to address a few challenges:
- Web-scale reliability needs with little to no downtime or latency
- Massive, distributed infrastructure (before public cloud popularity)
- Lack of DevOps for code-driven reliability management
Since then, complexity has accelerated. Cloud computing, microservices, continuous delivery pipelines and automation are central to maintaining competitive business advantage.
“There’s a huge need for maintaining uptime,” Mukkara of Harness said. “Systems became complex, so organizations of all sizes are setting up SRE practices and getting what is needed for SREs to be thriving.”
Mukkara has seen a lot of developers trying to become SREs, which he believes is a good sign that the SRE function is innovating.
Turner of strongDM noted wider interest.
“I’ve seen strong developers that have an interest in infrastructure and also the opposite, including from the IT and operations side, with an interest in how to deploy, how to understand uptime and measure SLAs and SLOs.”
He doesn’t believe SREs necessarily have to be strong developers to start because so many tools exist to help. Rather, he noted that anyone who wants to become an SRE needs curiosity, a general interest in technology and an interest in how things work and connect.
Because the journey to the job is somewhat unique, none of the roundtable participants see a need for a specific SRE certification. However, they did acknowledge that adding interview questions about debugging and the handling of contrived failures could improve hiring success.
Schallert noted, “It’s definitely a chicken-and-egg problem — you get the experience by seeing it, but you have to get your foot in the door and see it in the first place.”
The experts hoped that more people, including women and other underrepresented minorities, with transferable skills would apply.
“Diversity will continue to expand our thinking,” Turner said, with Schallert adding, “It’s our responsibility as an industry and as a role to take that chance on people and help level them up. It’s the right thing to do and it benefits everyone more so than excluding people.”
Establishing the SRE Function
The Google SRE book definition is popular, but most organizations don’t have Google-sized problems, so they don’t need to copy the Google SRE function.
“That’s actually a good thing,” explained Schallert. “Organizations are realizing how to apply SRE practices to their own business and organizational needs to come up with what it means to them.”
For organizations interested in understanding whether there is a best model for the SRE function, how to keep engineers engaged and how automation and tools fit in, the panel offered insights.
“There are different models — embedded SREs versus a dedicated, separate function,” said Schallert. “When I was an SRE at Uber, I saw a mix of both. In all cases, it’s really important to push SRE as a partnership between developer teams and SRE teams or embedded SREs.”
Without that partnership with accountability, there can be a feeling of people throwing software over the wall and making another team own the deployment, reliability and everything.
“The way to have your cake and eat [it] too is to have those teams be partners and agree on their operating model,” said Schallert, advising organizations that SREs should be helping teams along, as opposed to taking ownership at some point.
“Have both teams involved in the whole end-to-end lifecycle of managing your software,” he said.
Mukkara agrees. “You start out as a separate function [engineering teams and SREs], but in the end, the success will really depend on if the accountability is spread across all the teams or not.”
Where Turner sees overlap is when “we have DevOps and then we have SRE, and they perform two different functions, but there’s a lot of crossover. I think they’re going to slowly start to separate, but then figure out where that middle piece is, and who owns what.”
Like agile development implementations, ultimately few organizations will have the same SRE practice or processes. Teams have to figure out what works best for them, according to the experts. Agreeing on, and then documenting or codifying the SRE vision, is a good first step, and that includes defining service-level indicators (SLIs) and SLOs.
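As a rough illustration of what codifying an SLI and an SLO might look like, the sketch below computes an availability SLI from request counts and the share of the error budget that has been used. The function names, the 99.9% target and the request counts are hypothetical and do not come from the roundtable.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully (the SLI)."""
    if total_requests == 0:
        return 1.0  # no traffic: treat as meeting the objective
    return good_requests / total_requests


def error_budget_consumed(sli: float, slo_target: float) -> float:
    """Share of the error budget used: 0.0 means none, 1.0 means all of it."""
    allowed_failure = 1.0 - slo_target
    if allowed_failure <= 0:
        raise ValueError("slo_target must be below 1.0")
    return (1.0 - sli) / allowed_failure


# Hypothetical numbers: 1,000,000 requests, 400 failures, 99.9% SLO.
sli = availability_sli(999_600, 1_000_000)
consumed = error_budget_consumed(sli, slo_target=0.999)
print(f"SLI={sli:.4%}, error budget consumed={consumed:.0%}")
```

A team could agree that once the consumed share approaches 1.0, reliability work takes priority over new features, which is one way to make the buy-in described earlier concrete.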
What about Tools?
Tools support team accountability, the roundtable participants concurred.
“Some of the ways that you can build shared culture are by teams using the same tools,” explained Schallert.
For example, your SRE team can develop the deployment software that both the developers and the SREs use. Or they can work together to develop the alerts and build experience with their monitoring systems. With everyone using the same tools and interacting with applications in a similar way, the tool set also helps to build that culture of shared responsibility.
“There’s a lot of internal tooling that’s been created by development teams and SRE teams,” according to Bedell. “There’re also a great deal of open source tools released by these companies that benefit the entire tech community.”
The experts mentioned observability and open telemetry tools as useful in supporting goals and kickstarting SRE practices toward success.
“Observability platforms have improved a lot,” said Mukkara. “There’s a lot of focus. There’s [been] a lot of investment in the last few years to get observability to the next level.”
When it comes to incident response, automation is key.
“The function of the SRE is to make sure that you are doing less manual work and eventually doing a lot of automated work,” said Mukkara.
He encourages teams to practice chaos engineering — a concept that involves intentionally causing failures to ensure that distributed systems are designed to be able to maintain high levels of availability.
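The idea can be tried at small scale before touching real infrastructure. The sketch below is a toy experiment, not how Harness or any chaos engineering product works: a wrapper makes a fraction of calls fail on purpose, and a retry helper shows whether the caller tolerates those failures. All names (`with_chaos`, `call_with_retries`, `fetch_profile`) and the 30% failure rate are hypothetical.

```python
import random


class InjectedFailure(Exception):
    """Raised by the chaos wrapper to simulate a dependency outage."""


def with_chaos(func, failure_rate: float, rng: random.Random):
    """Wrap func so that a fraction of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func, attempts: int):
    """Retry func up to `attempts` times; re-raise the last failure."""
    for attempt in range(attempts):
        try:
            return func()
        except InjectedFailure:
            if attempt == attempts - 1:
                raise


def fetch_profile():
    """Stand-in for a call to a real dependency."""
    return {"user": "example"}


rng = random.Random(42)  # seeded so the experiment is repeatable
flaky_fetch = with_chaos(fetch_profile, failure_rate=0.3, rng=rng)

successes = 0
for _ in range(1000):
    try:
        call_with_retries(flaky_fetch, attempts=3)
        successes += 1
    except InjectedFailure:
        pass
print(f"{successes} of 1000 calls succeeded despite injected failures")
```

Comparing the success count with and without retries gives a simple measure of how much resilience the retry logic actually buys.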
SREs On Call
At many companies, taking your turn at the on-call rotation is one of the defining characteristics of being an SRE. That doesn’t mean that SREs should be responding to every single page, warns Schallert.
“If SREs are responding to every page, they become professional firefighters,” he said, again advocating for shared responsibility. “That’s not good for them or the organization, and it’s why in the past few years, there’s been a shift in developers being on call for their own services as well.”
For example, he said, “If someone is deploying a new release or something they worked on a feature for and that feature is broken when it rolls out, the people that worked on it are better suited to debug it and be involved in that incident than throwing it over a wall and having someone else figure it out.”
Turner added, “In that shared responsibility model, when you have that understanding, that communication, that cooperation between the teams, you also see lower time to incident resolution.”
How to Avoid SRE Burnout
SRE teams can be responsible for solving big and small problems, troubleshooting major incidents and performing routine operational tasks. Because reliability is such a high-stakes issue for many companies, and because SREs tend to take on the bulk of on-call work, the role is one of the most susceptible to burnout in the engineering organization. The keys to avoiding burnout among these professionals, according to the roundtable participants, include:
- Effectively understanding and evaluating risk based on business impact
- Establishing clear goals and buy-in around SLAs
- Prioritizing issues and deciding which need to be resolved immediately
- Empowering SREs to push back on timelines and priorities
- Having a well-staffed team — not a single individual with sole responsibility
- Collaborating on issues
Schallert believes it’s important for organizations to begin quantifying the impact on the business because it “makes it really easy then to get other teams on board with taking the time to fix those issues.”
“You have to resist the urge to classify everything as a hot, high priority. Burnout happens quickly when every issue is your issue,” said Bedell.
Observability platforms can also help reduce SRE stress and burnout. The experts concurred that logging, metrics, alerts and messaging are important when things are failing or not coming out of the pipeline correctly.
Echoing its importance, Schallert cited a real-world example: “One model that we found really useful at Chronosphere is having those same signals throughout the pipeline. As a [code] commit is making its way to production, it goes through environments where we have the same alerts and SLO monitoring that we have in production set up on these environments. It doesn’t page a human if something goes wrong, but it tells the release tooling to stop letting that build make it through the pipeline until the fix is added.”
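The gating idea Schallert describes might be sketched as follows. This is not Chronosphere’s tooling: the `Alert` type, the `can_promote` function and the environment names are hypothetical, and a real pipeline would query its monitoring system for firing alerts rather than use a hard-coded list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    environment: str
    firing: bool


def can_promote(build_env: str, alerts: list[Alert]) -> bool:
    """Allow promotion only if no alert is firing in the build's current environment."""
    return not any(a.firing and a.environment == build_env for a in alerts)


# Hypothetical alert state for a build sitting in staging.
alerts = [
    Alert("latency_slo_burn", "staging", firing=True),
    Alert("error_rate_slo_burn", "staging", firing=False),
    Alert("latency_slo_burn", "production", firing=False),
]

if can_promote("staging", alerts):
    print("promote build to the next environment")
else:
    # No human is paged; the build simply stops until a fix lands.
    print("hold build: SLO alert firing in staging")
```

The design choice worth noting is that the same alert definitions serve two audiences: in production they page a person, while earlier in the pipeline they only block the release.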
Tooling can empower SREs to better know, triage and understand issues.
“The difference between what monitoring versus observability is,” said Schallert, “is the difference between helping you get more and more narrowed down to the root cause of an issue and showing you what components were involved, as opposed to just knowing something’s wrong.”
Learn More
What is the state of SRE? This lively discussion points to a practice and role that is vibrant and evolving. Listen to the full recording to learn more.