The secret to reducing on-call engineering team stress

Five ways to help alleviate your on-call engineering team’s stress and avoid burnout.

Chronosphere staff
Peter Simkins

Feeling the heat

Chronic workplace stress burns employees out. In a 2022 global survey by the McKinsey Health Institute, one in four employees, on average, reported experiencing burnout symptoms. Even engineers who are passionate about their jobs can feel stressed at work. As a tech leader, you can do something today to significantly reduce your on-call engineering team’s stress, and it’s relatively simple: transition to modern observability.

Site reliability engineers (SREs) are the guardians of production system health for those on the business front lines. They oversee and remedy outages and vet changes developers want to put into production. Technology operations can suddenly cease without them, causing major disruption to your business. Failure is not an option, so getting your on-call engineers the data they need quickly is critical to fixing the immediate issue and implementing a longer-term resolution.

However, the recent growth of cloud native and the explosion of containers mean that escalating an important issue often ties up one, or even several, of your developers or engineers, who are tasked with finding a resolution instead of getting home on time and doing what they love.

While still trying to “do more with less,” SREs are typically working more hours with fewer people, and their day-to-day concerns now involve not only remedying urgent issues but looking for sustainable ways to prevent downtime, including:

  • Setting up query-based alert filters to reduce the burden of creating hundreds of monitors
  • Easily creating dashboards that inform the right people at the right time
  • Stopping sprawl and federating data for more business insights
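To illustrate the first item, here is a minimal Python sketch of a query-based alert filter: one rule selects every matching series by label, so nobody has to create a separate monitor per service. The `Series` and `AlertRule` types, the label names, and the threshold are all hypothetical and are not taken from any specific product.

```python
from dataclasses import dataclass


@dataclass
class Series:
    """One time series: its labels plus its latest value (hypothetical shape)."""
    labels: dict
    value: float


@dataclass
class AlertRule:
    """A single rule that selects series by label instead of naming each service."""
    name: str
    matchers: dict      # label name -> required label value
    threshold: float

    def matches(self, series: Series) -> bool:
        # A series is selected only if every matcher agrees with its labels.
        return all(series.labels.get(k) == v for k, v in self.matchers.items())

    def firing(self, all_series: list) -> list:
        """Return the selected series whose value exceeds the threshold."""
        return [s for s in all_series if self.matches(s) and s.value > self.threshold]


if __name__ == "__main__":
    series = [
        Series({"service": "checkout", "env": "prod"}, 0.07),
        Series({"service": "search", "env": "prod"}, 0.01),
        Series({"service": "checkout", "env": "staging"}, 0.20),
    ]
    # One rule covers every production service, including ones added later.
    rule = AlertRule("HighErrorRate", {"env": "prod"}, threshold=0.05)
    for s in rule.firing(series):
        print(rule.name, s.labels["service"], s.value)
```

Because the rule matches on `env` rather than on a service name, a newly deployed production service is covered as soon as it emits the labeled series.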

All observability platforms provide on-call engineers with some task relief, yet a closer look at the most popular ones reveals shortcomings. Until now, the observability industry hasn’t innovated enough to reduce the stress on engineers taking front-line calls.

For example, organizations looking to standardize on open-source software (OSS) like Prometheus and OpenTelemetry (OTel) will often run their own observability solution in-house. However, the care and feeding needed to keep these DIY systems up and running creates busywork for your on-call engineers, who must connect the dots themselves to know about, quickly triage, and understand the issue at hand. That’s where running in-house open-source solutions falls short compared to SaaS observability platforms.

How cloud native observability sets your team up for success

The simplest way to give your on-call engineers back control and shift left—getting SREs (not on-call engineers) back to managing service-level agreements (SLAs)—is by adding a control layer. A modern observability platform fully enables your on-call experts by taming rampant data growth and cutting cloud native complexity. It increases confidence because your engineers can more effectively assess and operate the highly available and resilient applications powering business success.

Cloud native observability gets your on-call engineers to the goal of rapid remediation faster. By putting the tools and insights that your front-line engineering teams need at their fingertips, your organization also increases transparency and information availability for developers to approach incidents autonomously. It also lowers reliance on other team members, reducing frustration and empowering individuals with data.

Here are five ways leaders can leverage Prometheus and OTel data to reduce on-call stress:

1. Give your engineers a global view of all of the data they need, when they want it

Because fixing issues fast is a top priority, on-call engineers want to be able to access just the data they need—not too much, not too little—when they want it. Focusing only on the inputs and data (i.e., metrics and traces) doesn’t necessarily help teams navigate to solutions faster. Instead, it can slow them down and drive up costs, unnecessarily increasing mean time to resolution (MTTR). Teams with an automated way to focus on observability outcomes work faster and more accurately. A cloud native observability SaaS platform provides a global view of data, making it easier for engineers to track where all metrics are located in a single data store. On-call engineers receive alerts quickly with the right amount of data—without having to federate between multiple instances of Prometheus to view data centrally and query against specific data points. With faster access to the data they need, your engineers can accelerate the time to triage and resolve critical issues, and in the process, get some nights and weekends back.

2. Change your policies to keep data, especially metrics, longer

Prometheus is optimized for storing data in the short term, which is good for near-term issue resolution. The problem is that not all issues are open-and-shut cases, and no engineer wants to waste time reinventing the proverbial wheel. So how can you give your team members downsampling capabilities that allow them to retain and view data longer? A cloud native observability platform saves your engineers time today and tomorrow: engineers can store service metric samples for extended periods in a single place for querying or alerting while also having the capacity to save granular, short-term metrics. For traces and logs, retention times should match business needs, and your observability platform should make it easy to adjust them.
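As a rough illustration of what downsampling means here, the sketch below averages raw samples into fixed-width time buckets so that older data can be kept at a coarser resolution. The function name, the `(timestamp, value)` sample shape, and the choice of averaging are assumptions made for this example; real platforms often keep other aggregates as well, such as min, max, or count.

```python
from collections import defaultdict
from statistics import mean


def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns a list of (bucket_start, average) pairs sorted by time.
    """
    if bucket_seconds <= 0:
        raise ValueError("bucket_seconds must be positive")
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of the bucket it falls in.
        bucket_start = ts - (ts % bucket_seconds)
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


if __name__ == "__main__":
    # Samples scraped every 15 seconds, reduced to one point per minute.
    raw = [(0, 1.0), (15, 3.0), (30, 5.0), (60, 10.0), (75, 20.0)]
    print(downsample(raw, 60))
```

Keeping the raw 15-second samples for a short window and only the per-minute (or per-hour) averages for the long term is one way to get both granular recent data and affordable history.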

3. Ensure your engineers have a highly available solution, delivering both real-time and historical data

To effectively troubleshoot mission-critical services, your on-call engineers need real-time service monitoring and access to historical data. Yet most self-managed OSS observability solutions aren’t highly available by design, meaning if they go down, teams can lose valuable time and money before an issue is resolved. If you can’t afford to lose any telemetry, a fully managed platform is the better fit because it helps prevent data gaps and shields you from node failures while front-line engineers service mission-critical applications that demand ultra-fast debugging. Remember, an observability solution doesn’t just collect data; it also presents data and provides highly available APIs that connect tools for visibility into your business.

4. Automate to help them focus on fixes and innovation, not keeping-the-lights-on work

Although your current team might have easily deployed one or a few instances of a self-managed OSS observability solution, fine-tuning and scaling clusters as your business grows and matures will add even more work to already long to-do lists. Need additional instances (storage, compute, etc.) to scale? That means more on-call engineering support to manage nodes, including maintaining awareness of the data within each node. With a fully managed observability solution, you can lower management overhead and empower your on-call engineers to spend more time on core application fixes and innovation, even as your cloud native data grows. A best-in-class, cloud native SaaS observability platform replicates each data point three times and stores the copies in geographically dispersed regions without IT involvement. This can lead to faster remediation, lower MTTR, and a better customer experience.
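To make the threefold replication concrete, here is a small Python sketch of one possible placement scheme. It assumes each series is hashed to a starting region and copied to the next two regions in a fixed list, which is a simplification chosen for this example and not a description of how any particular platform places data. The region names and function names are hypothetical.

```python
import hashlib

REGIONS = ["region-a", "region-b", "region-c", "region-d"]  # hypothetical names
REPLICATION_FACTOR = 3


def replica_regions(series_id, regions=REGIONS, factor=REPLICATION_FACTOR):
    """Pick `factor` distinct regions for a series, starting at a hashed offset."""
    if factor > len(regions):
        raise ValueError("not enough regions for the requested replication factor")
    digest = hashlib.sha256(series_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(regions)
    # Walk the region list from the hashed offset, wrapping around at the end.
    return [regions[(start + i) % len(regions)] for i in range(factor)]


def readable(series_id, down_regions, regions=REGIONS):
    """A series stays readable while at least one replica sits in a healthy region."""
    return any(r not in down_regions for r in replica_regions(series_id, regions))


if __name__ == "__main__":
    placement = replica_regions("http_requests_total{service='checkout'}")
    print("replicas:", placement)
    # Losing two of the three replica regions still leaves one readable copy.
    print("readable:", readable("http_requests_total{service='checkout'}", set(placement[:2])))
```

With three copies in distinct regions, a read can still be served after two of those regions fail, which is the property that removes node babysitting from the on-call workload.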

5. Provide air cover: cost transparency and customer success teams

When I worked at Disney, we had many tools that represented a capital expenditure, depreciating over time. This model worked very well when we weren’t running thousands of containers, but once we were, we had to develop a simple operational cost model that limited overages. This allowed our business leaders to forecast a solution’s maintenance, support, and cost. With self-managed OSS-based solutions, your engineers can take advantage of open-source contributions. However, when issues that need immediate attention arise, a modern platform with professional support is more reliable. Moreover, high monitoring costs with open-source platforms can force your engineers to make uncomfortable tradeoffs between cost and performance. Show your on-call engineers that you have their backs by investing in a cloud native observability SaaS platform.

As you establish goals for your SRE team, consider incorporating the stability and security of time-tested tools while also leveraging open-source observability solutions to address today’s cloud native scale. Your on-call engineers will appreciate (and maybe even thank you for) the stability, scalability, and openness this new model can bring.

Learn more about cloud native observability

Scaling cloud native can be complicated. When on-call engineers are required to use a self-managed OSS solution based on Prometheus or OTel, or a legacy monitoring solution, they lose visibility and cost control while being locked into vendor decisions. You can significantly reduce on-call engineering stress (and boost job satisfaction!) by transitioning to a cloud native observability platform.

Learn more about cloud native observability and Chronosphere.io.
