Rescuing on-call engineers with observability


How can organizations make on-call shifts less stressful for developers? Join Paige Cruz and Matt Schallert as they discuss their experiences, explore best practices, and highlight how observability can help.

Chronosphere Staff | Chronosphere

On-call engineering is a hot topic as organizations look to maximize developer productivity. All too often, the on-call experience proves stressful and anxiety-inducing, and it can lead to burnout for teams.

So, how can teams gain better support and efficiency while they work to remediate and troubleshoot problems? In this on-demand webinar, Rescuing On-Call Engineers: Unleashing the Power of Observability, Principal Developer Advocate Paige Cruz and Senior Software Engineer Matt Schallert explore common challenges, along with tips and best practices for reducing on-call engineering stress.

If you don’t have time to sit down for the webinar, check out the full transcript below.

Battling alert fatigue

Paige Cruz: For the next hour, I will be your host, Paige Cruz. It was my personal experiences with on-call and alert fatigue that led to my early retirement from site reliability engineering (SRE). Today, I’m a very happy, not on-call, Developer Advocate here at Chronosphere, delighted to be joined by my colleague Matt Schallert, Senior Software Engineer, who has staved off burnout and holds the pager for Chronosphere today, delivering all those nines of availability our customers rely on. 

Today, we’re really talking about the stress that most engineers are living through when it comes to on-call responsibilities, and how observability can save them. But, before we talk about observability, we have to talk about monitoring. The two are distinct concepts that often get confused, and the best way to explain the difference is with an example.

Imagine that you’re just relaxing and reading a good book, when all of a sudden you hear a smoke detector firing in another room. Your first question is probably: “Why is it firing?” All you have to go on from this other room is the beeps telling you that something is happening. It could be just a notification that your batteries are running low and you have to replace them. It’s the same beep as: “Hello there’s a fire.” Kind of confusing. It could be that your roommate did burn toast in the kitchen, and there’s a little bit of smoke in the air, but there’s not a risk of a fire engulfing things.

It could be that someone forgot to run a fan during a steamy shower, and this is a total false alert — there was no danger at all. Or of course, your beloved cat could have started an actual fire in the entryway by knocking over a candle. If this were you, Matt, you’re sitting reading a book, you may know that your roommate’s burning toast often, or you know that there’s some false alarms with the shower, what is your thought? Why would you think this fire alarm is going off? Are you worried? Are you skeptical?

Matt Schallert: I would definitely be worried, because I don’t like fire. I would assume that I had done something and forgotten my toast. My first question would be: “Where is it?”

Paige: That’s a great first question. And really, if you’re thinking this to yourself, you don’t actually know, because you need to investigate. You don’t know where, you don’t know if it’s a false alarm, and you don’t know how urgent this is — are the curtains already on fire, or was it just some toast? This is representative of monitoring, where the smoke detector is looking out for a known signal, smoke particles in the air, but it could be smoke from a literal fire. When we’re taking this metaphor into the engineering world, this could be unavailable services or an impacted customer experience. 

But at the end of the day, a human being is needed to investigate and mitigate any issues. And this follows our troubleshooting loop, when engineers respond to pages: Know, triage, and understand. Matt, is this the loop that you go through when you get either an alert or a page – getting the lay of the land, triaging the scope, understanding and acting?

Matt: Yeah, definitely. And one doesn’t happen without the other two. I’m not going to know to go check how bad things are without the monitoring part of it and knowing first. But also, like you were saying, just knowing that something’s wrong isn’t helpful either. 

Paige: This experience of not knowing if it’s a false alarm, or maybe you have a roommate that just can’t cook and is burning toast every day, and you’re getting lots of valid alerts, but not necessarily indicative of huge problems – this is what we call “alert fatigue,” where either the volume of alerts is just overwhelming and desensitizes you, or there’s just enough false alerts that you start to lose faith that the alert is even telling you something important. 

I’m curious, Matt, have you experienced on-call alert fatigue on a team? And if so, was it just for a few months when you were launching a feature, or was it something indicative of a larger cultural issue? 

On-call alert fatigue horror stories

Matt: I’ve definitely experienced it. Most of the times have been luckily short term where there was an end in sight. Like: “Oh, we are getting faster hardware for this service, or something else.” But yeah, it’s really rough because you start to just have anxiety anytime your phone buzzes, or if you’re going to step away to go make toast that you’re going to burn or something, you’re like: “Am I going to miss a page?” It’s not a fun feeling. 

Paige: It’s kind of like a heart dropping moment. I live in a multi-story house. So, if I leave my phone downstairs, and I go run up to get water, you have that moment of panic, like: “There could be an incident right now, and I don’t even know.” This stuff affects our physiology, for sure. And how bad is this problem today? Well, pretty bad. Earlier this year, our company conducted a survey of cloud native engineers and found that 59% of these engineers said half of the alerts that they receive from their observability solutions aren’t actually helpful or usable. It’s that smoke detector that’s going off, and it’s not telling you any extra detail or helping you narrow down if there’s an issue, and where it’s at. 

This was a surprising statistic to me, because it means that we are wasting so much human capital and attention investigating issues that may not even be there. Our engineering time and brain space is very precious, and we want to protect it. So, if this stat resonates with you, you are in the right webinar.

When we think about alert fatigue, it’s not something that just happens out of nowhere. It’s a bit of a slow progression. And so, this is an anonymized version of what I have gone through on a couple teams in the past – where I join a rotation as a new engineer, and I go to the alerts channel, that’s the first thing I do as a monitoring nerd, and I say: “Oh, there’s a lot going on in here. Is there an incident?” 

People say: “No, those are the normal alerts, don’t worry about it.” I worry about it. Because I like to have pretty good alert hygiene, and I want my alerts to tell me when there’s real issues, not fake me out. So, I kind of put that in my back pocket, I wait to onboard, I join a rotation, and I think: “I will fix on-call because I have to, because now it’s my shared responsibility.”

The consequences of burnout

Paige: But over time, unless there are team efforts or a cultural shift into improving the on-call experience, it’s not a job you can do solo. So, I’ve experienced sliding into fatigue – where I just think: “I have to get through the shifts. I’ve got to pass the pager baton to the next person. And, I’m just kind of living through it.”

Over time, if nothing really changes, or, as Matt said, maybe there’s a new initiative or some greenfield work that’s really critical but doesn’t have its alerts properly tuned yet, that additional load can lead to burnout. And so, that’s kind of where folks check out and think: “Nothing’s ever going to change. Why should I even try?” And you sort of perpetuate this really stressful on-call rotation.

Matt, what’s been the impact if you’ve seen a team member affected by alert fatigue? Is it something that you can pick up from just normal interactions with them?

Matt: I think it’s interesting because I think it depends how far you are on that scale yourself. I’ve been on the same rotation as someone else before where I thought it was fine and they were new to on-call or something. Like you said, they were like: “Oh, is this important? Do I need to fix this?” But I think it’s also kind of sad, because I’ve also seen before, people that do come in bright eyed, like: “I’m going to fix this.” And they slowly just get absorbed into: “Oh, it’s actually not going to change.” It’s heartbreaking when someone asks you and you tell them it just happens. 

Paige: That’s our normal level of alert noise. So, we will share those results: we’ve got a lot of yeses. This is something you’ve personally experienced, and so, no wonder you are invested in how we can change this for your organization or your team. We have a couple folks who have been lucky enough to skip the alert fatigue, but they’ve seen their teammates go through it, which is just not a fun experience, either. 

Once we’re at that final edge of alert fatigue turning into burnout, how does that really affect your organization? What are the consequences there? 

  • Engineers that are burnt out are, in general, less productive. They’re exhausted, it’s hard for them to meet deadlines, and projects may start to slip. The overall capacity of your team to deliver work is lessened, because this person cannot operate at the level they’re used to.
  • Not only does the quantity of work decrease, but also the quality. This is where, unless you’ve got really strong PR checks or a very robust testing system, if you’re relying on engineers to do a lot of manual verification or hold a lot of checklist state in their head, this is when burned out engineers can start to forget those details. What that really results in is a lot of hot fixes or rollbacks. Your team is spending less time delivering and more time fixing work that has already gone out the door.

Matt: This one stands out to me so much because it’s such a spiral of doom, a self-fulfilling thing: Your engineers are burned out, so they introduce more defects, and that causes more time responding to issues. This is a really dangerous one.

Paige: And this is where it really starts to affect not just the individual engineer who’s burned out, but that whole team has to kind of come in and do that support work and fixes. And, our unfortunate final phase of burnout:

  • You have a fork in the road where you can decide to stay at the organization. And if you do, the issues that have caused your burnout are likely not getting addressed. You will continue to decrease in quality and quantity of work, and in general, affect team morale. 

I definitely know in the worst stages of my burnout, I was not the best teammate that I could have been. I did not have a lot of spare capacity to do the typical mentoring or the ad hoc PR reviews, and so it did not just affect me, it affected my entire team and the people I collaborated with. 


How to address the broken troubleshooting process

Paige:  We’ve painted this picture of doom and gloom, but I promise you, we did not come here just for scary stories around the campfire. We want to talk about how you address this problem. And really, our antidote to this is including and increasing observability. So, we’ve talked about monitoring the smoke detector, kind of just that signal to say are there smoke particles or dust particles in the air, yay or nay.

What is our difference between monitoring and observability? Well, going back to our lovely example, in this case there are some new smart detectors out there which are pretty awesome, with enhanced capabilities like an audio and video feed of the room where the smoke particles were detected, as well as different levels of urgency for notifications. The info: You’ve got to change the batteries. The warning: We see an increase in smoke particles, but you’ve still got time to grab Beatrice and get out the door. Or that red: We are past that point, you’ve got to leave now. Total urgent alert.

And what this would look like, instead of the beep, beep, beep interrupting your lovely novel, is a nice ding, because we can see it’s a yellow warning, we’re not in the red yet, and a pleasant, calm voice that says: “Heads up, there is smoke in the entryway.” All of a sudden, your notification comes with the context of what room it’s in. And maybe, if you know it’s the kitchen, and you look at that video feed and see Benny in there making his grilled cheese, you know not to worry about it, and you never had to get up from where you were. The video feed is also helpful if Beatrice really did knock over a candle: you could start to see the flames going up the curtain, and since you’re still in the yellow zone, you’d have more time to act, which is very precious in the heat of a fire incident, or a computer server incident, as well. The more time we have to be proactive about mitigation, the less stressful the investigation is and the better the results for customers and yourself.

So, it’s really not monitoring versus observability, but it’s monitoring and observability together. Matt, I don’t know if you’ve ever had an engineering job where you’re not on-call. It seems to be just part of the normal responsibilities of a cloud engineer. 

Finding zen and the “follow the sun” model

Matt: I’ve had the first few weeks of a job where I wasn’t onboarded to the rotation yet, but I’ve never had one where I wasn’t on-call. And honestly, I do think that’s a good thing, because I think it gives people more of an understanding of how systems work as a whole, as opposed to just their little siloed view of it. But no, I’ve not yet found the zen of not being on-call.

Paige: No pagers. I think the closest to Nirvana that I’ve heard is the “follow the sun” model — where, if you’re lucky enough to work at a globally distributed company, your on-call hours are just your normal working office hours. That was always the dream for me. 

But at the end of the day, whether it’s during office hours or in your personal nights and weekends, we can’t really escape on-call for our systems today. And there are actually a lot of other things I’d rather be doing than being on-call: I would rather be spinning wool, playing with angora bunnies, spinning yarn. Matt, what is going on with the Jeep? You’ve got a lot of activities that are not responding to SEV1s.

Matt: I moved to the Pacific Northwest two years ago, so this is my Seattle starter pack of my dog, my bike, and beer. The Jeep was our one-off Safari, but the rest of them are definitely my normal activities. 

Paige: I love it. And then, we really want to show there is a human impact to on-call. I am not an engineer anymore. I decided to leave that career path because of my experiences with burnout. And we want to protect the Matts of the world and the other Paiges out there from this burnout fate.

Why invest in observability?

Paige: So, why invest in observability? Well, we’ve talked about how monitoring alone doesn’t give on-call engineers that necessary visibility to quickly know, triage, and understand. Our status quo approach is clearly resulting in fatigue and burnout, jeopardizing your organization’s goals. And observability helps this, by bringing the relevant context to your existing monitors, so that people can very quickly troubleshoot. That’s the why. Now, we’ll move on to the how. Tactically, how to get you there.

We’ll start with a quick poll, because one thing engineers have to contend with is using a variety of different observability platforms and tools. It may not be that one smoke detector in the corner, but it may be five different sensors set up all through different ecosystems.

Matt, I’m curious: what’s the largest number of monitoring and observability tool sets you’ve had to contend with before?

Matt: Definitely in the five plus bucket, might have been six or so. This was also at an organization that had a reputation for kind of every team building something their own way. There was a lot of overlap, and old systems that the migration hadn’t completed. But there definitely were at least five in play.

Paige: That is wild. I thought I had a lot juggling three, a mix of vendors and in-house. At one company, I was responsible for our observability function. Two quarters into that job, I found out about an entirely different tracing tool we had that nobody had thought to bring up to me, because like you said, it was one team’s pet project. That is a lot of places to go to answer: “Why is this alert firing? Is it valid? And what’s our customer impact?” 

Matt: And like you said, the five plus were just the ones that I knew about. I’m sure that there were others built off somewhere that I was thankful to not even be aware of. 


The paging experience

Paige: Matt, let’s put it all together. Imagine that you’ve been paged. Walk us through your thought process and how you either use this linking or what your approach is to starting that investigation.

Matt: I think usually, the first thing I want to know, and there are a number of signals that you can check for this, is like, how bad is this thing? That kind of triggers: “Okay, am I going to finish making my coffee and then I’ll go check it” or: “We’re in full outage mode, rally the troops.” I think a lot of that comes from just understanding the scope. So, is it like: “I’m having a few errors on one endpoint of one service? Is it because this entire service is throwing errors? Is it some common dependency across services?” Usually, I can actually get that with Chronosphere monitors just because of the way that we do grouping and everything. So, that kind of gives me an idea of that problem off the bat, and then from there I want to understand, especially in the case where it’s like: “Oh, okay, one thing is kind of just having some issues, are there any stack traces I can see?” At that point, it’s really drilling down into a kind of very specific problem. 

Paige: Those linkages really help. And when you were contending with over five different tools, did you have great linking between signals and tools or were they siloed?

Matt: No, there was none. And honestly, that was one of those things where, sometimes you hear people that are like superhuman debuggers. I think so much of that is just knowing where to look. When your tools do that linking for you, that is such a key thing of: “Oh, I don’t have to know the horrendous log query to find this thing, because my tool will at least point you in the right direction.” And then from there you can drill down, but it’s so much better than: “Oh, I have to go to the firehose and figure out what is relevant within it.”

Paige: One way is linking your existing telemetry. Another way is looking into deriving metrics from your logs, or your spans and traces. And why I love a trace-derived metric in particular: say I’m on the checkout team and I’ve got this ordering service. I know that checkout time shouldn’t really go above a second and a half, and it definitely shouldn’t be longer than two seconds. I can very easily construct a query for traces over two seconds, maybe with or without errors, and create metrics from that. From there, you can alert on the metric, you could warn, you could do notifications. Metrics are very powerful, and I think they’re sort of undervalued in the observability community.

I know I used to do that because I love traces, but from here it’s really one click. You’ve got the query already built, maybe from an investigation you were doing, and it’s one click to create a metric from there.
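Paige’s trace-derived metric can be sketched in a few lines. This is a hypothetical illustration, not Chronosphere’s implementation: spans are plain dicts, and the `checkout` span name and two-second threshold are stand-ins for whatever your ordering service actually emits.

```python
# Hypothetical sketch of deriving a metric from trace data.
# A real pipeline would read spans from your tracing backend.

SLOW_THRESHOLD_MS = 2000  # checkout shouldn't take longer than two seconds

def slow_checkout_count(spans):
    """Count checkout spans that exceeded the latency threshold."""
    return sum(
        1
        for span in spans
        if span["name"] == "checkout"
        and span["duration_ms"] > SLOW_THRESHOLD_MS
    )

spans = [
    {"name": "checkout", "duration_ms": 350},
    {"name": "checkout", "duration_ms": 2400},   # slow: counted
    {"name": "list-items", "duration_ms": 9000}, # different endpoint: ignored
]
print(slow_checkout_count(spans))  # prints 1
```

The resulting count is exactly the kind of metric you could alert or warn on, without ever writing raw Prometheus queries against latency histograms.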

Continue this portion of the chat at minute 27:10. 

Increasing observability in the planning process

Paige: Our next tip, Matt, this one came from you: Increasing observability by including it in the planning process. How have you seen this done in documentations? Is it a conversation? How do we include this as early as possible into our life cycle? 

Matt: I think the easiest thing is, a lot of orgs already have some sort of RFC process or architecture decision record … There’s typically some template that you publish your document based on, whether you’re creating a new service or a new feature. Just including the question there, “How will you know if this thing is working?”, really makes people think about it. And depending on how mature your practices are, the answer might be: “Oh, we have a common service framework that’ll generate alerts for us, and this is a simple CRUD service that just has some stateless APIs.” Or the answer might be: “Our system maintains an internal queue and we have to alert on how long it is.” Just thinking about that up front means you won’t have to go back and retrofit observability onto whatever you build.

Paige: Even just asking that question, if it’s not been asked before in that phase, is really great to kickstart conversations. This one is a particular bugaboo for me: Quieting the page storms. Sometimes, if your alert signals aren’t grouped appropriately, you can already be investigating an incident and continue to receive pages about the very thing you’re investigating. And while I would love to say, let’s just perma-mute those, or ignore the phone and focus on the investigation, when you do that, it escalates to your secondary. Or in some orgs, you’ve got that tertiary, which might be a VP or a CTO. I have been there when it escalated to the CTO. It is not fun stuff, and just heaps and heaps more stress if you’re having to quickly act on things. So, at Chronosphere we have some alert grouping features that help suppress that noise and let you really figure out which groups of things are wrong.

We’ve got alerts per monitor. So, maybe you’ve got a host not reporting in. Well, if your cloud provider has a regional or AZ outage, do you really want to be alerted for every single node that’s not reporting there? Or would one “US West 1 is down, hosts not reporting” alert be sufficient? So, there’s per monitor; there’s per signal group, where maybe you’ve got that one alert but broken up into regions or AZs, so you can logically group things together; and then there’s the sort of mega option that will page for every single time series that crosses the threshold. You’ve got a lot of options for making sure that the signal-to-noise ratio is good for your setup.
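The grouping idea above can be shown with a minimal sketch, assuming alerts carry label sets the way Prometheus-style alerts do. The `region` label and alert shapes here are hypothetical, not Chronosphere’s internal model:

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Collapse firing alerts into one entry per label value, so a
    regional outage produces one page instead of one per host."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[alert["labels"].get(group_by, "ungrouped")].append(alert)
    return {key: len(members) for key, members in groups.items()}

alerts = [
    {"labels": {"region": "us-west-1", "host": "node-a"}},
    {"labels": {"region": "us-west-1", "host": "node-b"}},
    {"labels": {"region": "us-east-1", "host": "node-c"}},
]
print(group_alerts(alerts, "region"))
# {'us-west-1': 2, 'us-east-1': 1} -> two pages instead of three
```

Grouping by `host` instead would reproduce the noisy per-node behavior, which is exactly the dial the grouping options above let you tune.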

Continue this portion of the conversation at minute 32:12

Expediting the incident process

Paige: All right. And finally, bringing this all home. We’ve talked about responding to the alert. Now, we’re going to move on to ways that you can expedite the incident response. And I know, Matt, that you’ve got a couple tales of the time where you either took down production or you were helming an incident. You want to share one of your favorites with us today? 

Matt: The worst screw-up I had was when I messed up the configuration of the sudo file across an entire fleet. Suddenly, all of these CronJobs and things couldn’t run on hosts, and I think it got up to 60% of the fleet before I realized, and then found a host where sudo still worked that I was able to use to jump in and revert it. We did not get an alert on that until someone said: “I can’t run a command that I have to.” To top it all off, I was an intern. So that’s the one that sticks out to me the most.

Paige: My favorite story, which I’ve taken on the road to a couple of conferences, is that I had already been on-call for five years. All my friends had stories about the time they took down prod, and I just felt like my incident was lurking in the shadows, waiting to surprise me. I was making a change to increase observability: I was trying to change our traffic logs from unstructured text that I couldn’t run queries on to beautiful structured JSON logs. Not a really big deal, it’s a one-line config change. But, through the layers of Argo CD app-of-apps and GitOps repos pointing to each other, I accidentally overwrote, for production only, the security sidecar that actually admitted requests from the outside world into our clusters.

It was one of those immediate, every-team-is-reporting-that-their-alerts-are-going-off incidents. I was so upset, because it had worked in all the other environments, but there’s really no place like prod. Nothing is ever an exact replica, and you’ve got to be prepared for that.


Investigating and mitigating

Paige: I like to have as much lead time as possible to go investigate and mitigate. That really helps bring the stress down. That is definitely one of my favorites. Another one we had talked about, which I see at a lot of companies that have a bit more mature observability strategy, is taking the common languages and frameworks and having a central SRE, platform, or internal tools team package up monitoring: alerts, dashboards, metrics, traces, logs, and all that good stuff, just out of the box.

Matt, what types of orgs have you seen this be successful in, and were you there when it was being rolled out, or was it already packaged up when you joined the rotation?

Matt: I’ve seen both. I’ve seen it when it’s already in place and when we’re trying to put it in place. I think, like you said, you do have to have something common, whether it’s a common RPC framework that your org uses that emits the same metrics, plus some templated alerts, or maybe your org is mature enough to even have some serverless microservice abstraction where you put some files in GitHub and it takes care of packaging everything for you. Either way, some sort of standardization is necessary. But if you do have that, then once a service goes out to prod, the alerts will be templated out and you’ll get alerted if something’s wrong.

Actually, a way that we do this at Chronosphere is using labels to indicate the team that owns the service that’s emitting metrics, so uncaught service errors, or something like that, will still go to the team that owns them, based on those labels.
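The label-based ownership routing Matt describes might look something like this minimal sketch; the `team` label, channel names, and hardcoded mapping are hypothetical stand-ins for however your org ties service metadata to notification targets:

```python
# Hypothetical team-to-channel mapping; real ownership data would
# come from service metadata, not a hardcoded dict.
TEAM_CHANNELS = {
    "checkout": "#checkout-oncall",
    "platform": "#platform-oncall",
}

def route_alert(alert, fallback="#oncall-catchall"):
    """Pick a notification target from the alert's `team` label,
    falling back to a catch-all channel for unowned alerts."""
    return TEAM_CHANNELS.get(alert["labels"].get("team"), fallback)

alert = {"labels": {"team": "checkout", "alertname": "UncaughtServiceErrors"}}
print(route_alert(alert))  # prints #checkout-oncall
```

The fallback channel matters: an alert with no owning team should still reach a human rather than disappear.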

Paige: I like that. I see a lot of value, coming from the SRE side where I’m a lot closer to, say, the Kubernetes infrastructure or the CI/CD system, in being able to use my subject-matter expertise for those layers to give engineers the sort of monitoring and views they need to troubleshoot, without them having to know all the gory details of how things are tied together or what time window to alert on. I found that to be very good enablement when you’re crossing the ops-to-software-engineer boundary.

Not all observability vendors are created equally

Paige: And our next tip: Choose your observability vendor wisely. Of course we were going to throw this in there. Not all vendors are created equal, and if your tool is not up to the task, you’re spending lots of cycle time on it, or maybe it’s sluggish and slow. Definitely start making the case, and gathering the data you need, to get that vendor out of there and figure out who might be a good choice (perhaps Chronosphere!).

Like we said, being on-call is a fact of life for cloud native engineers, and really for all software engineers today, and it should not be a burden. It should not be taking Matt away from his adorable dog, or me from snuggling angora bunnies. So, make sure that the vendor you have today meets your needs. A vendor might have met your needs in the past, but if they haven’t grown with you and they’re not supporting you, look around. There are lots of options these days.

And we’re getting to our end here. On the note of vendor support, something that really helps is having your telemetry and observability set up from the engineer’s point of view. We tend to think of things in services. We tend to think of things in teams. I may be getting alerted about something that Matt’s team was doing, and it’s really handy to know who that team owner is and to be able to take a look at some common dashboards. The more you can enable investigations across team boundaries, the quicker you can jumpstart those investigations. Maybe Matt’s team had a problem that they did not notice; if I can go to their team’s page and send them a couple links to dashboards or queries based on what I discovered, that helps them get started resolving issues. I sort of think of it as a service catalog, though I think that’s a bit of an overloaded term now. But just have your tool think in the same way engineers do.

Annotating your monitors

Paige: Our almost final tip is to annotate your monitors. Matt, do you believe that every alert needs a runbook? I’m pretty extreme in saying yes. 

Matt: I don’t think so. I guess it depends on the alert, but I’ve definitely seen runbooks before where someone was like: “Oh there’s a linter that says I have to have a runbook, so I just created it and it’s the title of the alert and then deal with it is the body of the runbook.” So that’s why I think I’m a bit cynical. 

Paige: Okay, that’s totally fair. We’ll go through our next couple tips. In terms of on-call onboarding, how important is your onboarding period to being a successful on-call investigator, Matt? 

Matt: I would say it’s crucial and also I think having it be realistic is important. My favorite approach is someone shadows on-call for a few days or a week, or something, and then spends most of their onboarding reverse-shadowing, where they’re paired up with someone who is also on-call and expected to have the same reaction time as the person onboarding, but they are the ones that are kind of like the front line dealing with the alerts, with the support of someone else who’s been on the rotation. I feel like it gives them a lot more sense of confidence and ownership in dealing with issues.

Paige: Totally, and that’s a really great time to get a fresh set of eyes on your rotation. Maybe you’ve become normalized or desensitized to alert noise that fresh engineers pick up on. That on-call overwhelm is really good data to capture and bring along if you’re looking to make changes to on-call.

Explore your system architecture

Paige: And our final tip: practice incident coordination without the stress of an actual incident, learning the ins and outs of how to find data in your observability tooling, and even just exploring the system architecture in a more real-world sense. I have had great success with Game Days, using ToxiProxy to simulate external dependencies being down, being exceptionally slow, or adding jitter. There are a lot of fun ways to break your system, and when you can do that in a low-stress, probably-not-production environment, it can really help hone these skills.
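For the ToxiProxy Game Days Paige mentions, toxics are created by POSTing JSON to Toxiproxy’s HTTP API. Here is a small helper that only builds the request body for a latency toxic; the field names follow the Toxiproxy documentation, but verify them against the version you run, and the proxy name is hypothetical:

```python
import json

def latency_toxic_body(name, latency_ms, jitter_ms):
    """Build the JSON body for a Toxiproxy latency toxic.
    Field names per the Toxiproxy HTTP API docs; check your version."""
    return json.dumps({
        "name": name,
        "type": "latency",
        "stream": "downstream",   # inject on traffic flowing back to the client
        "toxicity": 1.0,          # apply to 100% of connections
        "attributes": {"latency": latency_ms, "jitter": jitter_ms},
    })

# A Game Day script would POST this to /proxies/<proxy-name>/toxics
# on the Toxiproxy control port.
print(latency_toxic_body("slow-payments", 1000, 500))
```

Dialing `latency` and `jitter` up gradually lets the team practice spotting the slowdown in their observability tooling before it becomes a full outage.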

Continue to the Q&A portion of this conversation at minute 48:18. 

Additional resources

Curious to learn more about Chronosphere, on-call, and incident response? Check out the following resources:
