Here are 3 keys to faster application troubleshooting

Observability experts deep dive into common cloud native challenges, and how you can help your engineers avoid lengthy bridge calls.

Sophie Kohler | Content Writer | Chronosphere

Sophie Kohler is a Content Writer at Chronosphere where she writes blogs as well as creates videos and other educational content for a business-to-business audience. In her free-time, you can find her at hot yoga, working on creative writing or playing a game of pool.

39 MINS READ

Tired of endless bridge calls that keep you up at night? Today, modern teams and organizations are facing mounting complexity in a cloud native and microservices world. Oftentimes, this results in the dreaded 400-person bridge call –  where engineers may find themselves facing long application troubleshooting sessions, scrambling to diagnose or resolve an incident. 

In this blog, learn three tricks from Chronosphere’s observability experts for speeding up problem diagnosis: 

  1. Aligning your priorities 
  2. Embracing change 
  3. Adopting hypothesis-driven troubleshooting 

You can check out the on-demand webinar here, or catch a transcript of the conversation below. 

Current challenges we see our customers face

Julia Blase: Thanks everyone for joining us today. I’m Julia Blase, really excited to have you. Here’s a quick overview of what we’re going to talk about in this webinar. So, first, we’re going to talk a little bit about the market landscape and challenges moving to cloud native and working in microservices environments. We spend a lot of time talking to people in this industry about their problems, and what they’re doing. We seem to find that they fall into kind of three main groups, and we wanted to share what we’ve learned with you all, and hear how that resonates with you as well. We’re going to end that section with kind of where we see people getting stuck today, which is, as we call it, the dreaded 400-person bridge call.

We talk to customers who say they do end up in this kind of situation once or twice a week. Obviously none of us want to be there, and we want to talk through how that comes to be the case. And then, we want to move into how you can avoid that. So we see three keys to faster application troubleshooting, and really avoiding that bridge call altogether. Once we go through those, we’re going to touch briefly on how Chronosphere can help, and how the tools we build help address those three key components of faster troubleshooting. We’ll give you a quick demo of what that looks like in action. All right, let’s go ahead and get started. So first, let’s talk about the market landscape and challenges.

Today, what we see is companies operating with a ton of really high scale workflows, and that is high scale both in infrastructure, just like running on a very large set of infrastructure to handle the load, and high scale in terms of users or customers, right? So, if you’re facing a large customer base, you’ve got a lot of individual users who can all have different experiences with your platform. That can also be high-scale in terms of the data that you’re generating. If you are going through a typical day, you might generate petabytes of data where you’re used to generating terabytes. And in this new world, what we see people struggling with are three kinds of problems.

  1. Poor reliability: You have a system, you are trying to manage this high scale workload, and it goes down at exactly the wrong time, right? You’re an e-commerce retailer, it’s Black Friday, and all of a sudden in the middle of Black Friday you lose insight into what’s happening, which can cause lost revenue.
  2. Data explosion: We see data exploding, like I said, both in volume and in complexity, higher cardinality, higher rates of change, which can be difficult to deal with. Even if you do have a system that’s very reliable and standing up, if you’ve got a lot of data in there and you don’t know what to do with it, that can be a huge challenge. 
  3. Change: We see things changing because developers are being asked to fill more of that DevOps role. In the past, those roles were separated, but we hear a lot of people talking about the shift left. Individual developers are being asked to do on-call platform rotations and monitoring, and are responsible for system health. But it’s becoming increasingly hard for them to do that effectively. Why do we think this is happening? We associate a lot of these challenges with the transition to cloud native development. Before cloud native, we often saw companies operating monoliths in local clouds or on-prem, right? Monoliths are predictable, they’re slow to change, they’re pretty easy to understand (there’s just the one monolith), and in general they were honestly pretty reliable because of all those factors.

But, unfortunately, the downsides of monoliths were that they were very inflexible. You could only scale up by so much at a time. Scaling up was costly and expensive, and scaling down even more so. They were also slow to change, right? Every time you wanted to make a change to a monolith, every test had to pass again, so changes took a long time to plan. You were doing more waterfall development, because every change incurred a lot of expense just for change management. And they were also really costly to run.

In a world of fluctuating demand, where we’re more plugged into customers and we have more dynamic loads in our system, monoliths became very expensive to manage against that load. The market demands flexibility and efficiency, and so in response to that, we see people moving away from that monolith structure and into more cloud native environments.

Finding success in a microservices environment

Julia Blase: We all want to be building code. How, given this microservices world and given all of these changes, do we manage this? So at this point, Nate, I think I’m going to hand it over to you and you’re going to talk a little bit about what we’ve learned in terms of keys to success in this new environment.

Nate Heinrich: Thanks, Julia. All right, let’s get into it. The first thing we’re going to talk about is aligning your priorities. And I like to start this with just a question: Would you rather alert on CPU, memory, network, and disk usage of your service, or when your customers are actually having a bad experience? Now, that’s a bit of a false dichotomy; the answer is probably a little bit of the former, but mostly the latter. So this is what I’m going to be talking about: most of you will probably answer that you want to get notified when customer experience is suffering.

But how do we do that? We know that customer experience needs to be on target. We know the stakes are high and we know that those moments where we miss a good experience, there’s an outage, there’s something that’s impacting customers consuming our products and services. Those are the moments where they’re thinking to themselves: “Maybe I should go elsewhere. Maybe I should use something else.” 

And those are the things we want to avoid, but it can be tricky to determine how to know when they’re having a bad experience and what we should target as a good experience. It comes down to understanding what’s good enough. What should we aim for? And a lot of the time, your first instinct is: “Oh, we should be perfect.” But it turns out that isn’t actually the right answer. We’ll talk a little bit more about that. And you need to understand, what level of performance your customers expect, what level of reliability we should target for our systems so that we can navigate that space between perfect and the expectations of our customers.

Digging into that first instinct of: “Oh, we need perfect customer experience.” I think this is one of the most fascinating things to really understand and internalize, because 100 percent reliability is not the answer. When we lean toward that as our first target, you really have to understand what it costs in terms of time, effort, and also morale, and compare it to maybe other industries.

The myth about 100 percent reliability and maintaining your SLOs

Nate Heinrich: A couple of things to consider: The first is, there’s some really good research around the cost of every nine that you add to your reliability. Roughly speaking, each nine you add in terms of “I want to be 99.9 percent available” and providing a good experience is 10X more costly for your team to achieve. So, 10X more in terms of time, effort, and cost. You can quickly see the cost-to-benefit balance: the more time you spend on reliability, the less time you’re spending on building the things customers want so your product can evolve.
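
To make that concrete, here is a minimal sketch, just simple arithmetic in Python rather than anything from the webinar, of how small the allowance for bad minutes gets with each added nine. The 30-day window is an illustrative assumption.

# A rough illustration (not from the webinar) of how the allowance for
# "bad" minutes shrinks as you add nines to a reliability target.
WINDOW_MINUTES = 30 * 24 * 60  # 43,200 minutes in an illustrative 30-day window

for target in (0.99, 0.999, 0.9999, 0.99999):
    allowed_minutes = (1 - target) * WINDOW_MINUTES
    print(f"{target:.3%} target -> {allowed_minutes:,.1f} minutes of unreliability allowed per 30 days")

# 99.000% -> 432.0 minutes (about 7.2 hours)
# 99.900% -> 43.2 minutes
# 99.990% -> 4.3 minutes
# 99.999% -> 0.4 minutes (about 26 seconds)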

There’s a common comparison to help folks be okay with less than 100 percent, and it’s this interesting fact about pacemakers. Pacemakers seem like they should be 100 percent reliable, but in fact they’re not. You can look this up on Wikipedia: in 2005 there was research done, and it turns out that pacemakers are 99.6 percent reliable, and we as a society are okay with that. There isn’t a huge problem with something so critical having a reliability of 99.6 percent. So, this is a commonly used fact. Understand that 100 percent doesn’t need to be the target. Then the questions become: how might you figure out what your target is, and what can you do with it once you understand it?

So, we’ll talk about two types of work that are in the balance here. Only working on reliability doesn’t seem like the right answer: your customers won’t feel like you’re listening to them, and the competition can get ahead. Those are the two things that could happen if you only focus on reliability and target too high. It’s also clear that only working on innovation, new features, these types of things, isn’t the right answer either; reliability will suffer if that’s the case. So, I think there are two ways folks can think about it, depending on how your brain works: “I want to work on just enough reliability to maintain my service level objectives,” or “I want to work on as much innovation as I can while maintaining my SLOs.” Both of those things can be true, and whichever of those framings feels natural to you, that’s the way to go.

Julia Blase: Yeah, and I’m just reminded [that] I often hear people say the most reliable system is the one that has no users. And that’s also the system that makes no money. I love that. If you keep that in mind, when you’re building these things, there’s always a trade off, right? You have to figure out what works best for your customers, for your specific business. Because you still want them. Got to get those customers. That’s how you’re going to stay in business.

How to prioritize your customer needs

Nate Heinrich: That’s right. I love it. So I think I’ve convinced you that less than a hundred percent is the target, but what is it? And then how do you pick the things that your customers care about to measure, to ensure that they’re getting that good customer experience? There are a couple of ways you can go about this, and we’ll talk a little bit later about some statistics. A lot of people have adopted service level objectives and measuring that customer experience. If you’re new to it, that’s totally fine. You can look at the competition. You can look at adjacent markets. There are expectations out there that you can draw from to get you in the ballpark. Even better, talk to your customers, talk to your end users.

It is a discussion, right? It is a friendly negotiation because what you’re trying to figure out is, what are you willing to tolerate in terms of a customer experience? We want it to be good for you at all times. And we also want to not spend all of our time on that reliability so that we can spend more time on innovation and get you the things that you’re asking for, being able to take those feature requests and spend as much time as we can on them and get those into your hands as quickly as possible.

These are the two things that I start with. When you get those parameters from your customers and start laying them out, you get that picture of: “Okay, this could be our first SLO. This is the thing that we are going to react to when there is a problem.” And to illustrate very clearly what these parameters look like, consider an SLO that one of our customers is actually targeting. It’s a pretty common one, right? You’ve got a few pieces of information here, a few things that you have to decide on.

We’ll walk through it so you can get a sense of it; this is one example of a few that we’d need to create in order to really monitor that customer experience. It’s a percentage, it’s a thing that customers care about, it’s a threshold that determines what is good and what is bad, and then a window of time. So, you can think of it as Mad Libs-ish, right? You’ve got to fill in the blanks, and you should do it in conversation with your customers. In this case: 95 percent of page loads are under 200 milliseconds for the last 30 days. Latency is a big one. We all are customers of many technologies.

I’m sure if you pull out your phone to use any application, that responsiveness is a big deal. So it tends to be one that folks care about. And so, what we’re saying here is that 95 percent of the time, the things coming back that you asked for are going to be under 200 milliseconds, and we’re going to measure that over a rolling 30-day period so we can keep track and make sure that is always the case. And the 5 percent on the other side of that 95 percent is what you can consider our budget: the amount over that time period that we’re allowed to miss.
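
To make that Mad Libs structure concrete, here’s a minimal sketch in Python of the SLO described above: 95 percent of page loads under 200 milliseconds over a rolling 30-day window. The class and field names, and the in-memory list of latencies, are illustrative assumptions, not a Chronosphere API.

from dataclasses import dataclass

@dataclass
class LatencySLO:
    objective: float      # e.g. 0.95 -> 95% of requests must be "good"
    threshold_ms: float   # a request is "good" if its latency is at or under this
    window_days: int      # rolling evaluation window

    def compliance(self, latencies_ms: list) -> float:
        """Fraction of requests in the window that met the latency threshold."""
        if not latencies_ms:
            return 1.0
        good = sum(1 for latency in latencies_ms if latency <= self.threshold_ms)
        return good / len(latencies_ms)

    def is_met(self, latencies_ms: list) -> bool:
        return self.compliance(latencies_ms) >= self.objective

slo = LatencySLO(objective=0.95, threshold_ms=200, window_days=30)
print(slo.is_met([120, 180, 250, 90, 140]))  # 4 of 5 under 200 ms -> 80% < 95% -> False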

Defining the right SLOs for your team

Julia Blase: I wanted to chime in there with one other comment. I think what you’re talking about here, being able to work on and define that with customers, also helps you define it for your organization. I’ve seen customers have this realization where they start to find good SLOs, especially on latency, which feels so qualitative. People can write in and say: “Hey, your page feels slow.” And it’s, what do you mean, “feels slow”? That can be really hard to debug and respond to as a developer or on a product team. I see people over and over starting to really define these, and then the next time a ticket comes in saying checkout felt slow, they can actually verify that. They can track down that specific customer ID or request, look at how it fits into those parameters, and say: “Please scramble the jets to go fix something,” or “You know what, it was actually 192 milliseconds. So while it might’ve felt slow, that might just be you.”

And I don’t have to worry about this. I don’t have to care. I don’t have to wake everyone up to come try to fix something that’s not actually happening, that’s just in the customer’s head. So, it’s a great proactive measure, but it’s also a great defensive measure that I see people using to help them focus on building new things instead of “boy who cried wolf” over-responding every time someone says it feels slow.

John Potocny: Having objective measures for this stuff to ground the team’s prioritization is a great call out. Something that was on my mind as we were talking about this is thinking about who your customer is. That’s also an interesting conversation to have here, because when you have a heavily distributed system with a lot of microservices, teams are probably going to have services that are not directly customer facing.

That doesn’t mean that this is a bad framework to use. It just means that your customer has a different scope. It’s the things that are calling your service, right? And that’s an interesting thing to make sure you’re keeping in mind and understanding that we’re not talking about just the front end, but also individual teams can use this to define the reliability of their services from an internal standpoint too.

Nate Heinrich: That’s right. And so, I use the term customer pretty loosely, but we all have customers of our service. So, it could be internal teams that rely on you as a microservice to a larger thing that actually does contribute to something that end users interact with. But yeah, we all have customers, whether it’s internal, external, and treating them the same ends up with the healthiest system that is best able to deliver a good customer experience.

Understanding error budgets and burn rates

Nate Heinrich: Awesome. So let’s talk a little bit about that 5 percent. We’re targeting that 95 percent of page loads are going to be fast, so what is that 5 percent? What does it look like? How do we think about it? Because really, that 5 percent is your budget, and budgets are meant to be spent, right? You don’t have to be precious about this. This is the amount of time that you can have some slow page loads, and that’s okay. So, similar to “cry wolf,” this is going to happen, but it gives us some insulation so that we can focus on that innovation. Generally, it’s a rolling period; I see four weeks or 30 days very often. And the best practice is to think about it as a rolling budget: if you have no issues, you gain back some budget over time, which gives you a little bit more to spend. So think about that 5 percent as the budget, and then the thing you watch in order to know when you should respond is how quickly you’re burning that budget at any given time. Like I said, it’s okay to spend the budget, but if you start spending it too quickly, that’s when you want to know.

And so, you end up hearing these terms like burn rates. As my 5 percent over thirty days starts to dwindle too quickly, that’s when I want to react, that’s when I’ll get alerts saying: “Hey, if this rate of burning your budget is sustained, you’re going to run out in two hours, one hour, even less.” Those are the things that you’ll set up with your SLOs to be aware of.
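
Here is a minimal sketch in Python of the burn-rate idea under the 95 percent / 30-day example from earlier: the 5 percent error budget, how fast recent traffic is consuming it, and a rough projection of when it would run out at that pace. The function names and numbers are illustrative assumptions, not Chronosphere’s implementation.

WINDOW_HOURS = 30 * 24   # rolling 30-day window
ERROR_BUDGET = 0.05      # 1 - 0.95 objective: the fraction of requests allowed to be "bad"

def burn_rate(bad_fraction: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent.
    A burn rate of 1.0 would use up the budget exactly at the end of the window."""
    return bad_fraction / ERROR_BUDGET

def hours_to_exhaustion(bad_fraction: float, budget_remaining: float = 1.0) -> float:
    """At the current pace, hours until the remaining budget (fraction of the
    full budget still unspent) is gone."""
    rate = burn_rate(bad_fraction)
    return float("inf") if rate == 0 else budget_remaining * WINDOW_HOURS / rate

# If 20% of recent page loads are slow, the budget burns roughly 4x too fast and
# would be gone in about 180 hours -- worth an alert long before it actually runs out.
print(burn_rate(0.20), hours_to_exhaustion(0.20))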

Now we’ve got an SLO, we’ve targeted some SLIs, and we’ve got a 5 percent error budget that we’re reacting to. What can we do with this? Whether you’re thinking about it in terms of working on just enough reliability to maintain your SLOs, or working on as much innovation as you can while maintaining your SLOs, both are totally valid. Now your service has a clear way to tell you if it is in or out of budget, and that data can help you very clearly make the argument for focusing on reliability: we are in danger of running out of budget, or we already have, and we need to spend time on reliability to maintain customer trust and keep that customer experience. That pretty much wraps up SLOs as a how-to and best practices, but I did want to spend a little bit of time on some interesting facts about what we’re seeing in the industry. These were pulled from our friends over at Nobl9, who have a great SLO report. Just wanted to talk about a few of these really quickly:

  1. Eighty percent increase in focus on system reliability. Reliability is in many ways the single most important feature of any system, and we’re seeing that in the data in terms of the initiatives these companies are focusing on this year and next year. Also, 79 or 69 percent of companies already have SLOs, so a huge amount of the market is already using them. I think the degree and depth vary quite a bit, but a vast majority of companies, especially cloud native ones, are already using SLOs. Love to see that.
  2. Ninety percent indicate SLO adoption is driving better decision making. That was the last thing we were talking about – being able to see those error budgets and being able to make objective decisions about where you spend your time.
  3. Ninety-seven percent report that it’s difficult to manage SLOs. Over half of the companies that are already using SLOs have built something themselves, and nearly everybody is saying it’s hard to manage. So, there’s some interesting opportunity there to make SLOs even easier to use.

Wrapping up on SLOs, there are a few things to take away. SLOs help you be customer-centric: they help you know when customers are impacted and get those high-signal alerts to your team, so that when they’re alerted, they know customers are impacted and they know they need to respond. Fewer false positives. And on decision making: you can appropriately prioritize reliability and performance projects thanks to the data that SLOs provide.

Embracing change in the remediation process

Nate Heinrich: All right, moving on. Let’s talk about the second thing here: Embracing change. We just talked about SLOs and how they are excellent at providing high quality alerts, those high quality signals that let you know when customers are having a problem. But then what do you do when you respond? I’ve got a couple of things here. I think this is probably the most underrated yet most valuable thing you could do to help your on-call responders. I call it kind of a shortcut to mitigation, and it all starts with the increase in change over the years, right?

DevOps, agile software development: it’s all about constant change to drive improvement, lots of ideally smaller incremental changes with a feedback loop to allow some course correction along the way. And with that increase in change, you can see in the stats that over eighty percent of all user-impacting incidents are due to change. So given that increase in change, and given that statistic, what are some things we can do to help?

If we put ourselves in our SRE’s shoes here and see the error rate of a service increase, the data is trying to tell us a story. The telemetry itself, the metric we’re seeing here of this increased error rate, is maybe a couple of chapters of that story; there’s more it’s trying to tell you. Being able to bring in some of that change information completely changes how you interpret and fill in the gaps of that story. So what’s missing here? If the statistics are right and you’re able to bring in some of that change information, the SRE looking at this error rate could see something a little bit different. And the majority of the time, it could look something like this: “Hey, I see that my service has a high error rate, and I see that not too long ago somebody changed a feature flag related to the service, and the service was deployed.” Now the story changes completely for the on-call responder consuming it.

Now I’ve got a metric anomaly and these changes coincide. And in a lot of cases, this is the shortcut, right? This is the shortcut to mitigation and could be the first thing that you look at. It could be a quick rollback or a quick feature flag change to stop the bleeding, stop impacting customers and preserve those error budgets, right? 
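
Here’s a minimal sketch in Python of that shortcut: given the time an error-rate anomaly started and a centralized list of recent change events, surface the changes that landed just before it. The event data, field names, and one-hour lookback are illustrative assumptions.

from datetime import datetime, timedelta

# An illustrative, centralized list of recent change events.
changes = [
    {"time": datetime(2024, 5, 1, 13, 40), "type": "feature_flag", "detail": "checkout_v2 enabled"},
    {"time": datetime(2024, 5, 1, 13, 55), "type": "deploy",       "detail": "checkout-service v1.8.2"},
    {"time": datetime(2024, 5, 1,  9, 10), "type": "k8s",          "detail": "node pool resized"},
]

def changes_near(anomaly_start: datetime, lookback: timedelta = timedelta(hours=1)):
    """Return change events within `lookback` before the anomaly, newest first."""
    window_start = anomaly_start - lookback
    hits = [c for c in changes if window_start <= c["time"] <= anomaly_start]
    return sorted(hits, key=lambda c: c["time"], reverse=True)

# Error rate jumped at 14:05 -> the 13:55 deploy and 13:40 flag flip are the
# first hypotheses to check; the morning infrastructure change is filtered out.
for c in changes_near(datetime(2024, 5, 1, 14, 5)):
    print(c["time"], c["type"], c["detail"])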

So, what are some of the interesting properties of how you should think about solving this problem? How do you get this kind of fuller-picture story? There are a number of types of changes out there, so let’s talk about some of the key elements.

Centralize your changes

Nate Heinrich: I think the first one’s pretty obvious: you need to put all of your changes into one place. Ideally, you centralize them and you put them with your telemetry data so that they’re accessible to everyone as they’re consuming it. The second is that there are a number of different types of changes (we talked about two of them), so you need to be able to categorize them and tell them apart. Whether it’s a backend engineer deploying a service, a front end engineer changing a feature flag, or an infrastructure engineer making changes to Kubernetes or upgrading a data store, being able to quickly tell the difference between them goes a long way in understanding that story.

The third piece is organization and relevance, right? We talk about all these changes happening, but not all changes are relevant to the thing you’re looking at. You want to be able to connect the proper changes to the service that we’re looking at, to have a service-oriented view that tells you: “Here’s the status of that service, and here are the changes that have affected that service recently.” Connecting those two things together, a single view plus high relevance, is what gets you that actionability.
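
Pulling those three properties together, here is a minimal sketch in Python of what a centralized, categorized, service-scoped change feed could look like. The ChangeEvent structure, categories, and example events are illustrative assumptions, not a real Chronosphere schema.

from dataclasses import dataclass
from datetime import datetime

@dataclass
class ChangeEvent:
    timestamp: datetime
    category: str       # e.g. "deploy", "feature_flag", "infra"
    service: str        # which service the change affects
    description: str

EVENTS = []   # one centralized store, kept alongside the telemetry

def record(event: ChangeEvent) -> None:
    EVENTS.append(event)

def changes_for(service: str, categories=None):
    """Only the changes relevant to the service you're investigating."""
    return [e for e in EVENTS
            if e.service == service and (categories is None or e.category in categories)]

record(ChangeEvent(datetime(2024, 5, 1, 13, 55), "deploy", "checkout", "v1.8.2 rollout"))
record(ChangeEvent(datetime(2024, 5, 1, 13, 40), "feature_flag", "checkout", "checkout_v2 on"))
record(ChangeEvent(datetime(2024, 5, 1, 12, 0), "infra", "search", "Kubernetes node upgrade"))

# The checkout view shows its deploy and flag change, not the unrelated search infra work.
print(changes_for("checkout"))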

Julia Blase: Nate, you mentioned having it all in the same place. And I feel like that sometimes gets lost. People think: “As long as I’m tracking my change events somewhere, I’ll be able to correlate the change with the trend.” But the other thing to remember is we’re talking cloud native, we’re talking Kubernetes, all those things I mentioned earlier.

The pace of change is really fast. And, each moment you spend trying to pull up another system to find the change and then have those things side by side is a moment that the trend changes, and all of a sudden you’re no longer sure which change is relevant to which trend. So, I just wanted to “heavy plus-one” on having them all in one place because you can’t afford to take the time to go digging through multiple systems to find this because it’s just going to be much harder to find the relevant change if you have to do that.

Managing change in a repeatable way

John Potocny: And, just major emphasis on understanding relevant changes for what’s going on in your system. I think about all the times that I’ve been on-call and gotten paged. The first thing I always looked at was what’s broken that I’m getting alerted to, and then what are the things that could have changed that’s causing it to break. That was always a mental list in my head, which is fine if the potential sources of change you have to go and look at are small. But as we talked about earlier, the more complex these systems get, the more moving parts there are, and the more potential sources of change there are.

And in a dynamic environment that’s powered by something like Kubernetes, where you’ve got pods coming up and down all the time, there are literally constant sources of change happening. We don’t have to track an event for every pod that spins up, but we do want to filter out all the other noise that’s happening, so that the things I have to go and manually investigate as an on-call engineer are as small and bounded as possible. That way I can keep that mental list in my head and not have an unreasonable number of sources to go and look at.

Nate Heinrich: A great description of that relevance piece, right? I want to focus on the things that matter, consume those, and then develop my hypothesis from there. So, wrapping things up on tracking change and using it: the benefit really is less time triaging, which we just talked about, and that means more time for doing other things, building value and delivering it to customers. And I think it’s a good transition to talk a little bit about what happens when maybe it wasn’t a change. What happens when you don’t have a shortcut and there isn’t something that immediately helps you mitigate? I was curious, Julia, if you have any thoughts on how we could do that in a repeatable way.

Julia Blase: Yes, I do. And it’s really interesting to look into this problem and see, even when you do have a change, how do you figure out whether or not that’s the right thing to address? How do you understand whether or not that change is related at all, and what else might be driving the bad behavior in your system? What I’m going to talk about here is the methodology we see the really smart people following over and over to get through this part of it. To recap, this is what a day in the life looks like for an individual developer or maybe an SRE who’s on call: You get an alert that there’s a problem in your system.

Hopefully it’s based on a really well defined SLO, but you get an alert that maybe you’ve got one percent of your error budget left. You want to know why. This is suddenly a problem: maybe it’s something that woke me up at 4 in the morning, when I went to bed at eleven everything was fine, so why is there a problem? And specifically, you’re starting to look for what’s different that has made things not okay.

Having a reliable process that gets you to the root cause of the problem

Julia Blase: So you’re looking for those kinds of changes, anything quick and easy, anything clearly associated that you can go ahead and take action to get things back to normal. So, when you have that set of things that you think might have caused the problem, what do you do? Because of the way people have set up their systems today, we see them getting stuck a little bit in this part of the process. They get woken up and all of a sudden they’re scrambling to find the right change, or look at the right trend, or see the right data, and they call in someone who’s really knowledgeable.

We went and talked to a lot of those really knowledgeable people. You might call them our heroes, our experts. I bet you have a name in mind. If I ask: “Who are you going to call?” You probably say: “Oh, it’s Ryan.” Like, it’s definitely Ryan. I always call Ryan when I can’t figure out what’s going wrong.

So this is what the Ryans of the world do in this scenario. They start to build a hypothesis. It might be based on a change event they saw somewhere else, it might be based on a change event that seems really clearly correlated, or it might be based on some deep, secret, magical knowledge they have about what tends to happen every Monday morning at 9 a.m. They form these hypotheses, and then they go about proving and disproving them.

And what they see is that this is what helps them get to the root cause of problems faster. They’re not actually experts sitting there going: “Oh, I know exactly what happened.” They’re experts in saying: “I know exactly how to get to the answer. I have a process for getting to the answer, and that process is repeatable. It really doesn’t matter what the data looks like underneath; I can follow the process and I’ll get to the answer quickly.” And I think that was a really important thing for us to hear and then talk about internally: how do we make that process something that repeats?

The data always changes - but your process doesn’t have to

Julia Blase: Because yeah, the data always changes. The system always changes. The only constant is that there are a lot of changes, like Nate just said. So, we can’t guarantee that the data is going to look exactly the same, but we can give you a really good process for triaging those changes and figuring out what’s happening. We can take that deep knowledge of the expert SRE and codify it so that anyone can follow it. I will also say that today, when we talk to people following these processes for understanding or disproving that a change caused a problem, it takes a long time. As we got into this, we wanted to understand what the steps were, but we saw people following these steps using really custom built tools.

They might have several query languages. They might have a bunch of Google spreadsheets where they put data to try and correlate it, right? They might have scripted things that they run over their data. We do see people getting pretty in the weeds with the different tools they use to try to codify these processes. But again, just to emphasize, the process was almost always the same. It was something with really specific steps to follow, and in fact I can tell you what those steps are right here.

So, this was pretty much it, right? As we talk to the people who do really well in these environments, they always start with asking what happened: “Was there a system change, like a deploy? Was there an observability change, like maybe I added a new label or a bunch of new values for a label in my system? Was there a data or workload change? Did I onboard a new region? Did a bunch of people start coming in from a new region or cluster that I hadn’t seen a lot of traffic in before?”

So, is there an actual workload or data change underneath my system? Then they take the thing that changed, or maybe the set of things. John, you were saying: “I want to reduce the surface area of things that I investigate.” So now maybe I only have three. They take those three things as the hypotheses: it’s one of these three. And for each of them, they start to do comparisons. They look at the things that were erroring versus the things that were successful, the things that were slow versus the things that were fast, things now where the bad behavior is versus things earlier when everything was calm and my error budget was fine.

The power of hypothesis-based troubleshooting

Julia Blase: And they start comparing, and they start to see if that change is common in one category or the other. So is that feature flag always in the errors and always in the slow things, or not? Sometimes it can actually be easier to disprove: “Oh, that feature flag is in both the errors and the successes. Great, that’s not actually the cause of my problem. Let me go back to the deploy and do the same thing. Oh, when I do it for the deploy, I actually see something else. I see that there’s a region that’s different between those. Now I have a new hypothesis, and it’s about the region. I can go do the same thing and see, okay, that region is definitely associated with the things that are slow.” So they take this hypothesis, they run these comparisons, see if it’s common or not, and work to prove or disprove it. And they’re looking for outliers.
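
Here’s a minimal sketch in Python of that comparison step: for a candidate attribute, such as a feature flag or a region, check whether it shows up disproportionately in failing requests versus successful ones. The request records and the crude threshold are illustrative assumptions, not Chronosphere’s algorithm.

def prevalence(requests, key, value):
    """Fraction of the given requests that carry attribute key=value."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if r.get(key) == value) / len(requests)

def test_hypothesis(errors, successes, key, value):
    err_p, ok_p = prevalence(errors, key, value), prevalence(successes, key, value)
    if err_p > ok_p + 0.3:   # crude "outlier" threshold, purely illustrative
        return f"{key}={value}: {err_p:.0%} of errors vs {ok_p:.0%} of successes -- likely related"
    return f"{key}={value}: {err_p:.0%} of errors vs {ok_p:.0%} of successes -- probably not the cause"

errors    = [{"region": "us-east-2", "flag": "on"}, {"region": "us-east-2", "flag": "off"}]
successes = [{"region": "us-west-1", "flag": "on"}, {"region": "us-west-1", "flag": "off"}]

print(test_hypothesis(errors, successes, "flag", "on"))           # flag appears in both groups -> disproved
print(test_hypothesis(errors, successes, "region", "us-east-2"))  # region only in errors -> new leading hypothesis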

They just want to have confidence. They want to see that something is an outlier in the things that look bad, so that they have confidence they can go and change that thing and that changing it will address the problem. So: hypothesis-based troubleshooting. We then went and started to look at other industries to see where else this kind of process might be common, and the other place we found it is in the medical industry, right? You have doctors.

I don’t want to say being a doctor is like being a software developer; obviously they are very different professions. But you have some similar challenges: you’ve got a lot of symptoms. Human bodies are very complex, Kubernetes systems are very complex. You’ve got a lot of symptoms, they’re changing all the time, they’re evolving, they’re reacting to things in their environment, and something is wrong, right? You have a stomach ache; the system has a low error budget.

Maybe I’m stretching this metaphor a little bit, and that’s okay. But the point is, you want a workflow you can use the same way every time to quickly get to a probable cause, so you can fix it faster. And so, we thought it was really interesting to look at that medical process and experience, see what they were doing, and maybe take that approach into software development.

So the metaphor maybe is a bit of a stretch, but is the approach a stretch? I think, and we think, not. We actually think this is a really productive way to think about triaging problems in your system. It’s a systematic approach, so it’s something you can codify and document, and people can repeat it over and over. It surfaces insights from your existing complex data, so it actually takes that complexity as a benefit. It’s not that I have to reduce complexity to find the source of the problem; I can lean into complexity and say: “Hey, I know how to triage and get through that complexity.” And so that complexity actually becomes a benefit.

Maybe I even want to increase complexity so that I can get more fine grained information about what’s an outlier. So we really liked that it was a process that leaned into the complexity of modern software. And it does help you find the answers and end the bridge-call faster. We talk about lowering the mean time to sleep or mean time to return to sleep, but it’s so much easier to process through some complex data, a lot of different hypotheses, change events, workload events, and get back to sleep if you have something that’s repeatable and systematic. 

And thinking of this as a true workflow that you need to make repeatable is a really helpful way to finish that process: find the problem, reverse the problem, maybe take that cloud region back offline or revert that feature flag, and get back to sleep. That’s ultimately everyone’s goal, right? We don’t want to be up at four in the morning. We don’t want to be fixing things. We want to get back to sleep so we can wake up, have our coffee, and go back to writing awesome code that makes our company money.

John Potocny: And you said it earlier: when we were talking to senior engineers, this is effectively the process they were using to resolve incidents when things got escalated to them. It sounds like the hardest part of this for most people is just the complexity of the systems, the volume of data, and the number of potential hypotheses you can have, where you have to know where to go in order to prove or disprove each one, right?

Julia Blase: This existing complex data, this existing complex system, is a huge component of it. Even if you have put those events in the same place, and you’ve correlated them with services and you think they’re here, there’s so much information around those events. Sometimes you see multiple deploys because there was a platform deploy, so a bunch of things went out that might’ve touched your system. Feature flags changed at the same time as the deploy, so is it that interaction that’s causing the problem? There are just endless places you can use to build new hypotheses, and the important thing is to triage those and move through them very quickly to get to the right one.

John Potocny: There’s a few things that we want to go a layer deeper on that are specifically relevant to the key components we’ve been talking about throughout the webinar. The first is Chronosphere Lens. This is a discovery framework that we have built out to take all the relevant telemetry data across different data sources, whether it’s open source or proprietary telemetry data, and integrate and centralize it around the services that you are providing (owning, maintaining, operating, developing), so you have a clearly identifiable mental understanding of what’s going on in the system that aligns to how you might think about things yourself as a service owner, and so you have all the relevant contextual information for each service to help you understand what’s going on with it during an investigation.

And that includes things like what has changed with this service recently. So, when you are getting alerted that you are burning through your error budget for a service, you can go and take a look at that service in Lens and see if there’s been a deployment, a feature flag change, or any other type of relevant change happening in the system that would impact its behavior in the way you’re seeing with your SLOs.

And when that happens, we can go and further analyze the information if we need to develop hypotheses about what could be going on and prove to ourselves that a deployment is the cause of the issue or is not, in which case we can look elsewhere. Nate, I don’t know if there’s anything in particular that I’ve left out that you want to add here.

Julia Blase: I was actually thinking about something, John, as you mentioned this. It almost seems oversimplified to say: “Hey, be able to see a service and all the related things.” But having that in one click, without having to write a query to retrieve all that information or go to all those different systems, matters. Like we touched on earlier, you can have a ton of tabs; I know I have crazy tabs all the time. But you lose a little context every time you switch and have to look at something new, or learn to write PromQL or LogQL or TraceQL to pull back that information. It’s been really powerful to see how much having it on one page, with one click, without actually needing to write a query, has made this more accessible and more helpful for people.

John Potocny: Yeah, absolutely. It’s a great call out. A big part of what’s being done here is avoiding the need to manually curate the information. Anyone can build a dashboard to put relevant information together, but it’s a different matter to know how to access information as well as understanding what’s relevant. Being able to do it in an automatic fashion is a huge value add, I think, for our customers. 

Nate Heinrich: And the only thing I’d add is that sometimes your change wasn’t the problem. Sometimes you didn’t make a change, but we have these interdependent systems, and one of your upstream or downstream dependent services did make a change. Finding that out, even though you don’t own that service, can be challenging. But when you can do it quickly, it’s a very effective way of understanding the state of your system over the last few hours, and it gives you some hints on where to investigate when your system is having a problem but you didn’t change anything.

John Potocny: And we alluded to this, but when you do have a change or a hypothesis that you want to go and investigate and prove or disprove, we’ve built our own dedicated capability to help with this hypothesis-driven troubleshooting. We actually call it differential diagnosis, a nod to the medical analogy from earlier. It’s based on distributed tracing data: being able to take advantage of the context-aware traces within your interdependent cloud native environment to analyze, for a given component in the system, what’s happening in successful requests, failing requests, fast requests, and slow requests, and how that has changed compared to a historical point in time.
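
Here’s a minimal sketch in Python of that idea: split spans for one service into “bad” (failed or slow) and “good,” count how often each attribute value appears in each group, and rank the biggest differences. The same comparison could also be run against a historical baseline. The span records and field names are illustrative assumptions, not a tracing API or Chronosphere’s implementation.

from collections import Counter

def attribute_diff(bad, good, key):
    """Attribute values ranked by (share in bad spans) - (share in good spans)."""
    bad_share  = Counter(s[key] for s in bad if key in s)
    good_share = Counter(s[key] for s in good if key in s)
    values = set(bad_share) | set(good_share)
    diffs = {v: round(bad_share[v] / max(len(bad), 1) - good_share[v] / max(len(good), 1), 2)
             for v in values}
    return sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)

bad  = [{"version": "v1.8.2"}, {"version": "v1.8.2"}, {"version": "v1.8.1"}]
good = [{"version": "v1.8.1"}, {"version": "v1.8.1"}, {"version": "v1.8.1"}]

print(attribute_diff(bad, good, "version"))
# [('v1.8.2', 0.67), ('v1.8.1', -0.67)] -> the new version dominates the bad spans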

Catch the full webinar here and watch a demo of our features. 

