How to reduce mean time to remediate today


Missing mean time to remediate (MTTR) windows is shockingly common for organizations.


Without the right tools, engineers don’t know what data is most useful and can provide the most context to quickly remediate systems. So what can we do about it?

In our second live demo series, “Learn How to Reduce MTTR and Make Distributed Tracing Useful,” we explain how Chronosphere is democratizing distributed tracing to make trace data more easily understandable.

Chronosphere’s Senior Sales Engineer Gilles Ramone and Senior Product Marketing Manager April Yep chat about trace derived metrics, topology view, deep linking, and more. If you don’t have time to sit down for the full recorded conversation, check out the transcript below.

Learning how to make use of data

April: Gilles and I are going to talk about learning how to reduce MTTR and make distributed tracing useful, in this live demo series. For those of you who aren’t familiar with Chronosphere, I just wanted to take a minute to talk about us. We can ingest metrics and traces. We take that data into Chronosphere, a single-tenant SaaS observability platform, and make that data useful for you with:

  • Notifications
  • Triage ability
  • Root cause analysis
  • Contextualized data
  • Data control in your existing dashboards and alerts

We have high reliability and performance. You can see some of those data points here: 50% less time spent troubleshooting, 48% average data reduction after transformation, and 99.99% historically delivered uptime at scale. We are 100% open source compatible.

Now, let’s go into the focus we’re going to have today. You know that we can ingest metrics and trace data, [but] we’re going to focus on traces. And with that, I’m going to hand it over to Gilles to talk about the demo.

Gilles: We are indeed going to focus on traces, but at the same time, we realize that outcomes are the most important for all of us. We will blend the contribution of metrics, traces, and logs into a full incident resolution scenario. You will see, and I will point out at the right time, that some of these signals are coming from metrics, some of them are coming from traces, some of them from logs. But again, you’ll see all of that in action.

We realize that you might not all be very familiar with Chronosphere in general, so we’ll start with a quick look at our user experience.

An eye into Chronosphere’s user experience

Gilles: [The Chronosphere user interface] is very streamlined, right? It’s contextual – as April mentioned earlier – which means that you’re not going to be bombarded with a lot of dashboards, a lot of alerts that are not necessarily your concern. They might be somebody else’s concern. With that comes a curated experience where you can see here at the top, the alerts and the monitors that pertain to a particular service or a particular team, your team.

But again, not bombarded with a ton of data. [It’s a] very focused experience. Now, having said that, the reason why we are in here right now, is because we’ve been alerted that there is a high level of errors with a particular service; the order service, that is. This might come through PagerDuty or whatever tool you use for that purpose.

The alert is displayed here for convenience as well. You essentially just click on the alert that is of interest to you, like in this case, this list orders error on the order service. And you can see right away why this alert was triggered: it is above the threshold for triggering this alert, here. We already see a little bit more about what’s going on.

It’s very high level. We’ll drill very quickly into the issue that is of importance to us. But, we can already see that, although this is about the ordering service, it’s actually the list orders – method or operation – which is flashing red right now. And, it seems like overall, the checkout is flashing yellow. This is our starting point. It’s too early for us to tell what’s going on.

I should point out that, at this stage, the error that we got is very generic. We will actually fix that over time, and this is why the MTTR will get so drastically reduced. Because we’re now on this page, we have the ability to start drilling into the issue and essentially get the vitals about this order service.


The trifecta: metrics, traces, and logs

Gilles: Now, you can also see here in this particular dashboard … we’ve got the log volume. In this case, we are able to inspect the logs wherever they happen to be stored, right? In this case, we are storing them in Google Cloud Platform (GCP). You could store them pretty much anywhere you want, but we can still give you a reading of the distribution of logs by level.

For example, whether it’s info, warning, or error, and that is a signal that we can then alert on. I will say that many of our customers who don’t necessarily have instrumentation rely on logs. We can absolutely work with those logs to get signals from them. And certainly, over time, you can turn those log signals into metrics, for example.

But that is, again, something that can be planned ahead; metrics, traces, and logs all come together. In our case, we are concerned with the number of errors and the P99 (99th percentile latency) duration.
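To make those two signals concrete, here is a minimal Python sketch of how an error count and a P99 duration can be computed from raw span data. The span records and field names are made up for illustration; in Chronosphere these numbers come from the dashboard, not from hand-written code.

```python
import math

# Hypothetical span summaries, purely for illustration.
spans = [
    {"service": "ordering", "duration_s": 0.8,  "error": False},
    {"service": "ordering", "duration_s": 12.4, "error": True},
    {"service": "ordering", "duration_s": 11.9, "error": True},
    {"service": "ordering", "duration_s": 0.5,  "error": False},
]

def error_count(spans):
    """Number of spans flagged as errors."""
    return sum(1 for s in spans if s["error"])

def p99_duration(spans):
    """99th percentile latency using the nearest-rank method."""
    durations = sorted(s["duration_s"] for s in spans)
    rank = math.ceil(0.99 * len(durations))   # nearest-rank index, 1-based
    return durations[rank - 1]

print("errors:", error_count(spans))              # -> errors: 2
print("p99 duration (s):", p99_duration(spans))   # -> p99 duration (s): 12.4
```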

In this case, I’m seeing query traces for which the duration is over essentially 10 seconds. What I did is seamlessly switch from a metrics approach, to start looking into my traces. I do that without having to be an expert in tracing. That is definitely something to keep in mind, because what’s happened in the industry is that typically distributed tracing is a very appealing proposition.

But when it comes to implementation, only a few people know how to use it, which really hinders adoption. And, essentially, it’s discouraging for people.

But this isn’t the case with Chronosphere. What you’re going to be able to do here, is continue to drill down, let Chronosphere analyze all of the traces as you continue to refine the traces that you’re then going to investigate. You’re not going to go into a wild goose chase, or be misled by what seems to be the golden trace. Instead, you have full control, you’re analyzing, with Chronosphere, the behavior of those traces. So let’s see that in action.


April: You’ll notice that when [Gilles] comes in here and hits Query Traces, it automatically builds the query in the “include summary” section. So you don’t have to figure that out yourself. He showed in the dashboard: “Hey, I want query traces,” and it already figures out the criteria for you.

Zooming in and refining to triage

Gilles: We’re going to continue to build that search, as we get guided by Chronosphere. But, there are some things that you might also be interested in terms of triage right before you get to anything else, like: “What is the scope of the issue here?” You can look at it in different ways.

One of them is to use the top tags here, which tell us the tags that are part of the behavior you were seeing, part of this error. [Then] we can start looking for patterns. For example, if we see a hundred percent error for a particular tag, we know something is up – for example, the build version. That can already tell you: “Okay, there might be a new release, there is a new build, and this build version is part of the issue.” If you need to just roll back, you can do that immediately. So that’s one thing.
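That “look for a tag at 100% error” step amounts to grouping spans by tag value and computing an error rate per value. Here is a minimal sketch of the idea, using a hypothetical build_version tag; it is not how the product computes top tags, just an illustration of the pattern you are scanning for.

```python
from collections import defaultdict

# Hypothetical spans, each carrying a build_version tag and an error flag.
spans = [
    {"build_version": "1.4.2", "error": True},
    {"build_version": "1.4.2", "error": True},
    {"build_version": "1.4.2", "error": True},
    {"build_version": "1.4.1", "error": False},
    {"build_version": "1.4.1", "error": False},
]

def error_rate_by_tag(spans, tag):
    """Return {tag_value: error_rate}; a value sitting at 100% stands out immediately."""
    totals, errors = defaultdict(int), defaultdict(int)
    for span in spans:
        value = span[tag]
        totals[value] += 1
        if span["error"]:
            errors[value] += 1
    return {value: errors[value] / totals[value] for value in totals}

print(error_rate_by_tag(spans, "build_version"))
# -> {'1.4.2': 1.0, '1.4.1': 0.0}: every error comes from the new build, so roll it back
```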

Something else here is that you are able to see a topology of what is going on. You probably have hundreds of services across your different clusters, right? We definitely have a view for all of them, but it wouldn’t be that relevant when you’re trying to find the issue in this particular case. Instead, we are zooming in on only the services that participate in the issue. In this case, only the portal service, the ordering service, and the backend are part of the topology that we are zooming in on.

It’s about finding the needle in a haystack. And we’re doing a good job here of refining our search to do that. To that point, you can continue on and say not only: “Hey, show me everything ready to the ordering service,” but also: “Show me everything that is related to anything coming from the portal service.” So, you can refine your search included in the span filter.

To April’s point earlier, my search summary is being built as well. So, I don’t have to know ahead of time, all of the labels, all of the tags that are available in my environment, and figure it out. It’s all based on a natural approach – refining the list of queries and traces that I’m getting back. When I feel that I’ve reached a point where [I think]: “Okay, this is actually good for me,” I can actually also take a look at what was happening before.


This is also very helpful: [seeing] what was happening immediately before this behavior started in the environment. If I look at my statistics, I’ll see the numbers right before the incident (which is B) and right after (which is A). Clearly something is up in this case: the ordering service is getting a lot of requests, about 1,100 per second, versus hardly any before. So that’s very helpful, to have this baseline to compare against.
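That before/after view is essentially a baseline comparison: the same statistic over the window just before the incident (B) and the window just after it starts (A). A tiny sketch with invented numbers in the same ballpark as the demo:

```python
# Invented request-rate samples: window B is just before the incident,
# window A is just after it starts.
window_b = [3, 2, 4, 3, 2]            # requests/second before
window_a = [1080, 1120, 1095, 1105]   # requests/second after

def mean(values):
    return sum(values) / len(values)

baseline, current = mean(window_b), mean(window_a)
print(f"before: {baseline:.0f} req/s, after: {current:.0f} req/s, "
      f"roughly {current / baseline:.0f}x the baseline")
# -> before: 3 req/s, after: 1100 req/s, roughly 393x the baseline
```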

But at the end of the day, you will actually get to the traces themselves. At this stage, we have refined the list of traces, so when we start digging into one of them, we’re confident it is relevant to what we’re looking for. In this case, I can take a look at one and see its duration and its number of spans. There are actually a lot of them, about 3,000 spans in that particular trace, so definitely something is up, and I can click on [the trace]. This is probably a view that you’re all very familiar with when it comes to distributed tracing, or tracing in general: the waterfall view, which is absolutely critical. You can see everything that has happened, right?

What’s differentiating in terms of being successful or not successful with distributed tracing has a lot to do with how you get to that trace, right? Not the trace itself, which of course is a very important, but somewhat table stakes. Hopefully you saw when you use Chronosphere, you can refine that search to get to it. We can see here, again, the waterfall view, and the different calls – from the portal service into the ordering service.

More than investigating

Gilles: In this ordering service, you see that there are over 3,000 calls initiated by the ordering service, and they all seem to go to this MySQL backend. There are over 3,000 calls to the MySQL backend. This is clearly a case where somebody made a mistake: iterating through a collection and hitting the database for every single item in the collection, instead of using a batch query, right?
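The mistake Gilles describes is the classic N+1 query anti-pattern. The sketch below uses Python and an in-memory SQLite table purely for illustration (the demo’s backend is MySQL): the first function issues one query per item, which is exactly how thousands of near-identical spans end up in one trace, while the second issues a single batch query.

```python
import sqlite3

# Toy in-memory database, purely for illustration; the demo's backend is MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 9.99) for i in range(1, 501)])

order_ids = list(range(1, 501))

# Anti-pattern: one query per item in the collection. Each call is a database
# round trip (and a span); in the demo, this showed up as ~3,000 spans in one trace.
def fetch_one_by_one(ids):
    return [conn.execute("SELECT total FROM orders WHERE id = ?", (i,)).fetchone()[0]
            for i in ids]

# Fix: a single batch query for the whole collection, so one round trip and one span.
def fetch_in_batch(ids):
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(f"SELECT total FROM orders WHERE id IN ({placeholders})", ids)
    return [total for (total,) in rows]

assert sorted(fetch_one_by_one(order_ids)) == sorted(fetch_in_batch(order_ids))
```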

By going through this investigation, we understand what the problem is. And, remember, the title of our presentation today is: “How to reduce MTTR.” While part of how you reduce MTTR is by going through this investigation process, there’s actually more to the story, and I’m about to show you exactly what I mean by that.

We have created a query that returns a number of traces that are exhibiting this error behavior. In our case, I can even complement this query by adding a span filter, for example saying: “Hey, if I’ve got a span count that is greater than maybe 50, something is up, right?” I can just do that.

Now I know that if that trace query returns any results, there is an issue. And this issue is the anti-pattern that I just investigated. So, instead of having to go through all of that again the next time this issue occurs, I can create a metric. It is going to be a metric that is derived from your traces.
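Conceptually, a trace-derived metric is a count of traces matching the saved query, evaluated on a schedule so it can be graphed and alerted on like any other metric. The sketch below illustrates only that idea; it is not Chronosphere’s API, and the function and field names are hypothetical.

```python
# Hypothetical stand-in for the saved trace query; real results would come
# from the trace store, not a hard-coded list.
def query_traces(service, min_span_count):
    traces = [
        {"trace_id": "a1", "service": "ordering", "span_count": 3042},
        {"trace_id": "b2", "service": "ordering", "span_count": 12},
    ]
    return [t for t in traces
            if t["service"] == service and t["span_count"] > min_span_count]

def batch_query_anti_pattern_metric():
    """Trace-derived metric: how many traces currently match the anti-pattern query."""
    return len(query_traces(service="ordering", min_span_count=50))

# Evaluated on a schedule, the metric can back a specific alert instead of a generic one.
value = batch_query_anti_pattern_metric()
if value > 0:
    print(f"ALERT: anti-pattern detected in the order service ({value} matching traces)")
```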

Monitoring metrics with Chronosphere

Gilles: And I can give it a name … like batch query anti-pattern, and save it. Once I save [the metric], Chronosphere is going to continuously monitor it, and I can create an alert based on it. So if there are new results for this query, the alert that gets triggered will not be the generic: “Hey, there are errors with the order service.” Instead it will say: “There is an anti-pattern detected in the order service itself.”

Once you’ve created this metric, you can also explore it in those trace metrics. Because I created this metric, I can now see it as part of my trace metrics. And you see here, I have some results; I have not fixed the issue yet. In an actual case, the on-call engineer would already have solved the issue based on the research that we’ve done together. But now, I have the ability to alert based on whether or not there are results returned by this trace metric.

The result is that on the landing page for my order service, I will be able to create a monitor [and] an alert. And that alert will be based on this batch query anti-pattern query.

The outcome is that you have now been able to identify very specific errors, and the next time an on-call engineer has to deal with them, the errors will be clear to them and they will be able to resolve them even faster.

Hopefully, this helps you realize the value that you can get with Chronosphere, when it comes to reducing your MTTR, and making distributed tracing relevant.

Chronosphere’s topology view and critical path

April: We talked about dashboards, alerts, and RED metrics. We dove into traces … as you saw, it’s pretty easy to have a query built for you by [Chronosphere]; it’s click and go. Then [we] understood service dependencies and flows with the topology view, which is one area that is unique to Chronosphere.

One area that Gilles didn’t touch on was the critical path. There is a critical path feature for when you’re saying: “Hey, I just want to know what part of this trace is most critical for my infrastructure and my workflow, and how is it really impacting things?”
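For readers new to the term, the critical path of a trace is the chain of spans that actually determines its end-to-end duration. The sketch below is a deliberately simplified heuristic, not Chronosphere’s implementation: starting from the root span, it repeatedly follows the child that finishes last, since that child gates when its parent can finish.

```python
# Hypothetical spans with start/end times in seconds relative to the trace start.
spans = [
    {"id": "root",   "parent": None,     "start": 0.0, "end": 12.0},
    {"id": "auth",   "parent": "root",   "start": 0.1, "end": 0.4},
    {"id": "orders", "parent": "root",   "start": 0.5, "end": 11.8},
    {"id": "mysql",  "parent": "orders", "start": 0.6, "end": 11.5},
]

def critical_path(spans):
    """Simplified heuristic: from the root, keep following the last-finishing child."""
    children, root = {}, None
    for s in spans:
        if s["parent"] is None:
            root = s
        else:
            children.setdefault(s["parent"], []).append(s)
    path, current = [root], root
    while children.get(current["id"]):
        current = max(children[current["id"]], key=lambda s: s["end"])
        path.append(current)
    return [s["id"] for s in path]

print(critical_path(spans))   # -> ['root', 'orders', 'mysql']
```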

Then we just saw how to build trace-derived metrics for a repeatable workflow. So that way, other engineers don’t have to figure out: “Okay, well how do I get an alert on this again?” or “how do I measure this?”

The other aspect is that we have deep linking between metrics, traces, and logs. We didn’t spend a whole lot of time on this particular piece, but we saw that between metrics and traces it’s pretty easy to navigate between the two. Then with logs, if they’re stored in GCP, for example, we can give you a permalink to that log data.

Catch up on the rest of the chat at minute 20:08 for Q&A.


Ready to see it in action?

Request a demo for an in-depth walkthrough of the platform!