Why are we so bad at mean time to repair (MTTR)?


Ninety-nine percent of companies aren’t meeting their MTTR goals. So what can we do to close the gap between desired MTTR and the reality?

Rachel Dines, Head of Product & Solution Marketing | Chronosphere

Rachel leads Product & Solution Marketing for Chronosphere. Previously, she built out product, technical, and channel marketing at CloudHealth (acquired by VMware). Prior to that she led product marketing for AWS and cloud-integrated storage at NetApp and also spent time as an analyst at Forrester Research covering resiliency, backup, and cloud. Outside of work, she tries to keep up with her young son and hyper-active dog, and when she has time, enjoys crafting and eating out at local restaurants in Boston.


A recent survey uncovered a fact that I found extremely depressing: Only 1% of companies are meeting their mean time to repair targets. That’s right: 99% of companies miss their targets. Before you call shenanigans, let me explain the methodology: The survey, which went to 500 engineers and engineering leaders at U.S.-based companies, first asked: “How long, on average, does it usually take your company to repair an issue?” Next, it asked: “What is your target mean time to repair an issue?” The results were grim. Only 7 of the 500 respondents met or exceeded their MTTR goal (their average MTTR was the same as or less than their target MTTR).

How much did they miss by? The average company was aiming for an MTTR of 4.7 hours, but actually achieved an MTTR of 7.8 hours. Woof. Why and HOW are we so bad at this? I have a few theories. Let’s dig into them.
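
To make that gap concrete, here’s a minimal sketch in Python of how the comparison works, using made-up respondent data (the real survey responses obviously aren’t reproduced here). A respondent “meets” their target if their reported average MTTR is at or below their stated target.

    # Sketch of the survey comparison, with hypothetical data -- each tuple is
    # (reported average MTTR in hours, target MTTR in hours) for one respondent.
    respondents = [
        (7.5, 4.0),
        (3.9, 4.0),
        (9.1, 6.0),
        # ... the actual survey had 500 rows
    ]

    meeting = [actual for actual, target in respondents if actual <= target]
    share_meeting = len(meeting) / len(respondents)

    avg_actual = sum(a for a, _ in respondents) / len(respondents)
    avg_target = sum(t for _, t in respondents) / len(respondents)

    print(f"{share_meeting:.0%} of respondents meet or beat their target")
    print(f"average actual MTTR {avg_actual:.1f}h vs. average target {avg_target:.1f}h")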

No one knows what mean time to repair really means

MTTR stands for “mean time to repair” and measures how long it takes to get from the first alert that something has gone wrong to remediating the issue. I state that like it’s a fact, but I’ve also seen MTTR stand for:

  • Mean time to restore
  • Mean time to respond
  • Mean time to remediate
  • Mean time to rachel

But given we specifically asked in the survey, “What is your mean time to repair,” I feel confident I know which “r” people were answering for. But then again, what does “repair” actually mean? I asked several people and got two different schools of thought:

  1. The time it takes to get systems back to some semblance of operation.
  2. The time it takes to completely fix the underlying problem.

So which is it? That ambiguity is one of the reasons it’s impossible to benchmark your MTTR against your peers. But you CAN benchmark against yourself, and that still doesn’t explain why 99% of companies aren’t meeting their own expectations for MTTR.

Observability tools are failing engineers during critical incidents

My second hypothesis also comes from data within the survey: The observability tooling, from alerting to exploration and triage to root cause analysis, is failing engineers. Here’s why I say that: When we asked individual contributor engineers for their top complaints about their observability solutions, nearly half of them said it requires a lot of manual time and labor. Forty-three percent of them said that alerts don’t give enough context to understand and triage the issue.

If the tools they’re using to repair issues take a lot of manual time and labor and don’t provide enough context, that is likely a large part of what is causing companies to miss their MTTR targets.

Cloud native is a blessing and a challenge

Many leading organizations are turning to cloud native to increase efficiency and revenue. The respondents in this survey reported that, on average, 46% of their environments were running in containers and microservices, a number they expected to increase to 60% in the next 12 months. While that shift has brought many benefits, it has also introduced exponentially greater complexity, higher volumes of data, and more pressure on the engineering teams who support customer-facing services. When managed incorrectly, it has significant negative customer and revenue impacts.

This isn’t a controversial statement. The overwhelming majority of respondents in the survey (87%) agree that cloud native architectures have increased the complexity of discovering and troubleshooting incidents.


So how do we get better?

If we dissect the MTTR equation, it might look something like this:

MTTR = time to detect + time to remediate + time to repair
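
As a rough illustration of that breakdown, here’s a minimal sketch that computes each piece from incident timestamps. The timeline and field names are hypothetical, not pulled from any particular tool; the point is simply that MTTR is the sum of three stages you can measure and attack separately.

    from datetime import datetime

    # Hypothetical incident timeline -- field names are illustrative only.
    incident = {
        "started_at":    datetime(2023, 5, 1, 12, 0),   # the fault begins
        "detected_at":   datetime(2023, 5, 1, 12, 4),   # first alert fires
        "remediated_at": datetime(2023, 5, 1, 13, 30),  # customer pain stops
        "repaired_at":   datetime(2023, 5, 1, 16, 45),  # underlying issue fixed
    }

    time_to_detect    = incident["detected_at"] - incident["started_at"]
    time_to_remediate = incident["remediated_at"] - incident["detected_at"]
    time_to_repair    = incident["repaired_at"] - incident["remediated_at"]

    total = time_to_detect + time_to_remediate + time_to_repair
    print(f"detect: {time_to_detect}, remediate: {time_to_remediate}, "
          f"repair: {time_to_repair}, end to end: {total}")
    # Average the end-to-end figure across incidents and you have your MTTR.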

I would start by attacking each piece of the equation and finding ways to shorten each step:

  • Time to detect: Time to detect (TTD) is often overlooked, but it can be one of the ways to cut down your MTTR. The key to reducing mean time to detect (MTTD) is making sure you are collecting data as frequently as possible (set a low scrape interval; see the config sketch after this list) and that your tooling can ingest data and generate alerts rapidly. In many industries, this is where seconds make a big difference. For example, Robinhood used to have a 4-minute gap between an incident happening and an alert firing. FOUR MINUTES. That’s a long time in a heavily regulated industry with high customer expectations. By upgrading its observability platform, it shrank MTTD to 30 seconds or less.
  • Time to remediate: This is the amount of time it takes to stop the customer pain and put a (potentially temporary) fix in place. Once you’ve shrunk the time to detect, you can attack the gap between detection and remediation. This is typically done by providing more context and clear, actionable data in the alerts (see the alert rule sketch after this list). Easier said than done! In the recent survey, we found that 59% of engineers say that half of the incident alerts they receive from their current observability solution aren’t actually helpful or usable. We already saw that 43% of engineers frequently get alerts from their observability solution without enough context to triage the incident. When alerts spell out which services or customers are affected by an incident and link the engineers to actionable dashboards, the time to remediate shrinks a lot.
  • Time to repair: If you’ve optimized the time to detect and the time to remediate, the time to repair actually becomes a less critical metric. Why? Assuming you define time to repair as the time to fix the underlying issue AFTER remediating, customers are not affected while the repair is happening. It’s still something to keep an eye on, since a long time to repair often means that teams are stuck in endless troubleshooting loops, but it doesn’t have as direct an impact on customer satisfaction.
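
On the detection side, “set a low scrape interval” looks something like the following in a Prometheus-style configuration. This is only a sketch under the assumption that you run Prometheus-compatible tooling; the job name and target are placeholders, and the right interval is a trade-off against ingest volume and cost.

    # prometheus.yml (sketch) -- collect and evaluate every 15 seconds so an
    # alert can fire soon after a threshold is crossed, not minutes later.
    global:
      scrape_interval: 15s        # how often metrics are collected
      evaluation_interval: 15s    # how often alerting rules are evaluated

    scrape_configs:
      - job_name: "checkout-service"      # placeholder job
        static_configs:
          - targets: ["checkout:9090"]    # placeholder target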

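On the remediation side, the same idea applies to the alerts themselves: a Prometheus-style rule whose labels and annotations carry the context an on-call engineer needs. The expression, threshold, and URLs below are placeholders, not a recommendation; the point is that the alert should say what is affected and link straight to a dashboard and runbook.

    # alert-rules.yml (sketch) -- an alert that ships its own context.
    groups:
      - name: checkout-availability
        rules:
          - alert: CheckoutErrorRateHigh
            expr: |
              sum(rate(http_requests_total{job="checkout-service", code=~"5.."}[5m]))
                / sum(rate(http_requests_total{job="checkout-service"}[5m])) > 0.05
            for: 2m
            labels:
              severity: page
              team: payments
            annotations:
              summary: "Checkout 5xx error rate above 5% for 2 minutes"
              impact: "Paying customers cannot complete checkout"
              dashboard: "https://example.com/d/checkout-overview"  # placeholder
              runbook: "https://example.com/runbooks/checkout-5xx"  # placeholder
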
Want to dig into this data further? The whole study that this data is based on is available here. I encourage you to give it a read and reach out to me on Twitter or Mastodon if you have any additional insights to share.
