Ninety-nine percent of companies aren’t meeting their MTTR goals. So what can we do to close the gap between desired MTTR and the reality?
On: Jan 19, 2023
Rachel leads Product & Solution Marketing for Chronosphere. Previously, she built out product, technical, and channel marketing at CloudHealth (acquired by VMware). Prior to that she led product marketing for AWS and cloud-integrated storage at NetApp and also spent time as an analyst at Forrester Research covering resiliency, backup, and cloud. Outside of work, she tries to keep up with her young son and hyper-active dog, and when she has time, enjoys crafting and eating out at local restaurants in Boston.
A recent survey uncovered a fact that I found extremely depressing: Only 1% of companies are meeting their mean time to repair targets. That’s right, that means 99% of companies miss their targets. Before you call shenanigans, let me explain the methodology: The survey, which went to 500 engineers and engineering leaders at U.S.-based companies first asked: “How long, on average, does it usually take your company to repair an issue?” Next, the survey asked them to answer “What is your target mean time to repair an issue?” The results were grim. Only 7 of the 500 responded that they met or exceeded their MTTR goal (their average MTTR was the same or less than their target MTTR).
How much did they miss by? The average company was aiming for an MTTR of 4.7 hours, but actually achieved an MTTR of 7.8 hours. Woof. Why and HOW are we so bad at this? I have a few theories. Let’s dig into them.
MTTR stands for “mean time to repair” and is a measurement of how long it takes from the first alert that something has gone wrong to remediating the issue. I state that like it’s a fact, but I’ve also seen MTTR stand for:
But given we specifically asked in the survey, “What is your mean time to repair,” I feel confident, I know which “r” people were answering for. But then again, what does repair actually mean? I asked several people and got two different schools of thought:
So which is it? That’s one of the reasons that this is impossible to benchmark against your peers. But you CAN benchmark against yourself, and this still doesn’t explain why 99% of companies aren’t meeting their own expectations for MTTR.
My second hypothesis also comes from data within the survey: The observability tooling from alerting to exploration and triage to root cause analysis is failing engineers. Here’s why I say that: When we asked individual contributor engineers what their top complaints about their observability solutions were, nearly half of them said it requires a lot of manual time and labor. Forty-three percent of them said that alerts don’t give enough context to understand and triage the issue.
If the tools they’re using to repair the issues take a lot of manual time and labor and don’t provide enough context, that is likely a large part of what is causing companies to miss on their MTTR targets.
Many leading organizations are turning to cloud native to increase efficiency and revenue. The respondents in this survey reported that their environments were running on average 46% in containers and microservices, a number they expected to increase to 60% in the next 12 months. While that shift has brought many positive benefits, it’s also introduced exponentially greater complexity, higher volumes of data and more pressure on engineering teams who support customer-facing services. When managed incorrectly, it has significant negative customer and revenue impacts.
This isn’t a controversial statement. The overwhelming majority of respondents in the survey (87%) agree that cloud native architectures have increased the complexity of discovering and troubleshooting incidents.
If we dissect the MTTR equation, it might look something like this:
I would start by attacking each piece of the equation and finding ways to shorten each step:
Want to dig into this data further? The whole study that this data is based on is available here. I encourage you to give it a read and reach out to me on Twitter or Mastodon if you have any additional insights to share.
Request a demo for an in depth walk through of the platform!