While cloud native has become the architecture of choice for modern companies, organizations and teams are still struggling to reap its benefits: speed, efficiency, and reliability.
Last month, our Head of Product and Solutions Marketing, Rachel Dines, sat down with one of our Product Managers, Karan Malhi, to chat about all things cloud native, and of course, Chronosphere.
Using data collected from our 2023 Cloud Native Observability Report survey, Rachel and Karan discuss current cloud native challenges with complexity and troubleshooting, and how observability is here to help.
If you haven’t watched the webinar yet, you can read the transcript below.
Cloud native landscape: From human impact to architecture challenges
Rachel: Welcome everyone. I’m very pleased to be here today. I’m Rachel Dines, Head of Product and Solutions Marketing here at Chronosphere. I’ve been here for about two years, and I am very pleased to have my colleague Karan Malhi with me today.
Karan: Thanks Rachel! Hi, everyone. I’m Karan Malhi. I’m a Product Manager at Chronosphere, and I’m completing a year [at Chronosphere] by the end of January. I’m very excited for this webinar.
Rachel: Awesome. So, let me walk you through the agenda. We’re going to start with an introduction to the survey that we did and help you understand the methodology, because what we’re going to be diving into in this benchmark is based on a lot of data. I’m a data nerd — I think a lot of you likely are, and Karan is as well. So it’s good to understand how we collected this data and what the approach was.
Rachel: The three main sections we’re going to talk about are, first the human impact of cloud native and how cloud native, as a technology and architecture, starts to impact teams and engineers. Then, we’re going to look at the challenges that companies are facing with their observability tooling and, especially as they adopt cloud native, how those challenges evolve. [Finally], we’re going to look at what the benefits are of best in class observability.
For the companies that are really leaning into observability and have implemented it really well, what kind of benefits and best practices do they get out of it?
Current cloud native adoption trends
Rachel: To begin, like I mentioned, the data we’re going to share today is based on a survey that [Chronosphere] fielded with a neutral third party to 500 engineering professionals in the US. We fielded the survey in late 2022 and then analyzed the data in early 2023.
Everyone who took the survey had some degree of familiarity with observability, and I wanted to quickly share the demographics of the types of companies and individuals who answered the survey … just to show that it’s from companies of all different sizes — small, medium, and large — and from people at different levels in the organization: individual contributors, team leads, managers, and executives. Throughout the survey, we’ll be showing a cohort across the entire organization.
Rachel: The other thing that I wanted to talk about is cloud native as a baseline and cloud native adoption. I’m assuming you have an interest in cloud native and observability, but it’s such an interesting term because a lot of people define it differently. I wanted to quickly level set and talk about what I mean when I say cloud native. I use it as shorthand for microservices and containers, right? It’s distributed architecture, distributed applications, and distributed infrastructure. It might be that a lot of people are using Kubernetes, but it doesn’t have to be: it’s containerized, it’s serverless on the infrastructure side, and it’s microservices on the application side.
I was really interested to see when we asked people what percent of their environment was cloud native today and what they expected it to be in 12 months, the average portion of the environment that was cloud native was 46% today. And companies expect that to grow to 60% in 12 months. Clearly, there is a lot of aggressive cloud native adoption. The other thing that I thought was very interesting, was that only 3% of companies said they had zero cloud native use today. So, 97% of you on the phone today have cloud native in your organization, and it’s growing really quickly.
The reason behind rapid cloud native adoption
Rachel: I want to quickly ask you, Karan, what do you think is driving this rapid adoption of cloud native — from your perspective in talking to our customers?
Karan: I think the biggest thing is operational efficiency; having the ability to reduce time to value and time to market. You have the same number of people, but you’ve adopted these technologies to operate faster. And how do you combine those two things, along with the processes, to ship the product to market faster?
Rachel: I completely agree with that. It’s speed, time to market, time to value, and scale — those are some of the biggest drivers. Now we know what cloud native is, we know why people are adopting it, and that it’s being adopted very quickly in many organizations. Let’s talk a little bit about the human impact. A lot of this is going to explore what happens as we adopt cloud native: as we migrate more and more services, re-architect, and more of our estate becomes cloud native, we’ve found that there are a lot of unintended impacts — I don’t want to say consequences, because cloud native is obviously the future; this is where most organizations need to go. But we do need to watch out for what happens in the organization when we adopt cloud native. One of the biggest impacts that we’ve seen is on people and the amount of time they end up spending on low-level tasks like troubleshooting.
The troubleshooting time waste for engineers
Rachel: One of the first things we found was that engineers are spending on average 10 hours a week troubleshooting. If someone’s working a 40 hour week, that’s 25% of their time. But, what’s worse is that we believe that the majority of this is actually unplanned troubleshooting. Karan, I know you have a lot of opinions on the topic of unplanned interrupts.
Karan: Any interrupt-driven work has a much bigger impact than we think it has. There are well-known studies showing that for every interrupt — if you’re zoned in, working on something important, and you get interrupted — it takes you about 20 to 25 minutes to refocus.
People who work on troubleshooting generally don’t have enough context. So, when they get interrupted and called in to help with troubleshooting, it’s just a much bigger impact. And the result is that people literally dread being on call. They’re stressed out, they don’t want to be on call, and they end up losing a lot of time from their main work of delivering a product.
Rachel: Yeah, absolutely. You talked about the impacts of on-call troubleshooting. I will share what we found in the survey. We asked respondents: “what’s the impact of spending so much time troubleshooting? What’s the impact of spending 10 hours a week or more?”
Fifty-four percent said they’re frequently stressed out. Forty-five percent say they don’t have time to innovate. They came to this role to solve tough problems, and instead they’re stuck in the weeds troubleshooting and triaging. Twenty-nine percent want to quit. This is an extraordinary number. I specifically looked at individual contributor engineers for this data cut. I’m assuming that most executives, most VPs and CTOs, are not spending time troubleshooting. If they are, you might have other, bigger organizational problems. But this was what we were seeing from the results: almost a third said that this made them want to quit. That’s a pretty depressing statistic.
Treating your teams as customers via alerting context
Karan: Forty-five percent say alerts don’t give enough context to understand and triage the issue. That’s a big number and that’s a result of these microservices. The very things which make you go fast, actually slow you down. Most of the alerts, they work for my team, but they may not work for your team. For example, you’re building a service. I’m dependent on the service, you’ve tested the alerts, or I’ve tested the alerts and my team understands those, but do the rest of the teams understand those alerts? So, we literally have to treat other teams as customers of our teams. It’s not enough for us to collectively treat our external customers as customers. We have to treat each other as customers and make alerts and make those contexts available to each other.
Rachel: Yeah, absolutely. You’re jumping ahead a little bit into the next piece, which is how we’re going to solve some of these challenges, but I completely agree: being on call is a necessary evil, right? But it doesn’t always need to be this painful.
Thank you to all who voted in the on-demand poll. It’s interesting to see that these results are similar to what we found in the survey: the top response was that teams are stressed out all the time. The next most common response, at 34%, was that engineers can’t work on what they want to work on, right? They keep getting pulled into firefighting, so they can’t work on solving the tough, interesting problems. And then 28% say you’re not meeting deadlines. So this all comes back to speed and innovation and trying to move faster with cloud native. Twenty-five percent of you say your personal life is negatively impacted.
Why the troubleshooting workflow is broken
Rachel: Clearly, there’s something not right here. How are we going to move forward and start fixing this? Well, I’m going to bring you a little bit deeper into the depths of despair before we get to solutions. Part of the reason we believe engineers are spending so much time troubleshooting is that the alerts they get don’t provide enough context or enough information to solve the issues, just as Karan was saying.
So, for example, seventy-three percent of IC engineers say that half the alerts they get from their observability team are not usable, and forty-three percent say they don’t get enough context. They’re spending all this time troubleshooting in a fair amount of pain. I think of the observability workflow as starting with: first you need to know there’s a problem, then you can triage it, then you can remediate it and understand it. If the very first step of that workflow — knowing there’s a problem, getting an alert — is broken, the whole rest of the workflow is going to fail and take a lot longer. Karan, help me understand a little bit more about why alerts are not generally helpful. What are some of the reasons you typically see?
Karan: We do all this work in building and delivering services, and we do it fast. But do we spend enough time curating the information that is actually going to be useful when somebody else is on call? The answer is no. Either we don’t have enough time because of all the interrupt-driven work we have, or we just don’t think about that part of it. So we need some sort of curation of information that is actually going to be useful for people outside of my team. And that’s what I said earlier: treat other teams as customers of your services. Just building APIs is not enough. Having that run in production, and enabling other teams to take care of your service when you’re not around — that’s the most important part. And one big part of that is providing context within the alert itself. Alerts are small; you can’t cram all the context into one place, so you have to find another place for it — maybe spend more time curating dashboards that have all the context in them. So it’s a process and a mindset shift: treat other teams as customers.
Continue this part of the conversation at 13:07
Know, triage, understand
Rachel: That brings me to the next topic, which is top observability challenges. I hinted at this a minute ago, but we should probably also level set and define what observability is, because even more than cloud native, I think there’s a huge amount of confusion and many different definitions of observability. I don’t want to get into a religious war about the definition, but maybe explain how I see it, and how we see it at Chronosphere, so that you have context when we talk about observability and the challenges. Generally, we see it as the process of first knowing that something has gone wrong or is about to go wrong.
And you need to know that something is wrong before a customer knows, before it impacts the customer — whether it’s an internal or external customer — because then you need to go triage it. Figure out how bad it is, how many teams are impacted, how many customers are impacted. If this is the middle of the night in an on-call situation: “Am I waking other people up? Or can it wait till the morning?” What’s the urgency? What’s the priority? From there, you want to jump into remediating, or stopping the pain, as quickly as possible. And once you’ve remediated, you can take a beat, take a breath, go back and do root cause analysis, truly understand why the issue happened, and make sure it’s fixed so it doesn’t happen again.
Rachel: So again, this know, triage, and understand workflow is really how we look at observability. It’s powered by data like metrics, traces, logs, events, but those data points or those pieces of data don’t in and of themselves truly give you observability.
The struggle with complexity and cardinality
Rachel: When I think about observability in this context, it can be very challenging to implement in a cloud native world. We’ll talk about this a little more in a second. When we asked in our survey whether observability was more difficult when you adopt cloud native, 87% basically agreed — they said yes. When you’re trying to troubleshoot, discover issues and incidents, and go through that know, triage, understand life cycle, it’s more complex in a cloud native world.
There are so many different interdependencies: many different microservices, many different containers. How are they all connected? How are they all dependent on each other? But there’s also this other component to cloud native called cardinality — and that really drives a massive amount of complexity in environments. So, one of the things we specifically asked in the survey was: “Are you struggling with complexity and cardinality? And to what degree?” What we found was, when we broke out the top quartile and the bottom quartile of cloud native adopters, there was a really big difference in how they viewed a problem like cardinality. Among the companies that are far along on their cloud native journey — they’ve been there, done that, and are pretty advanced — more than half said that cardinality and complexity is a huge issue in observability, compared to a little more than a third of the bottom quartile.
Defining cardinality (by lollipops)
Rachel: What I ask you to take away from this is, if you are earlier in your cloud native journey and still trying to figure out what kind of pitfalls you might be facing down the line, know that those of you who are a little bit further ahead are facing this major issue with cardinality. I should explain what cardinality actually means because I’ve used this word about 15 times in the past 60 seconds. It describes the possible unique combinations that you can achieve in a data set.
Here’s how I would explain this to my five year old. If I have 20 lollipops and Karan has 20 lollipops, we both have 20 lollipops. But if mine come in five flavors and three sizes, then my cardinality, or number of unique combinations, is 15, right? I can have a medium cherry or a small lemon or a large grape. There are 15 possible combinations I could have. But if Karan’s 20 lollipops only come in five flavors, his cardinality is only 5. We have the same number of lollipops, but mine have a lot more complexity in cardinality than Karan’s.
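Rachel’s lollipop arithmetic can be sketched in a few lines of Python (purely illustrative — the flavors and sizes here are hypothetical):

```python
from itertools import product

# Rachel's lollipops vary by flavor AND size, so her cardinality is
# the number of (flavor, size) pairs: 5 x 3 = 15.
flavors = ["cherry", "lemon", "grape", "apple", "orange"]
sizes = ["small", "medium", "large"]
rachel_cardinality = len(list(product(flavors, sizes)))

# Karan's lollipops only vary by flavor, so his cardinality is just 5.
karan_cardinality = len(flavors)

print(rachel_cardinality, karan_cardinality)  # 15 5
```

Same number of lollipops in both bags; the extra dimension is what multiplies the unique combinations.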
Rachel: Why should you care about this in a metrics world? Because instead of flavor-and-size combinations, every one of these unique combinations is a data stream — a time series. So, let’s say one engineer decides to add a UUID (universally unique identifier) or pod ID to their metrics; all of a sudden, your cardinality has just exploded. Once again, why do you care — what’s so bad about this? Well, you can’t find what you need, it’s very, very expensive, and it really slows everything down. And cardinality, I think of it as Goldilocks: too much is bad, too little is also bad.
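To see why one extra label can explode cardinality, here’s a back-of-the-envelope sketch. The numbers are hypothetical, but the multiplication is exactly what happens when each unique label combination becomes its own time series:

```python
# A metric emitted per service and per region: a manageable series count.
services = 20
regions = 3
base_series = services * regions  # 60 time series

# Now one engineer adds a pod_id label, and each service runs many pods.
# Every existing series is multiplied by the number of distinct pod IDs.
pods_per_service = 50
exploded_series = base_series * pods_per_service  # 3,000 time series

print(base_series, exploded_series)  # 60 3000
```

With a true UUID per request instead of a pod ID, the multiplier is effectively unbounded — which is why high-cardinality labels make storage and queries so expensive.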
Continue this part of the conversation at 20:35
Let’s talk current observability solution complaints
Rachel: The hard part becomes figuring out: “How do I know what’s useful and what’s not?” That’s obviously the million dollar question, but this is really the crux of why observability is so much more challenging in a cloud native world.
One of the other survey questions we asked was: “What are your top complaints about your current observability solution?” We also wanted to hear from the audience. So, if we could push the next poll, folks can take a second to vote on your single top complaint about your current solution.
Rachel: One thing that I found very interesting, and maybe not surprising, [is that] 0% of leaders had complaints about observability, but 100% of individual contributors had at least one complaint. That’s because the individual contributors are the ones who are in the tooling day in and day out. The leaders, I think, need to spend a little bit more time understanding the challenges their people are facing.
So you can see, in ascending order: 38% said the tool is too slow, right? Once again, coming back to the on-call situation — if it’s the middle of the night, you’ve just been woken up by a page, and you’re trying to load your tool and it’s really slow, it’s incredibly frustrating. You’re losing time, you’re losing sleep. It makes sense that this is a top complaint. Forty-five percent say it takes a lot of manual time and labor — I think this mostly comes down to the feeding and maintenance of the observability solution to keep it continuously up to date, but also likely to building dashboards, building alerts, and getting the information out that you need. It’s just too time intensive. Forty-six percent said too much data. This comes back to exactly what we were just talking about: more data is not always better, right? If you can’t find what you need, that’s not helpful. And then, 62% said their companies were actively looking for ways to reduce observability costs.
Too much data and meaningless context
Karan: If you look at the slide, this is all about cardinality: slow dashboards because of too much data, a lot of time and labor, and obviously the 46% — too much data. And because of too much data, you have too many costs. It costs to ingest, it costs to store, it costs to query and retrieve the information. And it’s almost like a paradox: you wanted to build context, and you got all the data, but you’re actually lost. There is no context — it’s meaningless context. All this data doesn’t mean anything because I can’t make heads or tails of the data itself.
Rachel: Yeah, meaningless context. It is a paradox, that’s for sure. So, really interesting results here [in the in-webinar poll]. It’s really spread across the board, which to me just speaks to the fact that there are so many different unique challenges that people are facing in observability.
I’ll call out the top responses here. Twenty-four percent of you said: “I get alerts that don’t give me enough context to understand and triage the issue.” Wow. Okay. Well, that was literally exactly what we were just talking about — a really common challenge that we can fix with better process and better tooling, right? There’s no tool that can magically fix all of these problems. There are tools that can help give you more context and make it easier and more natural to add that context. But a lot of this just has to become an organizational muscle.
Rachel: Then, the second most common one: “I frequently get paged after hours for alerts that are not urgent.” So, more alerting challenges. And that makes sense — no one likes that, and we need to figure out a way to triage those better. Then, tied for third place: the tools are slow to load dashboards and queries, and customers report incidents before we know about the issues. It’s a little bit different from what we got from the survey, but to me that just speaks to the fact that there are a lot of different challenges that people are facing with observability.
Cloud native without observability: Don’t be the ostrich
Rachel: Alright, let’s move into the last section, which is the benefits of best-in-class observability. So, if you’re on your cloud native journey and you’re trying to figure out, “Are things going right, or are they not?” — one of the common trends we’ve found is that observability is really a baseline that helps improve many different components of your cloud native stack. And this is not a controversial statement, right? Pretty much everyone agrees that observability is critical for success in cloud native.
And you can see, 87% say observability is essential to success. Seventy-one percent say my business can’t innovate effectively without good observability. The promise of cloud native is speed and time to value. If your system is broken all the time, you’re never going to get to that speed promise. It makes sense that there is this tight linkage.
Karan: It’s like a bird which can’t fly. It’s supposed to fly fast, but it can’t fly. You’re putting in all the cloud native stuff to move fast, but you can’t.
Rachel: Yeah. Cloud native without observability is like a bird that can’t fly. That’s an ostrich. Ostriches are birds that can’t fly. They run really fast on land, but they’re never going to be able to leap over that chasm, right? They’re never going to be able to get to the next stage … so if you’re trying to do cloud native and you don’t have a strong observability strategy, or you’re trying to take your legacy observability or monitoring with you from the old world to the new, you’re acting like an ostrich.
Why does everyone miss their MTTR?
Rachel: If everyone agrees that observability is critical for cloud native success, I’m just very disappointed and disheartened to see that we’re not great at it — 99% of organizations are missing their mean time to remediate (MTTR). In this case, we specifically asked about mean time to repair. So, Karan, why is everyone missing their MTTR?
Karan: I think the way we approach MTTR or observability — we make it a very tech-driven versus a problem-driven thing. If you see how Google actually came up with [its approach to] tracing, they had a real problem. There were thousands or hundreds of thousands of containers, and they wanted to make sure they could trace through all those containers. So, there was a problem-centric approach to solving that for themselves.
We have gotten to this whole notion of three pillars or six pillars, and we try to check boxes with: “Hey, we have metrics, we have logs, we have traces, and if we have all three, then we are done.” Well, if that’s the case, then why don’t we just buy disparate tools?
Karan: And then once we buy them — one for logs, one for metrics, one for traces — the problem is still the same. When we are on call, nothing ties anything together, and we have to jump from tool to tool, which wastes a lot of time. It also increases the amount of learning you have to do across these tools. So, one of the big things is: yes, you can treat it as a tech problem, but the main question is, what is the problem for your application, for your business, that you’re trying to solve? Every company is unique, and their business solutions are unique. They’ll have to find a way to take a more problem-centric approach versus a tech-driven approach.
Rachel: That makes a ton of sense. It’s not box checking, right? Like: “Let me take your order — a log, a metric, a trace.” It needs to be about problem solving. I think that’s definitely a big part of it. The other thing that’s a hot-button issue for me is that nobody knows what MTTR means. Is it mean time to remediate, mean time to restore, mean time to repair, mean time to Rachel? We don’t have a standard definition. We also don’t know when to start the clock on MTTR. Is it from the moment the issue actually happens, or from the moment the alert fires, or the moment you know about it, or the moment the customer reports it?
The concept and approach of MTTR
Rachel: There’s that, and then the other, bigger issue is that means and averages are not that helpful. They don’t show you the outliers. In fact, most people are looking at P99s, not means, for most metrics. So it does seem like we are struggling with MTTR, but it’s both the concept and how we approach it that are not working well for us. I also meant to explain the methodology behind this number at the start, because I know you might be thinking this sounds like a marketer made it up — and I have to admit, it does a little bit. The way we got to this data point was that we asked every respondent in the survey, all 500, what their actual MTTR was.
Then, we asked those same people what their target mean time to repair was. So, we looked at it, and only 1% met or exceeded their target. It was all self-consistent within the same organization. I’m going to show you the raw numbers in a second, because you’re probably wondering: “Is my MTTR good or bad?” I can’t really give you an answer, because I think it needs to be consistent with your own organization. You set your own targets and then you try to meet and exceed them. But, unfortunately, we’re not right now … the main factor that we found impacting this statistic was actually what kind of observability tool people used. We found that, in terms of MTTR hours, there’s a big gap based on the tools people are using — although no one is meeting their targets. Companies that use vendors for observability had both a lower actual and a lower target MTTR. Karan, talk to me a little bit, because you’ve been there. Why do you think this gap exists?
Karan: I think one reality is that it’s the same number of people you have, and they have a set of skills. So, I’ll give you an example. I was making infrastructure for internal developers. Basically, I was a product manager, and we went with the approach of: “Hey, let’s take open source solutions and let’s build our own in-house for a certain aspect.” For logs, we had a vendor-driven solution, but for metrics-driven observability, we built it in-house. You can’t scale that. It’s impossible to scale.
Again, it leads to the disconnected problem: I have one solution for logs, one for metrics, and another one for tracing — and we didn’t even get to tracing, because those 27 people were so consumed with building this internal observability on top of taking care of the infrastructure, the interrupt-driven requests, and so on. So it doesn’t work out.
Karan: In the end, you can’t hire fast enough, or you don’t have the budget. So, you can only go so far with in-house open source (OSS).
Rachel: Yeah, that is very consistent with what I’ve seen from the customers and end users I work with as well. There’s a lot of great reasons to go open source. I mean, just having that futureproof, no lock-in, is so important, especially today. But it doesn’t scale in a lot of organizations.
Vendor tools versus in-house
Rachel: We just got the last poll results back. It looks like fifty-six percent of you are using a mix of vendor tools and in-house tools. That makes a lot of sense — it’s very, very common. It might be split between different teams, applications, or departments; it might be for different use cases; or it might just be that observability is really, really broad, and you need a lot of different tools to do all the things you need to do.
Then about twenty-eight percent of you are using vendor tools only, and sixteen percent of you are using in-house tools only. So, once again, this might be helpful for you as a benchmark, or it might not. MTTR is one of those things where it’s impossible to define what’s good compared to your peers unless you’re using the exact same methodology. I almost hesitated to even show the raw numbers here, because I think what’s more important and interesting is the gap. And you will see that, for vendor observability and in-house observability, that gap — that percentage miss — is almost exactly the same; it’s one or two percentage points apart. So, like I said, MTTR is maybe not even the best thing to be measuring anyway. But it is one yardstick that we can use to understand how we’re doing against our own expectations and requirements.
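One way to express the “percentage miss” Rachel mentions is actual MTTR relative to the target. A minimal sketch, with illustrative numbers only (the survey’s raw figures aren’t reproduced here):

```python
def mttr_miss_pct(actual_hours: float, target_hours: float) -> float:
    """Percentage by which actual MTTR overshoots the target."""
    return (actual_hours - target_hours) / target_hours * 100

# Hypothetical teams: different absolute MTTRs, same relative miss.
vendor_miss = mttr_miss_pct(actual_hours=3.0, target_hours=2.0)
in_house_miss = mttr_miss_pct(actual_hours=6.0, target_hours=4.0)
print(vendor_miss, in_house_miss)  # 50.0 50.0
```

This is why comparing the gap, rather than the raw hours, makes benchmarking meaningful: two teams with very different absolute numbers can be equally far from their own targets.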
Three key takeaways from the chat
Rachel: All right, I’m just going to take a few minutes to wrap up, and then we will have time for your questions. Coming back to the beginning, I think the three things to take away from this are:
- Observability needs embedded context. Without that context, people are going to be running around firefighting and won’t be able to accomplish what they need to, and it’s going to lengthen MTTR. One of the ways you can get to a better place is to treat every team like a customer — every team is every other team’s customer. Taking a little bit of extra time up front to add the context that you need will go a really long way.
- Cloud native drives increased complexity. I don’t think that’s a controversial statement. But if you’re not careful, you can get stuck in the paradox of meaningless context: having too much context is, in some ways, worse than having none. It’s another Goldilocks situation, where you need to find a way to narrow it down to just what’s relevant. There are different tools and capabilities in a modern solution that can help you with this.
- We need speed, we need scale, we need agility. But if you’re not simultaneously investing in observability, you’re going to struggle to realize some of these benefits. You might end up being that flightless bird, the ostrich, right? Where yes, you’re a bird, and yes, you can go fast, but it’s not going to get you from A to B, or where you need to be.
Any additional comments on this, Karan?
Karan: I would say that a lot of organizations are also in the proof-of-concept (POC) phase of adopting cloud native. And it’s okay to adopt OSS to begin with. It’s a good start and a rapid way to complete the solution and complete your stack. Just keep in mind that there are some process changes you’ll have to make. You’ll have to spend more time on curation and on treating other teams as customers, and that should be reflected in your alerts. If you are looking at the longer term, keep in mind that you’re in one region today but might be in multiple regions tomorrow. Would your current solution from the POC phase scale to that? And what are your plans for something like that? So, just keep a couple of those things in mind.
Rachel: Yeah, that’s good advice. Thank you, Karan. And with that, we would love to turn it over to you and take your questions. So, head on over to the Q&A box, type in your question, and we would love to answer it.
Listen to the rest of this conversation, including the Q&A session, at 39:19