The State of Cloud Native Observability
Plenty of solutions call themselves cloud native observability, and plenty promise cost efficiency. In a cloud native world, while navigating a growing sea of data and engineering burnout, it can be hard to find the right tool to give you the insights you need, when you need them, all the while supporting your engineers.
In this episode of EM360’s podcast, Chronosphere Head of Product and Solutions Marketing, Rachel Dines, sits down with Torsten Volk from EM360 Tech to nail down what true cloud native observability looks like.
Cloud native observability vs. traditional tools
Torsten: Welcome to the EM360 podcast, where we have a weekly conversation with people who are impacting the enterprise tech landscape. I’m Torsten Volk, I’m a managing research director at Enterprise Management Associates. And today I’m joined by Rachel Dines, Head of Product and Developer Marketing at Chronosphere. Today, we are going to talk about the state of cloud native observability. Rachel, cloud native observability: what is so special about it? How is it different from monitoring as we know it?
Rachel: That’s a great question, Torsten, and thanks for having me, by the way. It’s great to be here. Before I jump into that, I think we should level set on what observability is, because that’s not even something that there’s a ton of consensus on. And then we can talk about what cloud native observability is.
So, observability, at least as I see it, and I’m curious to have a discussion about whether you agree, is basically solutions that help engineers remediate issues in their infrastructure and their applications as quickly as possible, right? And the way that these tools do this is by first making sure that the engineers know about the problem, ideally before the customer knows. Then, they help triage it, so engineers can really dig into what service is impacted, how bad it is, how many customers are impacted, and, if this is an on-call situation in the middle of the night, whether I’m waking other people up. And then, after remediating, doing root cause analysis and really understanding the problem in a lot more depth. That is, at least, how we think about observability at a high level.
Torsten: So, instead of that old separation of monitoring, logging, and tracing, we’re thinking about objectives that we want to achieve for our business, and we want to enable engineers to proactively approach, or almost attack, those issues, but not for the issues’ sake, right? We don’t care about a red light just for the reason that it’s red. We care about that red light if it prevents us from achieving a business metric.
Harnessing data with the right observability tool
Rachel: Absolutely. I couldn’t agree more. I’m not a fan of thinking in terms of metrics, logs and traces. Those are the inputs, right? That’s how you get to what you need. But it’s all about orienting this around the outcome you want. That’s observability as a whole. And then, if we scope it to cloud native observability, these are solutions that are specifically built for cloud native: speed, scale, ephemerality. And by the way, when I say cloud native, I’m using that as shorthand for containers, microservices, and decentralized distributed architectures in both infrastructure and applications. Cloud native observability solutions have to be incredibly scalable and reliable to meet the speed, scale and complexity of cloud native architectures without compromising on reliability or performance. They have to be really focused on helping engineers get through massive data growth and massive amounts of metrics, logs and traces.
Because cloud native environments produce a huge amount of telemetry data. And we need this data to effectively operate and troubleshoot our systems. But, if we can’t harness it with our observability tool, we’re going to be lost, really fast. And then the third criterion I have is native compatibility with open source standards, because this is a shift we’re seeing significantly in the industry. As people are adopting open protocols and open standards like OpenTelemetry and Prometheus, we’re starting to see that become something that is incredibly important for cloud native observability.
Torsten: It is that seamless scale-out capability, an architecture that allows you to just run your application in your own data center or even under your desk. Tomorrow, I’m going to scale maybe one aspect, right? Not even the whole thing, but I can scale the different microservices independently, depending on how they are used and how I want them to be used. I don’t have to worry about the observability platform missing what I’m doing and adding to my risk, so that I carry a lot of risk that the corporation is not aware of because it doesn’t know that I scaled out.
Troubleshooting and compatibility
Rachel: That is the thing that we have been seeing consistently as enterprises adopt cloud native – when they try and take their previous generation observability tools with them, they can’t make the jump. And like you said, it’s because they’re missing data. They can’t keep up with the speed, they’re not compatible with the right standards. So that’s why we’ve been seeing more and more people carve out cloud native observability as a specific subcategory.
Torsten: And there is another big aspect that I’m seeing around cloud native observability, and that is bringing together developers and operators, those traditional factions. Typically, when I look at the market landscape as an analyst, I see products that are adopted and accepted by operators. And then, there’s another group that are adopted and accepted by developers. But what I really need is both developers and operators working together to advance the organization, right? So, building one platform where they can really work hand in hand: developers get easier instrumentation and learn about the impact of their code, and the operations people are able to monitor that stuff without overlooking anything, and without having to change the code or necessarily understand all of the code.
Rachel: Yeah, you’re spot on. And I think what we are talking about really is DevOps, right? The DevOps movement has been around a relatively long time, longer than cloud native. Let’s say DevOps has been around for maybe 10 years. Cloud native is really only maybe five years old. But as people have adopted cloud native, DevOps changes its form and shape a little bit, and we start to see increasing pressure on engineering. And to your point, a lot of the tools out there are not built with the mindset of engineers operating and troubleshooting their own code, right? They’re built more with the mindset of “someone builds it and someone else operates it.” And that’s just not the reality in modern enterprises today.
Torsten: Yeah. And that actually brings me to our next topic, and that is burnout. There is a relationship between cloud native observability and burnout. And, in fact, it can help prevent that burnout if it’s implemented right.
The Cloud Native Observability Survey
Rachel: Yeah, I completely agree with that. I wanted to talk a little bit today about a recent survey that we just fielded to 500 engineers who are involved in observability. It was a mix of individual contributors and managers, but we found some really depressing statistics that came out of this survey, and I think a lot of this does point back to burnout.
What we found was that the average engineer spends 10 hours a week on troubleshooting. So if they work 40 hours a week, that’s 25% of their time spent troubleshooting. The downstream impact of that is, because they’re spending so much time on this kind of repetitive and unrewarding work, they’re not happy, and they’re burning out. And this survey found that because of that troubleshooting, one in five engineers wanted to quit their jobs. So if you need an example of how observability, when it’s not done well and doesn’t give engineers the tools they need, leads to burnout, this is it right here. If you have a great observability tool that can get engineers to the data they need faster during incidents, and help them get back to their lives, you can reduce that amount of time troubleshooting and make them a lot happier.
Torsten: And for the majority of organizations, that is a surprising new notion, right? They still have their separate budgets for logging and for security. And, it often shows that they’re not actually implementing that collaborative, holistic approach that is focused on the business outcome, and that makes it easier for developers to deliver features to customers and to build better products. Companies have often not internalized that.
Rachel: No, they haven’t. And even with things as simple as alerting, that’s the tip of the spear, right? To get an actionable alert. And, we found in this survey that 59% of the engineers said that half their incident alerts were not actually helpful. So, let’s just go back to basics here. Let’s get some of this stuff more tuned to how modern cloud native engineers need and want to work. I think that’ll go a long way.
Torsten: Yeah. And really not caring about infrastructure or networking for the infrastructure or networking’s sake, but caring about those components for the sake of our business, and how our application will perform for our customers. I think that is the key. And that also leads me to my next topic and question: how can cloud native observability lead to better customer retention? How can it make customers happier?
Navigating high customer expectation waters
Rachel: Well, I think we all know that customer expectations are incredibly high. They’ve been high for a long time. One trend we’ve been seeing more recently is that not only are expectations high, but consequences are high, in that customers are more likely to leave after having a bad experience with a product. Whether it’s B2B or B2C, there are a lot of challenges with customer retention these days. And what we found, coming back to that survey, is that 99% of companies were missing their mean time to repair targets. So, we asked: “how long on average does it usually take your company to repair an issue?” And then we said: “what’s your target mean time to repair an issue?” And only 1% of the 500 respondents answered that they had met or exceeded their mean time to repair goal.
So, coming back to customer experience: if I am trying to access a service or I’m trying to buy something, whether I’m a consumer or I’m trying to access a B2B service, and it’s down for hours and hours at a time (the actual MTTR that we found in the survey was 7.8 hours), this is obviously going to be customer impacting.
Torsten: And the interesting thing is, you still have these false goals. For lack of better words, you still have that lack of prioritization, where people are just trying to get those yellow and red lights off their dashboards without really knowing the business impact. And also without even knowing, “what is the priority?” “What is the order that we should fix those things in?” If there are multiple competing priorities, they cannot map it all the way through. If my S3 bucket fails, or somebody deletes something in my S3 bucket and my app doesn’t work properly, teams often cannot see the impact on their customers in their observability tooling. Organizations in general often cannot prioritize based on that business impact. And that’s what’s holding them back, and that’s what causes those long mean times to repair.
Rachel: Speaking of the business impact, that’s something that we’ve seen a lot of our customers actually doing: trying to tie together their infrastructure observability, application observability, and business observability. One of our customers, DoorDash, has executive dashboards within Chronosphere that look at the basic infrastructure, what’s going on in their containers and in their Kubernetes environment. They have application metrics for their microservices. And then, they can also see in the same dashboard how many orders are being placed and how many Dashers are on the road, and can tie all of this back to: “wow, there’s a problem in the infrastructure and I can see the impact in the business as well.” And that’s just so important given how fast-paced modern businesses need to operate.
You can watch the rest of this part at 14:27.
How Chronosphere is different from the pack
Torsten: So, Rachel, when I attend conferences, like industry conferences, I usually see dozens of observability companies on the floor. It is one of the hottest topics today. How do you [Chronosphere] set yourself apart from the rest of the pack? What is your angle that you’re doing better than anybody else?
Rachel: I’m glad you asked. Chronosphere is a SaaS solution, and it’s the only observability platform that puts customers back in control of cloud native complexity and data explosion. One of our biggest differentiators is the ability to have total control over the efficiency of data. So, we talked a little bit earlier about data growth in a cloud native world. And coming back to that, when you make this change from a more traditional, monolith and VM-based environment to a cloud native, microservices and containerized environment, a lot more data is produced on the metrics, logs and traces side. It leads to a lot more complexity. It’s a necessary evil of cloud native, but it gets incredibly overwhelming. So, with Chronosphere, we have this capability, called the Control Plane, which is incredibly unique, and it gives customers the power to transform their observability data based on their need, their context, and their utility. It shapes the data into a format that you can actually understand, and it controls runaway costs. So the average Chronosphere customer is actually able to transform their data and get a 48% reduction in their data set. That is huge.
Torsten: So where does this reduction come from? Is that infrastructure cost reduction?
Rachel: So, it’s the reduction in the amount of data that they’re storing. Customers send us all of their data, they transform it, and then they only pay us for the transformed, persisted data that they keep at the end. So if they are able to not store half of their data, that’s half their data that they’re not paying us for. And the way that they get there is not filtering or dropping or anything like that. It’s transforming and aggregating, right? It’s performing mathematical computations and changing sample rates, and basically letting them allocate more resources to data that matters more and fewer resources to data that matters less. Your lab data is not as important as your critical production data. So, it just makes sense that you can make those decisions at a business level, and then that impacts the way your observability solution behaves.
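Editor's note: the kind of aggregation Rachel describes can be sketched in a few lines of Python. This is a minimal illustration, not Chronosphere's Control Plane or its API. The function name `aggregate_by`, the label names, and the sample values are all hypothetical. It rolls per-pod series up into per-service series, so fewer series are stored while the per-service totals are preserved.

```python
from collections import defaultdict


def aggregate_by(samples, keep_labels):
    """Sum sample values across every label that is not in keep_labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        # The new series identity is built only from the labels we keep.
        key = tuple((name, labels[name]) for name in keep_labels)
        totals[key] += value
    return dict(totals)


# Hypothetical request counts, one series per pod.
samples = [
    ({"service": "checkout", "pod": "checkout-1"}, 120.0),
    ({"service": "checkout", "pod": "checkout-2"}, 80.0),
    ({"service": "search", "pod": "search-1"}, 50.0),
    ({"service": "search", "pod": "search-2"}, 70.0),
]

per_service = aggregate_by(samples, keep_labels=["service"])

# Fraction of series that no longer need to be stored.
reduction = 1 - len(per_service) / len(samples)

for key, total in sorted(per_service.items()):
    print(dict(key), total)
print(f"series reduction: {reduction:.0%}")
```

In this toy data set, four per-pod series collapse into two per-service series. The per-pod detail is traded away, which is the sort of decision Rachel suggests making based on how much a given data set matters to the business.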
Taking back control of your data
Torsten: Yeah. And that takes the risk out of experimenting with data where it’s just very costly, right? If you hook up more and more data streams and you have to analyze everything and pay for all of it, then you’re not experimenting. You’re not trying to find those hidden correlations and those magic patterns in end user behavior that transform your business, or at least can significantly move your bottom line, right? Customers don’t have to be afraid of turning over every stone to be successful.
Rachel: Yeah, that’s absolutely a big part of it. I think the other part is very practical and tactical. As companies’ SaaS observability bills get out of control, the only way that they can combat it is by removing the agent from systems. So, they’re flying blind. Or, they end up having one more expensive, better tool for their important stuff, and then a cheaper tool for their less important stuff. And then they just don’t have that central visibility that they need. So, people start to do really unnatural things to try and reduce their observability costs, which are high. And this just gives them complete control to actually do this intelligently. They can control their observability costs in a way that actually aligns to their business needs.
Torsten: Yeah. You were just mentioning high priority and lower priority, but sometimes it’s just a relationship between seemingly innocent streams of data that can have a high priority impact, a big impact on your business. And you can only find out about that if you really get the big picture and ingest it all. And then you make the decision based on data science, where, in the background, you figure out, “are there interesting correlations?” “Is there something that makes sense to explore further?”
This is really, to a big degree, a data science problem that you guys are solving on behalf of the customer. You are seeing that there’s a large amount of data, and there’s no information in it. So, you automatically downsample that data, because the higher resolution doesn’t get me anything. But the downsampling can save me petabytes worth of data per year. And, that translates into hundreds of thousands or even millions of dollars for observability.
Collecting data and downsampling
Rachel: That’s exactly it. But, the important thing is we want to collect all the data at the finest granularity that customers want to send us, and after we’ve collected it, then we can analyze it and figure out what to do with it. So, there’s no sampling, right? When I talk about downsampling, it’s after we’ve collected the data. We might collect it at a 15 second interval and then downsample it to 15 minutes. But, we make those decisions post collection.
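Editor's note: as a rough illustration of post-collection downsampling, the sketch below averages data points collected every 15 seconds into 15-minute windows. It assumes simple mean aggregation and a hypothetical `downsample` helper with made-up data. The transcript does not say which aggregation function Chronosphere applies, so treat this as one possible approach rather than the product's behavior.

```python
def downsample(points, window_seconds):
    """Average (timestamp, value) points into fixed-size time windows.

    Returns a list of (window_start, mean_value), sorted by window start.
    """
    buckets = {}
    for ts, value in points:
        # Align each timestamp to the start of its window.
        start = ts - (ts % window_seconds)
        total, count = buckets.get(start, (0.0, 0))
        buckets[start] = (total + value, count + 1)
    return [
        (start, total / count)
        for start, (total, count) in sorted(buckets.items())
    ]


# One hour of hypothetical data, collected at a 15-second interval.
raw = [(t, float(t % 60)) for t in range(0, 3600, 15)]

# Downsample after collection to 15-minute resolution.
coarse = downsample(raw, window_seconds=15 * 60)

print(len(raw), "raw points ->", len(coarse), "downsampled points")
```

Because every raw point is collected first, the choice of window size can be made, or changed, after the fact, which is the distinction Rachel draws between downsampling and sampling at collection time.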
So, there are a few other quick things I just wanted to call out. The Control Plane is so important, but it’s not the only thing. Other big differentiators for us are reliability and availability. We’ve got the highest contractual SLA of any SaaS observability solution out there. We guarantee three nines of uptime.
We’ve historically actually delivered four nines. This is much more available and reliable than any other solution out there. The performance and scalability are really unmatched: the ability to very quickly get engineers to the data they need, even at large scale. We work with all different sizes of customers, but we’ve been proven up to 2 billion data points per second.
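Editor's note: to put "three nines" and "four nines" in concrete terms, the short sketch below converts a number of nines into the downtime it allows per year. It assumes a 365-day year and continuous measurement, and the helper name is hypothetical. Actual SLA terms, such as measurement windows and exclusions, are defined by the contract and are not covered in this conversation.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines):
    """Yearly downtime permitted by an availability of `nines` nines.

    Three nines means 99.9% uptime, so 1/1000 of the year may be downtime.
    """
    return MINUTES_PER_YEAR / 10 ** nines


three_nines = allowed_downtime_minutes(3)  # 99.9% availability
four_nines = allowed_downtime_minutes(4)   # 99.99% availability

print(f"99.9%:  {three_nines:.1f} minutes per year")
print(f"99.99%: {four_nines:.1f} minutes per year")
```

Under these assumptions, three nines allows roughly 8.8 hours of downtime per year, while four nines allows a little under an hour.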
The last piece, and I don’t want to belittle it, but we don’t have enough time to dig into it today, is really that it’s usable for everyone, whether they’re a casual user or an expert. And what we’ve found with a lot of other observability tools out there is they tend to cater to one audience or the other: it’s just for ninjas, the people that can go to a blank command line and know exactly what to type in, or it’s just for beginners, and then the experts don’t want to use it. And we’ve been able to find this nice balance where it’s a democratized tool. Whether or not you know exactly what trace you’re looking for, you can get to what you need within the solution in a unified way across metrics and traces.
Torsten: That’s a very important aspect of using data for the good of the business: that everybody can contribute, and everybody has certain ideas of what could be the case, what could be reality, and what could be fiction. And most of us, in the past, were never able to really dig into the data and get to the bottom of that, right? As a developer, for example, I may never know the impact that my code has for a specific type of user on a certain mobile device at a certain time of day. You can get as specific as you want, but every user role, from developer to cloud engineer, DevOps engineer, or security engineer, has its theories, and we can all dig into data and then get more information, make more informed choices and decisions, and next time crank down our MTTR even further or increase the number of transactions on the same amount of hardware because we found out that we created some overhead that we can easily eliminate with a couple of lines of code or by leaving out a couple of lines of code. So there are a lot of things with big business impact that we can find if we bring everybody together in data science.
You can listen to the rest of the chat starting at 24:02