Observability talk filled the halls at KubeCon + CloudNativeCon NA 2022 in Detroit – increasing curiosity around the already hot topic. Chronosphere’s Field CTO Ian Smith stopped by theCUBE’s studio to chat with Lisa Martin and John Furrier.
The three discussed observability and how Chronosphere is delivering a new approach, one designed around how cloud native engineering teams need to work and aimed at reducing burnout.
Debunking the three pillars of observability
Lisa: Talk about the traditional three pillars approach toward observability. What are some of the challenges with that, and how does Chronosphere solve those?
Ian: People think of the three pillars as logs, metrics, and traces. What do you do with that? There’s no action there. It’s just data. You collect this data, you go put it somewhere, but it’s not actually talking about any sort of outcomes. That’s really the heart of the issue — you’re not achieving anything. You’re just collecting a whole bunch of data — where do you put it? What can you do with it? Those are the fundamental questions. One of the things that we are focused on at Chronosphere is, “What are those outcomes? What is the real value of that data?”
For example, thinking about the three phases of observability … when you’re trying to investigate something through observability, you probably want to know what’s going on. You want to triage any problems you detect. Finally, you want to understand the cause and be able to take longer term steps to address the problems.
Why observability must be accessible
John: What do customers do when they start thinking about observability? When they get over their skis and realize that they’re really not taking the right approach? What’s going on with the customer, the good and the bad?
Ian: The bad side is when you’re buying a lot of things or implementing them, even in open source or when self-building, and the environment is very disconnected; [this means] you don’t have a workflow, you don’t have a path to success. If you ask different teams, “How do you address these particular problems?” they’re going to give you a bunch of different answers. If you ask about what their success rate is, it’s probably very uneven.
Another key indicator of problems is always needing a particular senior engineer to help answer particular performance problems. It’s a massive anti-pattern. Senior engineers need to be focused on innovation and competitive differentiation, but then they become the bottleneck. And you have this massive wedge of less experienced engineers — but no less valuable in the overall company perspective — who aren’t effective at being able to address these problems because the tooling isn’t right, the workflows are incorrect.
John: The senior engineers are getting pulled in to fix and troubleshoot what the observability data did or didn’t say.
Ian: It’s the promise of observability. A lot of people talk about unknown unknowns and there’s a lot of crafting complex queries. It’s a very romantic sort of deep dive approach, but realistically, you need to make observability very accessible.
The hidden costs of engineering
John: There are real hardcore costs that might be under the water, so to speak, like labor, senior engineering time … can you quantify and share an example or illustrate where the hidden costs are?
Ian: Hidden costs are actually far more important than the hard costs of infrastructure and licensing. There are many organizations out there using open source or observability components together and they think, “It’s free. No licensing costs.” But think about those outcomes.
Case in point: You have 15 teams and x number of incidents a month, and you pull a representative from every single one of those teams when there’s a problem. But it turns out that only two teams were required to remediate the issue. There are 13 individuals who did not need to be on the call. [Even if] I met my SLA and MTTR, from a competitive standpoint I’m comparing myself to a very similar organization that only needed to impact two engineers versus the 15 that I had over here. Who is going to be more competitive, more differentiated?
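(Editor’s sketch: the arithmetic behind this example can be made concrete. The interview gives only the 15 teams paged and the two teams actually needed; the incident count and call length below are hypothetical placeholders, since Ian leaves them as “x,” so treat the output as illustrative rather than as a figure from the conversation.)

```python
from dataclasses import dataclass


@dataclass
class IncidentProfile:
    teams_paged: int           # from the interview: 15 teams pulled onto the call
    teams_needed: int          # from the interview: only 2 were required
    incidents_per_month: int   # hypothetical: the interview leaves this as "x"
    hours_per_incident: float  # hypothetical: average time each person spends on a call


def unneeded_responders(profile: IncidentProfile) -> int:
    """People on each call who did not need to be there."""
    return profile.teams_paged - profile.teams_needed


def wasted_engineer_hours(profile: IncidentProfile) -> float:
    """Engineer-hours per month spent by people who were not needed."""
    return (
        unneeded_responders(profile)
        * profile.incidents_per_month
        * profile.hours_per_incident
    )


# Assumed values for illustration only: 10 incidents a month, 1.5 hours each.
profile = IncidentProfile(
    teams_paged=15,
    teams_needed=2,
    incidents_per_month=10,
    hours_per_incident=1.5,
)

unneeded = unneeded_responders(profile)
wasted = wasted_engineer_hours(profile)
print(f"{unneeded} unneeded responders, {wasted} engineer-hours a month")
```

Under these assumed inputs, the SLA and MTTR can look identical to those of a competitor that paged only two engineers, while the hidden cost shows up entirely in the hours figure.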
The hardest thing for VPs of Engineering to do is acquire and retain engineers. So why burn them out unnecessarily when you can achieve the same or a better result by thinking more clearly about your observability [strategy]? Reduce the number of people involved, reduce the number of senior engineers involved, and ultimately have those resources more focused on innovation.
Platform efficiency leads to developer efficiency
John: Platform engineering is the hottest topic at this event — it is becoming that new layer that enables developers.
Ian: Organizations really think about developer efficiency, or developer productivity, because it’s about the outcomes. It’s not that they are saying, “We just need to keep the site reliable.” As we talked about, there are many different ways that you can burn unnecessary resources. But if you focus on developer efficiency and productivity, there’s retention and there’s competitive differentiation.
DoorDash and Chronosphere: taming the data explosion
Lisa: While observability certainly helps a company reduce churn and attract more talent, can you talk about some of the business outcomes in the context of customer experience?
Ian: DoorDash is a great customer example. They are at KubeCon talking about their experience with Chronosphere: cloud native technologies, Prometheus, and other components that align with Chronosphere. DoorDash is a cloud native organization, but they were going through a transformation from StatsD to very heavy microservices, with very heavy Kubernetes and orchestration. They did that during a massive explosion of the business, particularly during the last couple of years. This was hard to do in a cost-effective way.
(You can watch this video to dive into DoorDash’s journey from StatsD to Prometheus with 10 million metrics/second.)
Lisa: As we wrap here, tell us about when you’re in customer conversations, what is the key factor behind Chronosphere’s success?
Ian: That we’re not fixated on technical features and functions, or frankly gimmicks, like “What could you possibly do with these three pillars of data?” It’s more about what we can do to solve organizational pain at a high level. Things like: What is the cost of the solution? Also, on the individual level, what exactly is an engineer trying to do, and how can our tooling improve their quality of life? This is something I’m very passionate about.
This interview has been edited for length and clarity. There is much more to glean from the conversation between these three experts.