James McGuire: Hi, I am James McGuire, and today on eSpeaks, we’re talking about trends and best practices in observability – an emerging technology that enables companies to better understand what’s going on with their technical infrastructure. To discuss that, I’m joined by someone who knows a lot about the topic. With me is Ian Smith: Field CTO at Chronosphere. Ian, very good to have you with us today.
Ian Smith: Thanks, James. It’s great to be here.
James McGuire: So, I’ve got some questions for you, and I really want to talk about the observability market. There’s a lot going on there. Before we get started with that, I’d like to get a sense – please tell us what Chronosphere does. Obviously it’s a well-known company, so many people already know, but for those who aren’t as familiar, what does Chronosphere do for other companies?
Ian Smith: Chronosphere is the only observability platform that’s built to empower DevOps and SRE teams to control the speed, scale, and complexity that come with the technology and organizational changes of a cloud native world, particularly as you adopt microservices and containers at scale.
Like all monitoring and observability technology, the guiding light of Chronosphere is helping engineers resolve their infrastructure and application issues before they affect customer experiences and the bottom line. But we’re also trusted by the world’s most innovative brands – including DoorDash, Snap, and Zillow – to go beyond those fundamentals: reining in costs, improving developer productivity, increasing customer satisfaction, and enabling more rapid innovation to develop and maintain a competitive advantage.
James McGuire: Certainly a market for that, no doubt about it. Let’s talk about the observability market. It feels like it’s come up in the world the last few years. I sense a lot more buzz about the market. As you survey the observability sector going into 2024, what trends do you see currently driving the sector?
Ian Smith: I think there are some interesting ones, not just on the tooling side of things, but across the board. And this, in my mind, refers to the overarching observability strategy that I’m sure we’ll talk about as we go through. There are things like the adoption of open source, particularly in the areas of instrumentation and data collection. Prometheus has been popular for a long time, but things like OpenTelemetry I think are really hitting their stride now, offering customers and engineering organizations a higher degree of control and choice. And that, I think, is also underpinning a lot of what we see on the tool selection side of things as well. Historically, there’s probably been a little bit of a best-of-breed slash “I can assemble things myself” approach, where maybe I have a hodgepodge of different things for metrics, a different tool for tracing or APM, and a different tool for logging and various other things assembled together.
Oftentimes, I hear customers talking about: “Well, actually we have 6, 7, 8, 9 different observability tools that touch that experience.” And there’s a really big focus on tool consolidation now. I think underpinning that is a really key focus on the value I’m getting out of observability, and: “Am I getting the outcomes that I’m looking for?” So, in some ways – and interestingly for those of us working in the vendor space – there’s a high degree of skepticism now about: “Well, what are you providing me other than features and functions?”
I think underpinning a lot of these conversations, particularly in the current economic environment, is: “Am I getting value for money? Am I getting a return on my investment?” So cost, utilization, and value: those are also questions that people are thinking hard about.
James McGuire: Yeah, it seems like a key question, and not to delve too deeply into this, but how do you counter that skepticism when someone says: “Hey, wait a second, do I really need an observability application?” It seems like a no-brainer, but what do you say to customers in answer to that?
Ian Smith: Organizations definitely believe in the value of observability as a concept and obviously need the tooling. But I think the skepticism comes in terms of: “What am I getting beyond the high-level marketing message?” One of the things that I see in my role is a desire for customers to understand: “What am I getting beyond the tools, the features and functions? Am I getting a partner in this?” Obviously there are things like customer support, but what is the thought leadership? Where am I going? Am I going to stay with this particular vendor and see increasing value over the course of a multi-year relationship? Or am I only going to be able to take advantage of some features and functions and deal with low-hanging fruit, where all of the value and differentiation I get is heavily front-loaded in the engagement?
James McGuire: That makes perfect sense. Alright, so I know there are some challenges in the observability sector. For some companies, it doesn’t work perfectly. Some want to do something that’s hard to do. What do you see as common challenges that companies face with their observability deployments?
Ian Smith: As companies think, “I’m not getting value out of something,” you are dealing with issues like the growth of the cost of observability. And this is not just the hard dollar cost. It can be the infrastructure cost. It can be the cost of the people involved in managing and extending the observability function. Just deploying open source observability backends, for example, is not a panacea for handling cost – but handling that cost is really a big challenge. We often see companies coming to us talking about super-linear growth in their costs. To be more specific, let’s say your revenue is growing at a certain percentage. Your infrastructure – say your AWS or GCP costs – may be growing faster than that, because you’re having to invest in R&D and expand your footprint. But then observability is growing even faster than that.
That’s obviously a big concern to a finance organization. It’s like: “Well, our revenue’s growing at a certain rate. Our biggest cost, maybe GCP or AWS, is growing at an even higher rate – and of course the FinOps focus is there. But then observability, which as we talked about is critical for maintaining that customer experience, is growing even faster.” So there’s a very big challenge there. And then, of course, we decide to move, but the friction of moving can be very high because we’ve locked ourselves into a proprietary ecosystem. That ties back into what I said before about the adoption of open source giving you more freedom. But doing those things – making sure that any purposeful action you take in observability ties into an overarching strategy – is something that I’m hearing customers have a challenge with.
It’s not: “I can just buy a tool.” Now I need to think specifically about: “How does it fit into that strategy?” and “Am I really solving true problems there?” On the cost side of things, you end up having to throw a lot of people at the problem, because many vendors are, frankly, disincentivized to help you manage cost. You see the dichotomy of some vendors in the market saying: “Hey, we’ve added cost management features for your cloud – your AWS and your GCP.” And some of the more skeptical reception from the market is: “Great. When are you going to provide similar tooling so that my metrics or my logging bill is also decreasing?”
James McGuire: You’ve talked a bit about how to address those challenges. If a company came to you and said: “Ian, we’re having some struggles,” what are some ideas and best practices and tips for actually handling those challenges?
Ian Smith: Part of my role – not necessarily tactically selling on a day-to-day basis – is looking at where the market is going, having some of those conversations with organizations as they think about the bigger picture, and seeing how companies have been successful in making some of these big transformations as they go down the cloud native path. I really think the core of it is being very deliberate about an observability strategy. One customer we talked to summed it up very well: “We keep buying tool after tool after tool, and we think that we’re buying solutions. But internally, when I asked around about what problem we’re actually trying to solve, we weren’t all on the same page.”
So, how can you possibly buy something thinking it’s a solution for a problem if you haven’t actually defined that problem upfront? At a basic level, the observability strategy is not just tool selection. It’s: “What are we trying to accomplish?” And usually, it’s tied into the business objectives.
It may be that we need to move to the cloud; it may be that we need to adopt cloud native really aggressively. And then, how does engineering facilitate what the business needs? Going from this top-down basis – “What is the value we’re looking to derive from observability?” – allows us to really filter down to what matters. It may be that the ability to control costs is incredibly important. Or: “When we previously settled on our observability tooling, we were a much smaller company. We had a really big focus on observability being used by our most senior engineers, and they drove the evaluation. Now we may have ten or fifteen times more engineers, and a very broad spread of experience. So we need tooling that doesn’t leave us bottlenecked on those most senior engineers. We need everyone – including the engineer who’s been here for four months and is about to go on call.”
How are they going to be able to take advantage of those things and not constantly page the same people who are heavily relied upon from an innovation standpoint? So, it’s about being deliberate about those strategic objectives and then filtering that down. Maybe you don’t even need to change your observability tool. Maybe the key is that you need to direct a large portion of observability data into some other area – maybe a data lake – because you’ve been abusing your observability tooling. These are all strategic initiatives that come out of really stepping back and looking at that bigger picture.
James McGuire: Let’s drill down to what Chronosphere does. In terms of really addressing the observability needs of its clients, what is the Chronosphere advantage?
Ian Smith: I’ve worked with a lot of vendors in this space over the last 10 years, for better or worse. But to me, one of the key differentiators – the one that shines through a lot to our customers – is that, fundamentally and philosophically, we’re built to solve organizational pain. Which sounds very fluffy, I realize that. But to be more specific, think about the kinds of things in a company that sound bad or painful and constantly create friction: finance complaining, as I’ve mentioned before, about unpredictable overages or rapidly escalating bills. On the other side, it might be engineers saying: “I don’t have the visibility I need,” or “I’ll drop that data because it appears to be the most recent thing we added, and that’s what tipped us over the edge into an overage.” Those are organizational pains: you are losing visibility, or people are not seeing the value of that observability investment. Being able to solve those problems ties back to the DNA of the company and where it came from – ultimately, the founders were at Uber, and because of its scale, Uber wasn’t able to just go out and have that knee-jerk reaction of: “Let’s buy this, let’s buy that.”
They had to be very deliberate. They had to come up with a strategy and they had to drill down into what really mattered because it was going to be a significant engineering investment to build tools that would support their particular needs, particularly because they were early adopters of cloud native. And so, some of the things that the team tackled there are very much part of the core Chronosphere offering.
We’ve talked about the ability to control cost at a surface level, but one of the things Chronosphere does is allow you to understand the value of your data: “Who’s using it? What are they using it for? How much are they utilizing it?” And you can compare across those data sets. Maybe you have some data over there that’s only used once every three months for a capacity plan, but it’s very small in footprint. And there’s data over here that’s used every day – for instance, for investigations.
But what pieces of that are actually core to what you need to answer those questions? Secondarily, that will refine the data set down dynamically and, from a business perspective, only charge you for the data that matters and that you’ve chosen to store with us. So, you can generate as much data as you want, figure out what matters to you, reshape that on the fly, and then pay for that. That, as a core capability, really helps you address the fundamental cost problem of: “Well, data’s exploding. I have all this stuff coming from containers and microservices. I don’t know what’s valuable. And I’m under a lot of pressure from finance, particularly in the current economy, not to double or triple that spend over the next 12 months.”
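The shaping Ian describes – keeping only the telemetry that something actually depends on – can be sketched as a toy example. This is not Chronosphere’s actual mechanism or API; the series names, usage counts, and retention rule below are invented purely to illustrate the idea:

```python
# Toy illustration of usage-aware telemetry shaping: keep the metric series
# that alerts and dashboards actually depend on, or that are queried
# regularly, and drop the rest before paying to store them.
# Series names and usage counts are invented for the example.

incoming_series = {
    "http_requests_total":       {"queries_last_30d": 1200, "used_by": ["alerts", "dashboards"]},
    "container_fs_inodes_free":  {"queries_last_30d": 0,    "used_by": []},
    "checkout_latency_p99":      {"queries_last_30d": 340,  "used_by": ["dashboards"]},
    "debug_cache_evictions":     {"queries_last_30d": 1,    "used_by": []},
}

def keep(series_info, min_queries=2):
    """Retain a series if anything depends on it, or it is queried regularly."""
    return bool(series_info["used_by"]) or series_info["queries_last_30d"] >= min_queries

# Only the series that pass the rule are stored (and billed).
stored = {name: info for name, info in incoming_series.items() if keep(info)}
print(sorted(stored))  # ['checkout_latency_p99', 'http_requests_total']
```

In a real system the usage signal would come from query logs rather than a hand-built dictionary, and the rule would be re-evaluated continuously so the retained set reshapes itself as dashboards and alerts change.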
James McGuire: You use an interesting phrase – of course we know what cloud native means in general. But what does it mean in terms of the observability sector and why is it an advantage?
Ian Smith: From the observability perspective, you tend to get a massive increase in data volumes. Maybe you were a web-first company deployed in five different regions. In each of those regions, you have a big monolith, maybe even just running in one VM. It’s generating certain data points about performance and those kinds of things. You basically just have five sources of data. Then you move to a container deployment – ignore microservices for a second – and maybe you’re looking at fifty to a hundred different containers in every single region. You’ve now increased the volume of data fifty to a hundred times, just by the nature of running in containers. Then you add on microservices and you have another multiplier on top of that: I’ve decomposed my monolith into fifteen different microservices. So, there’s a volume-of-data problem, which obviously speaks to the cost piece I mentioned before. But a secondary aspect is that things just got a lot more complicated for anyone to operate – and particularly to diagnose where problems exist.
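The multiplier Ian walks through can be made concrete with back-of-the-envelope arithmetic. The counts below are illustrative assumptions taken from his example (five regions, “fifty to a hundred” containers, fifteen microservices), not measurements:

```python
# Back-of-the-envelope sketch of the cloud native data-volume multiplier.
# All numbers are illustrative assumptions from the example, not measurements.

regions = 5

# Monolith era: one VM per region emitting telemetry.
monolith_sources = regions * 1                       # 5 sources of data

# Containerized monolith: 50-100 containers per region; take the midpoint.
containers_per_region = 75
container_sources = regions * containers_per_region  # 375 sources

# Microservices: the monolith decomposed into ~15 services,
# each running its own containers, stacking another multiplier.
microservices = 15
microservice_sources = container_sources * microservices  # 5625 sources

print(container_sources // monolith_sources)     # 75x from containers alone
print(microservice_sources // monolith_sources)  # 1125x with microservices on top
```

Even with conservative midpoints, telemetry sources grow by two to three orders of magnitude, which is exactly the cost pressure described above.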
And so, it feeds into another problem: “Okay, my engineers used to be able to pretty readily keep all the context in their heads. They had enough tribal knowledge, enough context built up, that: ‘Oh, I can look at these logs, I can look at these metrics, and I can solve my problem.’ But now, with this complexity, I don’t know where the problem lies. Also, I’m not talking to the teams who are building the new microservices on even a weekly or monthly basis. So now I don’t know where to look. I need to adopt distributed tracing, but I’m not a distributed tracing expert.” This aspect I alluded to before – the sophistication of the audience, plus the complexity problem – really is a big struggle, and so is making sure that the tooling can speak to everyone.
And also, that you as an engineering organization are able to put your fingerprints on things and build your own workflows – not just be beholden to an unrealistic ideal the vendor has decided on: “Oh, this is great for marketing purposes. This is the A-B-C-D-E sequence that you should always take to investigate a problem.” Oftentimes, I hear engineers are very frustrated because their most senior people say: “Oh, well, it’s super easy. You just go from A to E” – but the product doesn’t support you skipping B, C, and D, right? So, being able to bring those kinds of things to the audience and allow them to distill that institutional tribal knowledge is very powerful.
James McGuire: The big question is the future of observability. I think companies want to know so they can get ready now: “What do you see a few years ahead, in terms of observability? Where’s the sector going?”
Ian Smith: I think, as we mentioned before, open source instrumentation – defining what data is important to you and being in control of that destiny – is going to become very commonplace. We already have those early adopters in there, and they’re really pushing the development of things like OpenTelemetry. But I think vendor adoption is another piece of that, and it’s going to extend forward into various other parts: “How do I control that data? How do I interact with it as well?”
Extending the skepticism we mentioned before, I think companies are going to become a lot more focused on outcomes and value. It’s going to be less about tactical features.
Of course, features are always going to be important, but there will be a deeper level of inspection: “Well, what are the outcomes? How can I be confident about adoption? How can I be confident about lowering the barrier to entry for my junior developers? How can I be confident that my senior engineers won’t have to be on call as much?” These are fundamental things that maybe don’t immediately map to a feature or function, but the market is going to become very picky about them and force a bit of a change there.
I do think that the AI concept has some very interesting applications. I’ll be a little bit of a wet blanket and say we can’t just pin our hopes on AI and just sort of sit back.
It’s not just a case of: “Let’s pump all of our data over to AI and have it come back to us with answers.” But I do think generative AI in particular has very important ramifications for how we express ourselves to an observability system, and how it can express itself back to us. To be specific, anyone who’s spent a lot of time in observability has probably struggled with either a point-and-click or a query language interface: “I want this data, and I want it in a certain way.” Ultimately, that is expressing yourself to the system. But from a natural language perspective, you can say: “Tell me what’s going wrong with this part of the system.”
And then you can use those same ways of expressing yourself to really narrow down the search: “Discard all the data from non-paying customers. Let me focus on just paying customers. Which customers are currently being affected? What’s common about the customers being affected? How long has this been going on?” These are all things we have been asking of monitoring solutions since the dawn of time, but they’re still quite difficult to express in the system. So, generative AI has the opportunity to really give us an advantage there.
And then there’s also the data coming back, right? It’s one thing for a raw feed of data to come through and fill up a terminal – hopefully we’re past that point. But even the very dense graphs and dashboards we’ve built up don’t necessarily highlight exactly what we should be looking at. So the system can say, for example: “Here’s a plethora of data. Here’s a dashboard, but you should probably look here, here, and here.” And also: “Last time, these things were caused by this, this, and this, and this is what you did to solve them.”
It’s really about allowing the human decision-making factor to come to the fore, rather than that very heavy manual exploration of the data that’s oftentimes reliant on tribal knowledge.

James McGuire: Interesting. That’s a lot of good information. I think I definitely learned something. Thank you so much for sharing your expertise today, and please come back and talk with us again sometime.
Ian Smith: Of course. Thank you so much, James.