Martin Mao, CEO & Cofounder at Chronosphere, joins Corey on Screaming in the Cloud to discuss the trends he sees in the observability industry. Martin explains why he feels the absolute level of observability spend matters less than the velocity at which that spend is growing, and why efficiency has to be built into processes as companies scale new functionality. Corey and Martin also explore how observability can now give business executives top-line visibility and value, rather than being seen as just a necessary cost.
Co-Founder and CEO
Chronosphere
Chief Cloud Economist
The Duckbill Group
Corey is the Chief Cloud Economist at The Duckbill Group. Corey’s unique brand of snark combines with a deep understanding of AWS’s offerings, unlocking a level of insight that’s both penetrating and hilarious. He lives in San Francisco with his spouse and daughters.
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: Human-scale teams use Tailscale to build trusted networks. Tailscale Funnel is a great way to share a local service with your team for collaboration, testing, and experimentation. Funnel securely exposes your dev environment at a stable URL, complete with auto-provisioned TLS certificates. Use it from the command line or the new VS Code extensions. In a few keystrokes, you can securely expose a local port to the internet, right from the IDE.
I did this in a talk I gave at Tailscale Up, their inaugural developer conference. I used it to present my slides and only revealed that that’s what I was doing at the end of it. It’s awesome, it works! Check it out!
Their free plan now includes 3 users & 100 devices. Try it at snark.cloud/tailscalescream
Corey: Welcome to Screaming in the Cloud. I’m Corey Quinn. This promoted guest episode is brought to us by our friends at Chronosphere. It’s been a couple of years since I got to talk to their CEO and co-founder, Martin Mao, who is kind enough to subject himself to my slings and arrows today. Martin, great to talk to you.
Martin: Great to talk to you again, Corey, and looking forward to it.
Corey: I should probably disclose that I did run into you at Monitorama a week before this recording. So, that was an awful lot of fun to just catch up and see people in person again. But one thing that they started off the conference with, in the welcome-to-the-show style of talk, was the question about benchmarking: what observability spend should be as a percentage of your infrastructure spend. And from my perspective, that really feels a lot like a question that looks like, “Well, how long should a piece of string be?” It’s always highly contextual.
Martin: Mm-hm.
Corey: Agree, disagree, or are you hopelessly compromised because you are, in fact, an observability vendor, and it should always be more than it is today?
Martin: I would say, definitely agree with you from an exact number perspective. I don’t think there is a magic number like 13.82% that this should be. It definitely depends on the context of how observability is used within a company, and really, ultimately, just like anything else you pay for, it really gets derived from the value you get out of it. So, I feel like if you feel like you’re getting the value out of it, it’s sort of worth the dollars that you put in.
I do see why a lot of companies and people out there are interested, because they’re trying to benchmark, trying to see, am I following best practice? So, I do think there are probably some best-practice ranges that we see across most typical organizations out there. That’s one thing I’d say. The other thing I’d say when it comes to observability costs is that one of the concerns we’ve seen talking with companies is that the relative proportion of that cost to the infrastructure is rising over time. And that’s probably a bad sign, because if the relative cost of observability is growing faster than infrastructure and you extrapolate that out a few years, the direction in which this is going is bad. So, it’s probably more the velocity of growth than the absolute number that folks should be worried about.
Corey: I think that that is probably a fair assessment. I get it all the time, at least in years past, where companies will say, “For every 1000 daily active users, what should it cost to service them?” And I finally snapped in one of my talks that I gave at DevOps Enterprise Summit, and said, I think it was something like $7.34.
Martin: [laugh]. Right, right.
Corey: It’s an arbitrary number that has no context on your business, regardless of whether those users are, you know, Twitter users or large banks you have partnerships with. But now you have something to cite. Does it help you? Not really. But will it get people to leave you alone and stop asking you awkward questions?
Martin: Right, right.
Corey: Also not really, but at least now you have a number.
Martin: Yeah, a hundred percent. And again, like I said, there’s no magic number—though I’m glad our magic numbers weren’t too far away from each other. But yeah, I mean, there’s no exact number there, for sure. One pattern I’ve been seeing more recently is, rather than asking for the number, there’s been a lot more clarity in companies on figuring out, “Well, okay, before we even pick what the target should be, how much am I spending on this per whatever the unit of efficiency is?” Right?
And generally, that unit of efficiency, I’ve actually seen it being mapped more to the business side of things, so perhaps to the number of customers or to customer transactions and whatnot. Those things are generally easier to model out and easier to justify as opposed to purely, you know, the number of seats or the number of end-users. But I’ve seen a lot more companies at least focus on measuring it. And again, it’s been less about the absolute number and more about the relative change in the number, because I think a lot of these companies are trying to figure out, is this cost scaling with my business in a linear fashion, a sub-linear fashion, or perhaps an exponential fashion? If the costs are growing exponentially, that’s a really bad thing that you want to get ahead of.
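To make the unit-economics idea concrete, here is a minimal sketch, with entirely hypothetical numbers, of the calculation Martin is describing: observability spend divided by a business unit such as customer transactions, tracked month over month so you can see whether the unit cost is flat or climbing.

```python
# Minimal sketch of the unit-economics check described above.
# All figures are hypothetical, illustrative numbers.

monthly = [
    # (observability_cost_usd, customer_transactions)
    (120_000, 40_000_000),
    (150_000, 46_000_000),
    (195_000, 52_000_000),
]

prev_unit_cost = None
for cost, transactions in monthly:
    unit_cost = cost / transactions  # dollars per customer transaction
    if prev_unit_cost is None:
        print(f"unit cost: ${unit_cost:.6f}/transaction")
    else:
        change = (unit_cost / prev_unit_cost - 1) * 100
        print(f"unit cost: ${unit_cost:.6f}/transaction ({change:+.1f}% month over month)")
    prev_unit_cost = unit_cost
```

If that per-transaction number keeps rising, observability spend is growing faster than the business—the velocity signal Martin says is more worrying than any absolute figure.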
Corey: That I think is probably the real question people are getting at is, it seems like this number only really goes up and to the right, it’s not something that we have any real visibility into, and in many cases, it’s just the pieces of it that rise to the occasion. A common story is someone who winds up configuring a monitoring system, and they’ll be concerned about how much they’re paying that vendor, but ignore the fact that, well, it’s beating up your CloudWatch API charges all the time on this other side as well, and data egress is not free—surprise, surprise. So, it’s the direct costs, it’s the indirect costs. And the thing people never talk about, of course, is the cost of people to feed and maintain these systems.
Martin: Yeah, a hundred percent, you’re spot on. There’s the direct costs, there’s the indirect costs. Like you mentioned, in observability, network egress is a huge indirect cost. There’s the people that you mentioned that need to maintain these systems. And I think those are things that companies definitely should take into account when they think about the total cost of ownership there.
I think what matters even more in observability, actually—and this is perhaps a hard thing to measure as well—is that often we ask companies, “Well, what is the cost of downtime?” Right? If your business is impacted and your customers are impacted and you’re down, what is the cost of each additional minute of downtime? And then the effectiveness of the tool can be evaluated against that, because observability is not just any other developer tool; it’s the thing that’s giving you insight into, is my business or my product or my service operating in the way that I intend? And, you know, is my infrastructure up, for example? So, I think there’s also the piece of, what is the tool really doing in terms of lost revenue or brand impact? Those are things that are quite easily overlooked as well.
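As a back-of-the-envelope illustration of that downtime framing—again with made-up numbers—here is how the avoided cost of downtime might be weighed against what the observability tooling costs. Every figure below is an assumption for the sake of the example.

```python
# Hypothetical downtime-cost comparison; every number below is assumed.
revenue_per_minute = 2_000        # revenue lost per minute of downtime
incidents_per_year = 24           # incidents expected in a year
minutes_saved_per_incident = 15   # faster detection/triage attributed to better observability

avoided_loss = revenue_per_minute * incidents_per_year * minutes_saved_per_incident
annual_tool_cost = 400_000        # assumed total cost of ownership for the tooling

print(f"avoided downtime loss: ${avoided_loss:,}")                     # $720,000
print(f"net value of tooling:  ${avoided_loss - annual_tool_cost:,}")  # $320,000
```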
Corey: I am curious to see whether you have noticed a shifting in the narrative lately, where, as someone who sells AWS cost optimization consulting as a service, something that I’ve noticed is that until about a year ago, no one really seemed to care overly much about what the AWS bill was. And suddenly, my phone’s been ringing off the hook. Have you found that the same is true in the observability space, where no one really cared what the observability cost, until suddenly, recently, everyone does or has this been simmering for a while?
Martin: We have found that exact same phenomenon. What I tell most companies out there is, we provide an observability platform that’s targeted at cloud-native environments. So, if you’re running a cloud-native architecture—a microservices-oriented architecture on containers—that’s the type of architecture we’ve optimized our solution for. And historically, we’ve always done two things to try to differentiate: one is to provide a better tool to solve that particular problem in that particular architecture, and the second is to be a more cost-efficient solution in doing so. And not just cost-efficient, but a tool that shows you the cost and the value of the data that you’re storing.
So, we’ve always had both sides of that equation. And to your point, in conversations in the past years, they’ve generally been led with, “Look, I’m looking for a better solution. If you just happen to be cheaper, great. That’s a nice cherry on top.” Whereas this year, the conversations have flipped 180, in which case, most companies are looking for a more cost-efficient solution. If you just happen to be a better tool at the same time, that’s more of a nice-to-have than anything else. So, that conversation has definitely flipped 180 for us. And we found a pretty similar experience to what you’ve been seeing out in the market right now.
Corey: Which makes a tremendous amount of sense. I think that there’s an awful lot of—oh, we’ll just call it strangeness, I think. That’s probably the best way to think about it—in terms of people waking up to the grim reality that not caring about your bills was functionally a zero-interest-rate phenomenon in the corporate sense. Now, suddenly, everyone really has to think about this in some unfortunate and some would say displeasing ways.
Martin: Yeah, a hundred percent. And, you know, it was a great environment for tech for over a decade, right? So, it was an environment that I think a lot of companies and a lot of individuals got used to, and perhaps a lot of folks that have entered the market in the last decade don’t know of another situation or another set of conditions where, you know, efficiency and cost really do matter. So, it’s definitely top of mind, and I do think it’s top of mind for good reason. I do think a lot of companies got fairly inefficient over the last few years chasing that top-line growth.
Corey: Yeah, that has been—and I think it makes sense in the context within which people were operating. Because before a lot of that wound up hitting, it was, well, grow, grow, grow at all costs. “What do you mean you’re not doing that right now? You should be doing that right now. Are you being irresponsible? Do we need to come down there and talk to you?”
Martin: A hundred percent.
Corey: Yeah, it’s like eating your vegetables. Now, it’s time to start paying attention to this.
Martin: Yeah, a hundred percent. It’s always a trade-off, right? In an individual company or an individual team, you only have so many resources to prioritize. I do think, to your point, in a zero-interest environment, trying to grow that top line was the main thing to do, and hence everything was pushed on how quickly can we deliver new functionality, new features, or grow that top line. Whereas efficiency is always something I think a lot of companies looked at as something they could go deal with later on and fix. And I feel like that time has now just come.
Corey: I will say that I somewhat recently had the distinct privilege of working with a company whose observability story was effectively, “We wait for customers to call and tell us there’s a problem and then we go looking into it.” And on the one hand, my immediate former SRE reflexes kicked in, and I recoiled, but this company has been in this industry longer than I have. They clearly have a model that is working for them and for their customers. It’s not the way I would build something, but it does seem that for some use cases, you absolutely are going to be okay with something like that. And I should probably point out, they were not, for example, a bank where yeah, you kind of want to get some early warning on things that could destabilize the economy.
Martin: Right, right. I mean, to your point, depending on the context and the company, it could definitely make sense—and depending on how they execute it as well, right? So, you know, you called out an example already: if they were a bank, or if correctness or timeliness of a response was important to that business, it’s perhaps not the best thing to have your customers find out, especially if you have a ton of customers at the same time. However, if it’s a different type of business where the responses are perhaps more asynchronous, or you don’t have a lot of users encountering an issue at the same time, or perhaps you have a great A/B experimentation and testing platform, there are definitely conditions in which that could be a viable option.
Especially when you weigh up the cost and the benefit, right? If the cost of having a few customers have a bad experience is not that much to the business, and the benefit is that you don’t have to spend a ton on observability, perhaps that’s a trade-off the company is willing to make. In most of the businesses we’ve been working with, I would say that’s probably not been the case, but I do think there’s probably some bias and some skew there, in the sense that a company that cares about these things is perhaps more likely to talk to an observability vendor like us to try to fix these problems.
Corey: When we spoke a few years back, you definitely were focused on the large, one would say, almost hyperscale style of cloud-native build-out. Is that still accurate, or has the bar to entry changed since we last spoke? I know you’ve raised an awful lot of money, which, good for you. It’s a sign of a healthy, robust VC ecosystem. The counterpoint to that is, they’re probably not investing in a company whose total addressable market is, like, 15 companies that must be at least this big.
Martin: [laugh]. Yeah, a hundred percent. A hundred percent. So, I’ll tell you that the bar to entry definitely has changed, but it’s not due to a business decision on our end. If you think about how we started and, you know, the focus area, we’re really targeting accounts that are adopting cloud-native technology.
And it just so happens that the large tech companies, the decacorns, and the hyperscalers were the earliest adopters of cloud-native—containerization, microservices—hence there was a high correlation between the companies that had that problem and the companies that we could serve. Luckily for us, the trend has been that more of the rest of the industry has gone down this route as well. And it’s not just new startups; any new startup these days probably starts off cloud-native from day one, but what we’re finding is the more established, larger enterprises are making this shift as well. I think folks out there like Gartner have studied this and predicted that by about 2028, I believe was the date, about 95% of applications are going to be containerized in large enterprises. So, it’s definitely a trend that the rest of the industry will follow. And as they continue down that trend, our addressable market will grow, because the number of use cases where our technology shines will grow along with it.
Corey: I’m also curious about your description of being aimed at cloud-native companies. You gave one example of microservices powered by containers, if I understood correctly. What are the prerequisites for this? When you say that, it almost sounds like you’re trying to avoid naming a specific architecture that you don’t deal well with or don’t want to support for a variety of reasons. Is that what it is, or is it that you must be built in certain ways or the product does not work super well for you? What is it you’re trying to say with that, is what I’m trying to get at here.
Martin: Yeah, a hundred percent. If you look at the founding story here, my co-founder and I found Uber going through this transition to both a new architecture—in the sense that they were moving to containers and building a microservices-oriented architecture—and a DevOps mentality as well. So, it was almost a new way of building software. And what we found is that when you develop software in this particular way—when you’re an individual developer building a tiny piece of functionality as a microservice and rolling it out into production multiple times a day—the traditional tools, the application performance monitoring tools, the IT monitoring tools that existed before this architecture and this way of developing software, just weren’t a good fit.
So, the whole reason we exist is that we had to figure out a better way of solving this particular problem for the way that Uber built software, which was more of a cloud-native approach. And again, it just so happens that the rest of the industry is moving down this path as well, and hence that problem applies to a larger portion of the companies out there. If you look into why the existing solutions can’t solve these problems well: an application performance monitoring tool, an APM tool, is really focused on introspecting into that application and its interaction with the operating system or the underlying hardware. And these days, that is less important when you’re running inside a container. Perhaps you don’t even have access to the underlying hardware or the operating system, and what you care about is how that piece of functionality interacts with all the other pieces of functionality out there, over a network call.
So, the architecture and the conditions ask for a different type of observability, a different type of monitoring, and hence you need a different type of solution to solve for this new world. Along with this—and this is related to the cost as well—as we go from virtual machines to containers, the sheer volume of data that gets produced grows, because everything is much smaller and a lot more ephemeral than it was before, and every small piece of infrastructure, every small piece of code, still needs as much monitoring and observability as it did before. So, the sheer volume of data is so much larger for the same amount of infrastructure, for the same amount of hardware that you used to have, and that’s really driving a huge problem in terms of being able to scale for it and being able to pay for these systems as well.
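To put rough numbers on that data-volume point—all counts here are assumed, purely for illustration—consider how the same workload multiplies into time series when it moves from a handful of VMs to many short-lived containers.

```python
# Rough, assumed cardinality math: same workload, VM-era vs. cloud-native.
metrics_per_instance = 200            # metrics emitted per running instance (assumed)

vm_instances = 50                     # one monolith spread across 50 VMs
container_replicas = 50 * 20          # the same workload split into 20 microservices
pod_churn_factor = 3                  # ephemeral pods replaced over time create new series

vm_series = vm_instances * metrics_per_instance
container_series = container_replicas * metrics_per_instance * pod_churn_factor

print(f"VM-era time series:        {vm_series:,}")         # 10,000
print(f"Cloud-native time series:  {container_series:,}")  # 600,000
```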
Corey: Tired of Apache Kafka’s complexity making your AWS bill look like a phone number? Enter Redpanda. You get 10x your streaming data performance without having to rob a bank. And migration? Smoother than a fresh jar of peanut butter. Imagine cutting as much as 50% off your AWS bills. With Redpanda, it’s not a dream, it’s reality. Visit go.redpanda.com/duckbill. Redpanda: Because Kafka shouldn’t cause you nightmares.
Corey: I think there’s a common misconception in the industry that people are going to either have ancient servers rotting away in racks, or they’re going to build something greenfield, the way we see done on keynote stages all the time by companies that have been around with this architecture for less than 18 months. In practice, I find it’s awfully frequent that this is much more of a spectrum, decided case by case, per workload. I haven’t met too many data center companies where everything’s the disaster that the cloud companies like to paint it as, and vice versa, I have never yet seen an architecture that really existed as described in a keynote presentation.
Martin: A hundred percent agree with you there. It’s not clean-cut from that perspective. And you’re also forgetting the messy middle, right? Often what happens is, there’s a transition. If you don’t start off cloud-native from day one, you do need to transition there from your monolithic applications, from your VM-based architectures, and often a use case can’t transform over perfectly.
What ends up happening is you start moving and containerizing some functionality, and there are still dependencies between the old architecture and the new architecture. And companies have to live in this middle state, perhaps for a very long time. So, it’s definitely true—it’s not a clean-cut transition. But that middle state is actually one that a lot of companies struggle with, because all of a sudden you only have a partial view of the world, of what’s happening, with your old tools: they’re not well suited for the new environments. Perhaps you’ve got to start bringing in new tools and new ways of doing things for your new environments, and they’re perhaps not the best suited for the old environment either.
So, you do end up in this middle state where you need a good solution that can really handle both, because there are a lot of interdependencies between the two. And one of the things we strive to do here at Chronosphere is to help companies through that transition. So, it’s not just all of the new use cases and all of your new environments; helping companies through this transition is actually pretty critical as well.
Corey: My question for you is, given that you have, I don’t want to say a preordained architecture that your customers have to use, but there are certain assumptions you’ve made based upon both their scale and the environment in which they’re operating. How heavy of a lift is it for them to wind up getting Chronosphere into their environments? Just because it seems to me that it’s not that hard to design an architecture on a whiteboard that can meet almost any requirement. The messy part is figuring out how to get something that resembles that into place on a pre-existing architecture.
Martin: Yeah. I’d say it’s something we’ve spent a lot of time on. The good thing for the observability industry overall is that open-source standards now exist where they didn’t before. In the APM-based world, it was all proprietary agents producing the data themselves, and that data would only really work with one vendor’s product. Whereas if you look at a modern environment, the production of the data has actually shifted from the vendor down to the companies themselves, and they’ll be producing that data in open-source standard formats like OpenTelemetry for distributed traces, or perhaps Prometheus for metrics.
So, the good thing is that for all of the new environments, there’s a standard way to produce this data, and you can send it to whichever vendor you want on the back end. That just makes the implementation for the new environments so much easier. Now, for the legacy environments, or if a company is shifting over from an existing tool, there is actually a messy migration, because often you’re trying to replace proprietary formats and proprietary ways of producing data with open-source standard ones. That’s something we at Chronosphere view as a particular problem we need to solve, and we take responsibility for solving it for a company, because what we’re trying to sell companies is not just a tool; what we’re really trying to give them is the solution to the problem, and the problem is they need an observability solution end to end. So, this often involves us coming in and helping them not just convert the data types over, but also move over existing dashboards and existing alerts.
There’s a huge lift that perhaps every developer in a company would otherwise have to do if we didn’t come in and do it on their behalf. So, it’s an additional responsibility, and it’s not an easy thing to do. We’ve built some tooling that helps with it, and we spend a lot of manual hours going through this, but it’s necessary in order to help a company transition. The good thing is, once they have transitioned into the new way of doing things and they depend on open-source standard formats, they are no longer locked in. So, future transitions will be much easier; however, the current one does take a bit of effort.
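As a small illustration of the open-standards point above, here is what producing metrics in a vendor-neutral format can look like with the official Prometheus Python client. The metric names, port, and simulated workload are hypothetical; the point is that the application emits data any compatible backend can scrape, rather than relying on a proprietary agent.

```python
# Minimal sketch: an app exposing metrics in the open Prometheus exposition format.
# Metric names, port, and simulated work are illustrative only.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("checkout_requests_total", "Checkout requests handled")
LATENCY = Histogram("checkout_latency_seconds", "Checkout request latency in seconds")

def handle_checkout():
    with LATENCY.time():                       # records how long the block takes
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    REQUESTS.inc()

if __name__ == "__main__":
    start_http_server(8000)  # metrics served at http://localhost:8000/metrics
    while True:
        handle_checkout()
```

Any backend that speaks the Prometheus format—whether Prometheus itself or a vendor such as Chronosphere—can consume those metrics without the application changing.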
Corey: I think that’s probably fair. And then there’s no such thing, in my experience, as an easy deployment for something that is large enough to matter. And let’s be clear, people are not going to be deploying something as large-scale as Chronosphere on a lark. This is going to be when they have a serious application with serious observability challenges. So, it feels like, on some level, that even doing a POC is a tricky proposition, just due to the instrumentation part of it. Something I’ve seen is that very often, enterprise sales teams will tell you that by the time they can get someone to successfully pull off a POC, the deal win rate is something like 95%, just because no one wants to try that in a bake-off with something else.
Martin: Yeah, I’d say we do see high pilot conversion rates, to your point. For us, it’s perhaps a little bit easier than other solutions out there, in the sense that with our type of observability tooling, an individual team can pick this up for their one use case and get value out of it. It’s not that every team across an environment or every team in an organization needs to adopt it. So, while generally a company will want to pilot—and it’s not something you can play around with online by yourself, because it does need a particular deployment and a little bit of setup—generally one single team can perform that and see value out of the tool. And that value can be extrapolated and applied to all the other teams as well. So, you’re correct, but it hasn’t been a huge lift. We’ve seen these processes be as short as perhaps 30-something days end to end, which is generally a pretty fast-moving process.
Corey: Now, I guess, on some level, I’m still trying to wrap my head around the idea of the scale that you operate at, just because as you mentioned, this came out of Uber—which is beyond imagining for most people—and you take a look at a wide variety of different use cases. And in my experience it’s never been, “Holy crap, we have no observability and we need to fix that.” It’s, “There are a variety of systems in place that just are not living up to the hopes, dreams, and potential that they had when they were originally deployed.” Either due to growth or due to lack of product fit, or the fact that it turns out in a post zero-interest-rate world, most people don’t want to have a pipeline of 20 discrete observability tools.
Martin: Yep, yep. No, a hundred percent. And to your point there, ultimately, that’s our goal—in many companies we’re replacing six to eight tools with a single platform. And while that’s great to do, it definitely doesn’t happen overnight. It takes time.
You know, in a pilot, or when you’re looking at it, we’re picking a few of the use cases to demonstrate what our tool can do across many other use cases, and then during onboarding—perhaps over a period of months, or even a year plus—we onboard those use cases piece by piece. So, it’s definitely not a quick overnight process, but something that can help each developer in that company be more effective, and something that can really help move the bottom line in terms of far better price efficiency—these are generally not quick fixes; they take some time and a little bit of investment to achieve the results.
Corey: So, a question I do have for you, given that I just watched an awful lot of people talking about observability for three days at Monitorama, what are people not talking about? What did you not see discussed that you think should be?
Martin: Yeah, one thing I think often gets overlooked, especially in today’s climate, is that observability gets relegated to a cost center. It’s something that every company must have, every company has today, and it’s often looked at as a tool that gives you insights about your infrastructure and your applications—a backend tool, something you have to have and have to pay for that doesn’t directly move the needle for the business top line. And I think that’s something companies don’t talk about enough. From our experience at Uber, and through most of the companies we work with here at Chronosphere, yes, there are infrastructure problems and application-level problems that we help companies solve, but ultimately, the more mature organizations, when it comes to observability, are often starting to get real-time insights into the business, not just the application layer and the infrastructure layer.
And if you think about it, in companies that are cloud-native architected, there’s not one single endpoint or one single application that fulfills a single customer request. So, even if you could look at all the individual pieces, what we actually do for customers in our products and services spans across so many of them that you often need to introduce a new view—a view that’s focused just on your customers, just on the business—and apply the same types of techniques to your business that you apply to your backend infrastructure. Now, this isn’t a replacement for your BI tools; you still need those. But what we find is that BI tools are used more for longer-term strategic decisions, whereas you may need to perform a lot of more tactical, business-operational functions based on having a live view of the business. So, observability is often only ever thought about for infrastructure, only ever thought about as a cost center, but ultimately observability tooling can add a lot directly to your top line by giving you visibility into the products and services that make up that top line. And I would say the more mature organizations we work with here at Chronosphere all have their executives looking at monitoring dashboards to get a good sense of what’s happening in their business in real time. So, I think that’s something a lot more companies will hopefully evolve into over time, so they really see the full benefit of observability and what it can do for a business’s top line.
Corey: I think that’s probably a fair way of approaching it. It seems similar, in some respects, to what I tend to see over in the cloud cost optimization space. People often want something prescriptive—do this, do that, do the other thing—but it depends entirely on what the needs of the business are internally, the stories they wind up working with, what their constraints are, and what their architectures are doing. Very often it’s a “let’s look and figure out what’s going on,” and accidentally they discover they can blow 40% off their spend by just deleting things that aren’t in use anymore. That becomes increasingly uncommon with scale, but it’s still one of those questions of, “What do we do here, and how?”
Martin: Yep, a hundred percent.
Corey: I really want to thank you for taking the time to speak with me today about what you’re seeing. If people want to learn more, where’s the best place for them to find you?
Martin: Yeah, the best place is probably our website, Chronosphere.io, to find out more about the company. Or, if you want to chat with me directly, LinkedIn is the best place to come find me, under my name.
Corey: And we will, of course, put links to both of those things in the show notes. Thank you so much for suffering the slings and arrows I was able to throw at you today.
Martin: Thank you for having me Corey. Always a pleasure to speak with you, and looking forward to our next conversation.