Screaming in the Cloud: What goes into an observability strategy?


Chronosphere’s Rachel Dines chats with Corey Quinn about the necessity of an observability strategy, getting ROI out of tooling, and cost challenges.

Chronosphere Staff

As part of digital transformation initiatives, organizations are continuing to adopt observability to gain insight into their cloud native environments and ensure all systems are online. However, it’s not as simple as just purchasing an observability tool; teams need the right processes, staff, and business goals in place to reap the benefits and ensure costs are kept reasonable. 

Rachel Dines, Chronosphere’s Head of Product and Solutions Marketing, recently stopped by the Screaming in the Cloud podcast to reiterate this point, and chat about what exactly is involved in observability tool and process adoption.  

If you don’t have time to listen to the full podcast, be sure to check out the transcript below.

The evolution of the observability market

Corey Quinn: Hello and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Welcome to Screaming in the Cloud. I’m Corey Quinn. Today’s featured guest episode is brought to us by our friends at Chronosphere and they have also brought us Rachel Dines, their Head of Product and Solutions Marketing. Rachel, great to talk to you again.

Rachel Dines: Hey Corey, great to talk to you too.

Corey: Watching your trajectory has been really interesting, just because starting off, when we were first learning who each other were, you were working at CloudHealth, which has since become part of VMware, and I was trying to figure out that the cloud runs on money. How about that? It feels like it was a thousand years ago, but neither one of us is quite that old.

Rachel: It does feel like several lifetimes ago. You were just this snarky guy with a few followers on Twitter and I was trying to figure out what you were doing mucking around with my customers. We kind of both figured out what we were doing.

Corey: Speaking of that iterative process, today you are at Chronosphere, which is an observability company. We would’ve called it a monitoring company five years ago, but now that’s become an insult after the observability war has settled. I wanna talk to you about something that I’ve been kicking around for a while, because I feel like there’s a gap somewhere. 

Let’s say that I build a crappy web app because all of my web apps inherently are crappy and it makes money through some mystical form of alchemy. I have a bunch of users and I eventually realize, “I should probably have a better observability story than waiting for the phone to ring and a customer telling me it’s broken.” So I start instrumenting various aspects of it that seem to make sense. Maybe I go too low level, like looking at all the disks on every server to tell me if they’re getting full or not. Like they’re ancient servers. Maybe I just have a Pingdom equivalent of “Is the website up enough to respond to a packet?”

And as I wind up experiencing different failure modes and getting yelled at by different constituencies in my own career trajectory, my own boss, you start instrumenting for all those different kinds of breakages. You start aggregating the logs somewhere and the volume gets bigger and bigger with time. But it feels like it’s sort of a reactive process as you stumble through that entire environment. I know it’s not just me because I’ve seen this unfold in similar ways in a bunch of different companies. It feels to me very strongly like it is something that happens to you rather than something you set about from day one with a strategy in mind. What’s your take on an effective way to think about strategy when it comes to observability?

Rachel: You just nailed it. That’s exactly the kind of progression that we so often see. And that’s what I really was excited to talk with you about today. 

Corey: I was worried for a minute there that you’d be like, “What? What the hell are you talking about? Are you just some sort of crap engineer?” What I’m trying to figure out is [if there was] some magic that I just was never connecting. Because it always feels like you’re in trouble because the site’s always broken, and it’s like, “Oh, if the disk fills up, now we’re gonna start monitoring to make sure the disk doesn’t fill up.” Then you wind up getting barraged with alerts and no one wins and it’s an uncomfortable period of time.

Rachel: Uncomfortable period of time. That is one very polite way to put it. I mean, I will say it’s very rare to find a company that actually sits down and thinks, “This is our observability strategy, this is what we wanna get out of observability.” You can think about a strategy in the old-school sense. I was an industry analyst, so I’m gonna have to go back to my roots at Forrester and think about the people, the process and the technology. 

But really what the bigger component here is: What’s the business impact? What do you wanna get out of your observability platform? What are you trying to achieve? And a lot of the time people have thought, “Observability strategy, great, I’m just gonna buy a tool. That’s it. That’s my strategy.” I hate to break it to you, but buying tools is not a strategy. I’m not gonna say buy this tool. I’m not even gonna say buy Chronosphere. You should buy Chronosphere, but that’s not a strategy.

Corey: Of course I’m gonna throw money by the wheelbarrow at various observability vendors and hope it solves my problem. But if that solved the problem, and I have to be direct, I’ve never spoken to those customers.

Rachel: Exactly. I mean that’s why this space is such a great one to come in and be very disruptive in. And I think back in the days when, you know, we were running in data centers, maybe even before virtual machines, you could probably get away with not having a monitoring strategy. I’m not gonna call it observability; that’s not what we called it back then. You could get away with not having a strategy because what was the worst that was gonna happen, right? 

It was like there was a finite amount that your monitoring bill could be. There was a finite amount that your customer impact could be. You were playing the penny slots, right? We’re not in the penny slots anymore. We’re at the $50 craps table, and it’s Las Vegas, and if you lose the game you’re gonna have to run down the street without your shirt. The game and the stakes have changed, and we’re still pretending like we’re playing penny slots, and we’re not anymore.

Corey: That’s a good way of framing it. I still remember some of my biggest observability challenges were building highly available rsyslog clusters so that you could bounce a member and not lose any log data, because some of that was transactionally important. We’ve gone beyond that to a stupendous degree, but it still feels like you don’t wind up building this into the application from day one, more’s the pity, because if you did, and did that intelligently, that opens up a whole world of possibilities. I dream of that changing, where one day whenever you start to build an app, you just push the button and automatically instrument with OTel. So you instrument the thing once, everywhere it makes sense to do it, and then you can do your vendor selection and your “what you send where” decisions later in time. But these days, we’re not there.

Working with legacy systems and technology democratization

Rachel: Well, I mean, there’s also the question of just the legacy environment and the tech debt, even if you wanted to [have observability]. Actually, I was having a beer yesterday with a friend who’s a VP of engineering, and he’s got his new environment that they’re building with observability instrumented from the start. How beautiful: they’ve got OpenTelemetry (OTel), they’re gonna have tracing. And then he’s got his legacy environment, which is a hot mess. 

So you know, there’s always gonna be this bridge of the old and the new, but this is where it comes back to: no matter where you’re at, you can stop and think, “What are we doing and why? What is the cost of this?” And not just cost in dollars, which I know you and I could talk about very deeply for a long period of time, but the opportunity cost: developers are working on stuff when they could be working on something that’s more valuable. Or the cost of making people work round the clock, trying to troubleshoot issues when there could be an easier way. So I think it’s stepping back and thinking about cost in terms of dollars, cents, time, opportunity, and then also impact. It’s starting to make some decisions about what you’re gonna do differently in the future. Once again, you might be stuck with some legacy stuff that you can’t really change that much, but you gotta be realistic about where you’re at.

Corey: I think that it is a hard lesson, to be very direct, in that companies need to learn it the hard way, for better or worse. Honestly, this is one of the things that I always noticed in startup land, where you had a whole bunch of, frankly, relatively early career engineers in their early twenties, if not younger, but then the ops person was always significantly older, because the thing you actually want to hear from your [operations] person, regardless of how you slice it, is: “I’ve seen this kind of problem before. Here’s how we fixed it.” Or even better: “Here’s a thing we’re doing and I know how that’s going to become a problem. Let’s fix it before it does.” 

Rachel: Yeah, that’s an interesting point you make, and it kind of leads me down a little bit of a side note. The really interesting anti-pattern that I’ve been seeing in a lot of companies is that the more seasoned operations person is the one who everyone calls when something goes wrong. It’s like, “Oh my god, I don’t know how to fix it. This is a big hairy problem. I’ll call that one operations person or that very experienced person.” That experienced person then becomes this huge bottleneck in solving problems. They might even be the only one who knows how to use the observability tool. 

If we can’t find a way to democratize our observability tooling a little bit more, so that day-to-day engineers, like more junior engineers, newer ones, people who are still ramping, can actually use the tool and be successful, we have a big problem when these ops people walk out the door. Maybe they retire, maybe they just get sick of it. We have these massive bottlenecks in organizations, whether it’s operations or DevOps or whatever, that I see often exacerbated by observability tools. 

Going beyond tool selection

Corey: On some level, it feels like a lot of these things can be fixed with tooling. And I’m not going to say that tools aren’t important. Have you ever tried to implement observability by hand? It doesn’t work. There have to be computers somewhere in the loop, if nothing else. And then it just seems to devolve into a giant swamp of different companies doing different things, taking different approaches. On some level, whenever you read the marketing or hear the stories any of these companies tell, you almost have to normalize it, translating from whatever marketing language they’ve got into something that comports with the reality of your own environment, and seeing if they align. That feels like it is so much easier said than done.

Rachel: This is a noisy space that is for sure. I think we could go out to 10 people right now and ask those 10 people to define observability and we would come back with 10 different definitions and then you throw a marketing person in the mix, right? 

But like I said a minute ago, the answer isn’t tools. Tools can be part of the strategy. But if you’re just thinking, “I’m gonna buy a tool and that’s going to solve my problem,” you’re gonna end up like this company I was talking to recently that has 25 different observability tools and not only do they have 25 different observability tools, what’s worse is they have 25 different definitions for their service-level objectives (SLOs) and 25 different names for the same metric. To be honest, it’s just a mess. I’m not saying go be draconian and you know, tell all the engineers like you can only use this tool and we use that tool. You gotta figure out this kind of balance of like hands on, hands off, you know, how much do you centralize, how much do you push and standardize. Otherwise you end up with just a huge mess.

Corey: On some level it feels like it was easier back in the days of building it yourself with Nagios, because there’s only one answer and it sucks, unless you wanna start going down the world of HP OpenView, which is, step one: hire a 50-person team to manage OpenView. Okay, that’s not gonna solve my problem either. So let’s get a little bit more specific. How does Chronosphere approach this? 

Because historically when I’ve spoken to folks at Chronosphere, there isn’t that much of a day one story of “I’m gonna build a crappy web app, let’s instrument it for Chronosphere.” There’s a certain “You must be at least this tall to ride” implicit expectation built into the product just based upon its origins. I’m not saying that doesn’t make sense, but it also means there’s really no such thing as a greenfield buildout for you either.

Rachel: Well, yes and no. I mean I think there’s no greenfield out there because everyone’s doing something for observability, monitoring or whatever you wanna call it, right? Whether they’ve got Nagios, whether they’ve got Datadog, whether they’ve got something else in there, they have some way of introspecting their systems, right? 

So one of the things that Chronosphere is built on, and I actually think this is part of a way you might think about building out an observability strategy, is this concept of control and open source compatibility. We only collect data via open source standards. You have to send us data via Prometheus or via OpenTelemetry. It could be older standards like StatsD or Graphite, but we don’t have any proprietary instrumentation. If I was making a recommendation to somebody building out their observability strategy right now, I would say open [source] all day long.

Because that gives you a huge amount of flexibility in the future, because guess what? You might put together an observability strategy that seems like it makes sense for right now. I was talking to a B2B SaaS company that told me that they made a choice a couple years ago on an observability tool. It seemed like the right choice at the time. They were growing so fast, they very quickly realized it was a terrible choice, but now it’s gonna be really hard for them to migrate because it’s all based on proprietary standards. Of course, a few years ago they didn’t have the luxury of OTel and all these standards, but now that we have them, we can use them to kind of future-proof our mistakes. So that’s one big area that, once again, is both my recommendation and happens to be our approach at Chronosphere.
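To make the "open standards" point a little more concrete, here is a minimal sketch of what the same counter looks like on the wire in two of the formats mentioned above: a StatsD counter line and a Prometheus text exposition sample. The function names and the metric names are made up for illustration, label escaping is deliberately left out, and nothing here is Chronosphere's code or any client library's API.

```python
def to_statsd(name: str, value: int) -> str:
    """Render a counter increment as a StatsD line: <name>:<value>|c"""
    return f"{name}:{value}|c"


def to_prometheus(name: str, value: int, labels: dict[str, str]) -> str:
    """Render one sample in the Prometheus text exposition format.

    Assumption for this sketch: label values contain no quotes, backslashes
    or newlines, so no escaping is performed.
    """
    if labels:
        rendered = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{rendered}}} {value}"
    return f"{name} {value}"


# Hypothetical metric names; StatsD conventionally uses dots,
# Prometheus uses underscores and a _total suffix for counters.
statsd_line = to_statsd("checkout.orders", 1)
prom_line = to_prometheus("checkout_orders_total", 42, {"region": "us-east"})

print(statsd_line)
print(prom_line)
```

Because both formats are openly documented and widely supported, instrumentation written against them is not tied to a single backend, which is the flexibility Rachel is describing.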

Corey: I think that that’s a fair way of viewing it. It’s a constant challenge too just because you mentioned Datadog earlier, for example. I will say that for years I have been asked whether or not at the Duckbill Group we look at Azure bills or GCP bills. Nope. We are pure AWS. Recently we started to hear that same inquiry specifically around Datadog to the point where it has become a board level concern at very large companies. And that is a challenge on some level. 

I don’t deviate from my typical path of fixing AWS bills and that’s enough impossible problems for one lifetime. But there is a strong sense that you want to record as much as possible for a variety of excellent reasons, but there’s an implicit cost to doing that. In many cases the cost of observability becomes a massive contributor to the overall cost. Netflix has said in talks before that they’re effectively an observability company that also happens to stream movies just because it takes so much effort engineering and raw computing resources in order to get that data and do something actionable with it. It’s a hard problem.

Rachel: It’s a huge problem and it’s a big part of why I work at Chronosphere, to be honest. When I was, you know, towards the tail end at my previous company in cloud cost management, I had a lot of customers coming to me saying, “Hey, when are you gonna tackle our Datadog or our New Relic or whatever?” Similar to the experience you’re having now. 

Corey, this was happening to me three, four years ago and I noticed that there was definitely a correlation between people who are having these really big challenges with their observability bills and people that were adopting Kubernetes and microservices and cloud native. 

It was around that time that I met the Chronosphere team, which is exactly what we do, right? We focus on observability for these cloud native environments where observability data just goes wild. We see 10x to 20x as much observability data, and that’s what’s driving up these costs. And yeah, it is becoming a board-level concern. I mean, coming back to the concept of strategy, if observability is the second or third most expensive item in your engineering bill, like obviously cloud infrastructure is number one, and number two or number three is probably observability, how can you not have a strategy for that? How can this be something the board asks you about and you’re like, what are we trying to get out of this? What’s our purpose? [It’s] troubleshooting.

Getting observability “out of the box”

Corey: Right? Because it turns into business metrics as well. It’s not just about is the site up or not? One of the things that always drove me nuts in, not just in the observability space, but even in cloud costing is where, your costs have gone up this week, so you get a frowny face or it’s in the red, but for a lot of architectures and a lot of customers, that’s because you’re doing a lot more volume that translates directly into increased revenues, increased things you care about. You don’t have the position or the context to say that’s good or that’s bad. It simply is. You can start deriving business insight from that. And I think that is the real observability story that I think has largely gone untold at tech conferences at least.

Rachel: It’s so right. I mean, spending more on something is not inherently bad if you’re getting more value out of it. And it’s definitely a challenge on the cloud cost management side: my costs are going up, but my revenue’s going up a lot faster, so I’m okay. I think [in] some places, we put observability in this box of “it’s for low-level troubleshooting,” but really, if you step back and think about it, there’s a lot of larger, bigger-picture initiatives that observability can contribute to in an organization, like digital transformation. 

I know that’s a buzzword, but that is a legit thing that a lot of CTOs are out there thinking about. How do we get out of the tech debt world and how do we get into cloud native? Maybe it’s developer efficiency. There’s a lot of people talking about developer efficiency at KubeCon, where it was one of the big, big topics. To me, we’ve put observability in a smaller box and it needs to bust out. I see this also in our customer base. We have customers like DoorDash that use observability not just to look at their infrastructure and their applications, but also to look at their business. At any given minute, they know how many dashers are on the road and how many orders are being placed, down to the second. And they can use that to make decisions.

Corey: This is one of those things that I always found a little strange coming from the world of running systems in large environments to fixing AWS bills. There’s nothing that even resembles a fast reactive response in the world of AWS billing. You wind up with a runaway bill, they’re gonna resolve that over a period of weeks [during] business hours. 

If you wind up spinning something up that creates a whole bunch of very expensive drivers behind your bill, it’s gonna take three days in most cases before that starts showing up anywhere that you can reasonably expect to get at it. The idea of near real time is a lie. Unless you wanna start instrumenting everything that you’re doing to trap the calls and then run cost extrapolation from there. That’s hard to do. 

Observability is a very different story, where latencies start to matter, where being able to get leading indicators of certain events, be they technical or business, starts to be very important. But it seems like it’s so hard to wind up getting there from where most people are. I know we like to talk dismissively about the past, but let’s face it, conference-ware is the stuff we’re the proudest of. The reality is the burning dumpster of regret in our data centers that still also drives giant piles of revenue. So you can’t turn it off, nor would you want to, but you feel bad about it as a result. It just feels like it’s such a big leap.

Rachel: It is a big leap. I think the very first step, I would say, is trying to get to this point of clarity and being honest with yourself about where you’re at and where you wanna be. And sometimes not making a choice is a choice. So sticking with the status quo is making a choice. As we get into things like the holiday season right now, I know there’s gonna be people that are on call 24/7 during the holidays, potentially to keep something that’s just duct-taped together barely up and running. You’re making a choice to do that. I think that the first step is at least acknowledging where you’re at and where you wanna be, and if you’re not gonna make a change, just understanding the cost and being realistic about it.

Corey: Yeah, being realistic I think is one of the hardest challenges, because it’s easy to wind up going for the aspirational story of the future when everything’s great. I appreciate that you need to plant that flag on a hill somewhere. [But] what’s the next step? What can we get done by the end of this week that materially improves us from where we started the week? And I think that with the aspirational conference-ware stories, it’s hard to break that down into things that are actionable and that don’t feel like they’re going to be an interminable slog across your entire existing environment.

Rachel: No, I get it. And for things like, you know, instrumenting and adding tracing and adding OTel, a lot of the time the return that you get on that investment is not quite like “I put a dollar in, I get a dollar out.” I mean, with something like tracing, you can’t get to 60% instrumentation and get 60% of the value. You need to be able to get to like 80, 90%, and then you’ll get a huge amount of value. So it’s sort of like you’re trudging up this hill, you’re trudging up this hill, and then finally you get to the plateau and it’s beautiful, but that hill is steep and it’s long and it’s not pretty. I don’t know what to say other than there’s a plateau near the top, and there’s companies that do this well and really get a ton of value out of it. That’s the dream: we wanna help customers get up that hill. But I’m not going to lie; the hill can be steep. 

Corey: One thing that I find interesting is there’s almost a bimodal distribution in companies that I talk to. Chronosphere is a good example of this. Presumably you have a cloud bill somewhere and the majority of your cloud spend will be on what amounts to a single application, probably in your case called, I don’t know, Chronosphere, it shares the name of the company. The other side of that distribution is the large enterprise conglomerates where they’re spending, I don’t know, $400 million a year on cloud, but their largest workload is 3 million bucks and it’s just a very long tail of a whole bunch of different workloads, applications, and teams. 

Here’s what I’m curious about from the Chronosphere product perspective. It feels easier to instrument a Chronosphere-like company that has a primary workload that is the massive driver of most things, and get that instrumented and start getting an observability story around it, [rather] than it does to go to a giant company and [have] 1,500 teams that are all going in different directions all implement this thing. How do you see it playing out among your customer base, if that bimodal distribution holds up in your world?

Rachel: It does and it doesn’t. So first of all, for a lot of our customers, we often start with metrics, and starting with metrics means Prometheus, and Prometheus has hundreds of exporters. It is basically built into Kubernetes. So if you’re running Kubernetes, getting Prometheus metrics out, it’s actually not a very big lift. If we start with Prometheus, we start with getting metrics in and we can get a lot. We have a lot of customers that use this just for metrics and they get a massive amount of value, but then once they’re ready they can start instrumenting OTel and start getting traces in as well. In larger organizations it does tend to be one team, one application, one service, one department that kind of goes at it and gets all that instrumented. But I’ve even seen very large organizations when they get their act together and decide like, no, we’re doing this, they can get OTel instrumented fairly quickly.

It’s just like lining [things] up. It’s more of a people issue than a technical issue a lot of the time. Like getting everyone lined up and making sure that [everyone agrees and they’re all] on board. But it’s usually a “start small” [project] and doesn’t have to be all or nothing. We also just recently added the ability to ingest events, which is actually a really beautiful thing and it’s very, very straightforward. 

It basically is just, we connect to your other existing DevOps tools, whether it’s a Buildkite or a GitHub or a LaunchDarkly, and anytime something happens in one of those tools, it gets registered as an event in Chronosphere, and then we overlay those events over your alerts. 

So when an alert fires, the first thing I do is go look at the alert page, and it says, “Hey, someone did a deploy five minutes ago,” or “There was a feature flag flipped three minutes ago.” I solved the problem right then. I don’t think there’s an all-or-nothing nature to any of this stuff. Yes, tracing is one of those things where you have to make a lot of investment before you get a big reward, but that’s not the case in all areas of observability.
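The "overlay events on alerts" idea can be sketched in a few lines: given a list of change events and the time an alert fired, list the changes that landed shortly beforehand. This is only an illustration of the concept. The `ChangeEvent` type, the 15-minute window, and the sample data are all assumptions for the example, not how Chronosphere implements the feature.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    source: str    # hypothetical category, e.g. "deploy" or "feature-flag"
    summary: str
    at: datetime


def events_before_alert(events, alert_time, window=timedelta(minutes=15)):
    """Return events that happened within `window` before the alert, newest first."""
    recent = [e for e in events if alert_time - window <= e.at <= alert_time]
    return sorted(recent, key=lambda e: e.at, reverse=True)


# Made-up sample data for illustration.
events = [
    ChangeEvent("deploy", "checkout-service deployed", datetime(2023, 12, 1, 9, 55)),
    ChangeEvent("feature-flag", "new-cart flag flipped", datetime(2023, 12, 1, 9, 57)),
    ChangeEvent("deploy", "search-service deployed", datetime(2023, 12, 1, 7, 0)),
]
alert_time = datetime(2023, 12, 1, 10, 0)

suspects = events_before_alert(events, alert_time)
for e in suspects:
    minutes_ago = int((alert_time - e.at).total_seconds() // 60)
    print(f"{e.summary} ({e.source}, {minutes_ago} min before the alert)")
```

The deploy from three hours earlier is filtered out, so whoever is on call sees only the two recent changes, which is the "someone did a deploy five minutes ago" experience described above.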

Accelerating observability benefits with Chronosphere

Corey: Yeah, I would agree. Do you find that there’s a significant easy early win when customers start adopting Chronosphere? Because one of the problems that I’ve found, especially with things that are holistic, is as you said about tracing: you need to get to a certain point of coverage before you see value. But human psychology being what it is, you kind of wanna be able to demonstrate value early. The mean time to dopamine needs to come down, to borrow an old phrase. Do you find that some of the easy wins start to help people see the light? Because otherwise it just feels like a whole bunch of work for no discernible benefit to them.

Rachel: For the customer base, one of the areas where we’re seeing a lot of traction this year is in optimizing the cost, coming back to the cost story of their overall observability bill. So we have this concept of the Control Plane in our product, where all the data that we ingest hits the Control Plane. At that point, the customer can look at the data, analyze it and decide this is useful, this is not useful. And actually they don’t just decide that; we show them what’s useful, what’s not useful, what’s being used, what’s high in cardinality, [or] high in cost but [underused]. Then we can make decisions around aggregating it, dropping it, combining it, downsampling it, doing all sorts of fancy things. We can do this on the trace side, both head-based and tail-based.

On the metrics side, it happens as the data hits the Control Plane and then streams out, and they only pay for the data that we store. So typically customers come on board and immediately reduce their observability data set by 60%. That’s just straight up; that’s the average. I’ve seen some customers get really aggressive, get up to like [90%], where they realize, “We’re only using 10% of this data, let’s get rid of the rest of it. We’re not gonna pay for it.” So paying a lot less helps in a lot of ways. 

It also helps companies get more coverage of their observability and their overall stack. So I was talking recently with an autonomous vehicle driving company that recently came to us from the dog, and they had made some really tough choices and were no longer monitoring their pre-prod environments at all because they just couldn’t afford to do it anymore. It’s like, well, now they can, and we’re still saving them money.
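For readers who have not seen why aggregation shrinks a metrics bill, here is a small sketch of one of the techniques Rachel lists: aggregating away a high-cardinality label before storage. The `drop_label` helper, the label names, and the sample values are invented for illustration, and the 50% figure below comes only from this toy data, not from any customer result.

```python
from collections import defaultdict


def drop_label(series, label):
    """Sum series values after removing one label, merging series that become identical."""
    totals = defaultdict(float)
    for labels, value in series:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k != label))
        totals[kept] += value
    return dict(totals)


# Hypothetical request counts, one series per pod.
raw = [
    ({"service": "checkout", "pod": "checkout-7f9c"}, 12.0),
    ({"service": "checkout", "pod": "checkout-a1b2"}, 8.0),
    ({"service": "search", "pod": "search-33de"}, 5.0),
    ({"service": "search", "pod": "search-90ab"}, 7.0),
]

aggregated = drop_label(raw, "pod")
reduction = 1 - len(aggregated) / len(raw)

print(aggregated)
print(f"series reduced by {reduction:.0%}")
```

Four per-pod series collapse into two per-service series while the totals are preserved. In a real cluster with thousands of short-lived pods, the same move removes far more series, which is the kind of saving that lets a team afford to monitor environments it had previously cut.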

Corey: I think that there’s also the downstream effect of the money saving too. For example, I don’t fix observability bills directly, but why is your CloudWatch bill through the roof for data egress charges? In some cases it’s because your observability vendor is pounding the crap out of those endpoints and pulling all your log data across the internet. That tends to mean it’s not just the first-order effect; it’s the second-, third-, and fourth-order effects it winds up having. It becomes almost a holistic challenge.

Rachel: Yeah, I would agree with that. I think that just looking at the bill from your vendor is one very small piece of the overall cost you’re incurring, I mean all of the things you mentioned, the egress, the CloudWatch, the other services it’s impacting. What about the people?

Corey: Yeah, it sure is great that your team works for free.

Rachel: It makes me think a little bit about that viral story about that particular company with a certain vendor that had a $65 million per year observability bill. And that impacted not just them, but it showed up in both vendors’ financial filings. Like, how did you get there? How did you get to that point? I think this all comes back to the value in the ROI equation. We can all sit in our armchairs and be like, “Well, that was dumb,” but I know there are very smart people there that just got into a bad situation by kicking the can down the road and not thinking about the strategy.

Corey: Absolutely. I really wanna thank you for taking the time to speak with me about, I guess the bigger picture questions rather than nuts and bolts of a product. 

Rachel: Thank you, Corey. Always fun.

