Eric: Hello, my name is Eric Schabell, I work for Chronosphere and today we’re going to talk about optimizing observability spend with a slight focus on metrics. This is a DataOps Day 2023 talk. Let’s get to it.
So, in the beginning when we talked about observability and what was going on with monitoring applications and how the cloud got super expanded as it is now, a lot of this was in monolithic applications, VM environments. You saw something like this, where you had a pretty good idea of your environment, you had the sizing scoped out, you knew exactly the data you were going to get, relatively speaking, and you had this sort of oasis of monitoring and observability that you were able to get control of.
The experience of having to carry a pager around was not so awful in those days. The general consensus, from talking to different people, is that we had this stuff under control. Then you get to the cloud native environments, moving your company into the cloud and scaling into the cloud where things happen in a dynamic fashion.
We see that the amount of data that is generated tends to feel a lot like this picture. While this seems awfully extreme, it’s not far off, unfortunately. I could definitely share some stories of enterprises that exploded during the pandemic, right? Their business model just jacked right up into unforeseen amounts of data in their production environments.
One of the things that is happening, and you see a lot of it right now, is that it’s quite common to run into environments where people are struggling massively with the amount of observability data that’s being generated. And they find out that the cost of that data is surpassing the cost of their production environment.
If you’re into production, we’re doing infrastructure and stuff to make our customers happy, and to make our business model work. Then, all of a sudden all the observability data and input and organizational stuff is just surpassing that, and that’s something that’s making people scratch their heads like: “full stop, wait a minute. What’s going on with this stuff?” And just to give you a real quick idea of how this is possible and why this kind of stuff is happening: If you go and look around on the internet a little bit, you’ll find lots of examples of some kind of baseline cloud native applications. So, we’re talking about a Kubernetes cluster, something that auto scales, something that’s been set up on something as simple as this, a little “Hello World” web app, some frontend code to ping it and act like there’s actual users hitting it.
And they did tracing, end user metrics, logs – stuff you see right there. Tracing and collecting all this and gathering the metrics observability data for 30 days resulted in almost a half a year by the end. That’s just one application. When you look at this at large scale, and imagine this going across multiple applications, and you consider that the average organization stores this data for 13 months, you can all of a sudden see that the price points are going to be just crazy compared to what they were before.
There’s a link in the speaker notes, and these slides will be online later, to trace down how this exactly works. I love this slide because if you have to ask, you can’t afford it. We wouldn’t care what it costs. Organizations wouldn’t care if they had better outcomes, if their customers were happy, and if they were able to triage incidents quicker.
But, that is not the case. It’s spiraling out of control: they’re being flooded with this data and having to pay more for it, because with most of the vendors out there you’re paying for the storage aspect more so than the ingestion side. And it’s just a lot of useless data that’s getting in the way. More is not better. It just slows down the loading of your dashboards, it slows down the tooling you’re using, and it makes it really hard for your people to triage. What you see here is an observability report from 2023, for which we gathered data from 500 different engineers out there working. One of the most shocking findings is something people don’t think about. You have the data costs, the aspects of the business that you notice climbing in your bills.
What you may not be thinking about, is what it’s doing to your resources, your engineers, on-call staff. If you see that 10 hours a week, on average, is being spent triaging and understanding incidents, that’s a quarter of a 40 hour work week. That’s 25 percent of your time spent doing stuff you don’t really want to be doing, right?
And if you’re a developer, and a DevOps team doing this kind of stuff, you’re going to start scratching your head and wondering why you’re actually here. A lot of this leads to statistics that people just don’t want to work there anymore. And if you can’t retain your resources, this is going to become a massive problem, too.
On top of that, you’re going to end up with burnout. It’s just that simple. So, knowing the cost of your observability and metrics data – it’s a combination of the flood of data that’s coming in, not being able to tackle what’s happening with the massive amounts that are coming in, what you can do with that, and also not being able to take care of your resources because they’re being overworked and spending too much time messing around with stuff.
So, who’s responsible for that is what we’re looking at here. Is that developer making decisions that you see here saying: “Hey, I’m going to use this tool,” or “Hey, I’m going to do this, or I’m going to put this new metric in my tracing code?” “How am I going to instrument my applications, who’s making these decisions, and who owns those?”
That’s becoming the problem when you see that cost point going above your production. You don’t know who made those decisions, you don’t know what happened. So, the prediction you saw was that by 2023, something like 80 percent of these organizations would have a dedicated FinOps [team].
We’re pretty much at that point now where you see the FinOps Foundation, you see a lot of this stuff being discussed. There are a lot of references to looking for a way to put processes and phases and various tooling in place to have somebody that can track this down with their cloud usage, be able to keep an eye on and manage quota and budgets, resources that they have in their files. This is a big deal. And that brings me to something that we want to talk about for the rest of this session. Together with a bunch of our customers, we’ve come up with something that is a little bit recognizable from the FinOps Foundation and how the phases and stuff work.
But, you see that observability data optimization cycle is something that you can use to put in place in an organization, to kind of understand the data value and usage. You can use it to shape and transform your data. You can have a centralized governance around this. And then, you can continuously adjust for efficiency.
It gives you a life cycle around what you’re doing with your cloud data. In this case, we’re going to zoom in on this specific storyline and go through the cycle, with an eye on just metrics. Also, be aware that we’re going to look at the governance, analyze, refine, and operate phases.
It’s a rather short session we have today, so I will only dive deep into the analyze [portion] with a few concrete examples of what that looks like. So, if you have interest in more stuff, there are links at the end, and you can get in touch. We can show you some more stuff around governance, refine, and operate.
The first thing here is the governance – where it starts. You see that centralized governance is something that you need to have in place to give your teams ownership and control of their metrics and to understand what is happening with the cardinality, the amount of growth that they’re causing when they’re deploying new applications or putting new metrics in place for observability. You see that there are a couple of things in place here. One of them is quotas and the other is priorities. Quotas give you the ability to basically allocate your usage across different teams. It’s basically giving them a check. It’s not blank anymore. It’s been signed and gives them a certain amount to spend on whatever they’re doing.
You also see that you can prioritize which data is impacted, giving you the ability to govern how you’re going to be using your cloud data and to put some sort of guardrails in place for if something happens – like a cardinality explosion, where a development cycle releases something in an application that just spikes it out of control.
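The quota and priority idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Chronosphere's implementation: the `Stream` type, the `enforce_quota` function, the priority numbering, and the stream names are all hypothetical. It assumes a team has a fixed budget of data points per second (DPPS) and that lower-priority data is shed first when the budget is exceeded.

```python
from dataclasses import dataclass


@dataclass
class Stream:
    name: str      # hypothetical metric stream name
    priority: int  # assumption: lower number = more important
    dpps: int      # data points per second the stream produces


def enforce_quota(streams, quota_dpps):
    """Keep the most important streams that fit within a team's DPPS quota."""
    kept, dropped, used = [], [], 0
    # Visit streams from most to least important (sorted() is stable,
    # so streams with equal priority keep their input order).
    for s in sorted(streams, key=lambda s: s.priority):
        if used + s.dpps <= quota_dpps:
            kept.append(s.name)
            used += s.dpps
        else:
            # A stream that does not fit in the remaining budget is shed.
            dropped.append(s.name)
    return kept, dropped, used


if __name__ == "__main__":
    team_streams = [
        Stream("checkout_latency", priority=1, dpps=4000),
        Stream("debug_counters", priority=3, dpps=9000),
        Stream("http_requests", priority=2, dpps=5000),
    ]
    print(enforce_quota(team_streams, quota_dpps=10000))
```

With a guardrail like this, a cardinality spike in a low-priority stream uses up only that team's budget and sheds the least important data first, rather than affecting everyone.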
There is tooling behind this, but we’re not going to cover that in great detail. We’re going to go to the next one here with analyze. This is probably one of the most important phases, which is why I want to spend a little bit of time on the basics, to dive down into it. It’s about being able to understand the value of observability data to identify what’s useful and what is waste.
Straight up, this just means you have a lot of data being ingested into your organization, into your observability pipeline. That’s what everybody has to pay for. But what comes out the back end into your storage is another aspect you have to pay for. What you want to have back there, what you store there, is what you’re also querying and using in your dashboards, your ad hoc queries, and your alerts.
So, not overpopulating that with useless data that has no impact on any of these alerts, dashboards, ad hoc queries, or users touching it is greatly to your advantage, not only in the price point, but also in the performance of your dashboards and the ability of your on-call crews to handle the data and find the things they want to find.
You see here, we have a metrics traffic analyzer that’s watching the ingested data live, so it lets you explore that in real time and see what’s going on. We have something called a metric usage analyzer, which goes through everything that’s coming in and shows you what’s the least used.
Then, you have a trace analyzer, which we will not dive into. It’s the same idea: covering traces, spans, and analysis. So, let’s take a quick look at the traffic analyzer you see here that shows the incoming stuff live, gives you a chance to break down the biggest and the smallest contributors, metric names, labels, applications.
It is a direct way to troubleshoot cardinality and to see immediately when something goes a little bit off. Go a little bit closer here, and you see that you have a live view, or you can actually pause it if you want to. You view the traffic before it’s stored and make decisions about the traffic: how you want to shape that, aggregate some stuff away, or drop it before you actually have to pay for it. You can also break it down by labels or by metrics; this instance label is on 100 percent of these metrics and has 62 unique values. You can select that instance label and look down into the values of what’s going on there, and see that some of them are not so interesting.
Take action. If we go to the usage analyzer, this comes after the live data, and now it sorts the stuff from least important to most important. What it’s looking at is what kind of value this metric delivers. By default, it’s going to show you the ones that are delivering the least.
See here, in this case, you’re looking at one that has zero references. It hasn’t been directly queried. It hasn’t been used in any kind of alert or dashboard. It has a utility score of zero, and it has 24,000 data points per second. It’s quite a lot of data to be storing for no use whatsoever. You see here that you can sort it. By default, it’s sorted by least utilized. Find something interesting that you want to look at, click on usage details, down in the “more”, and you can see who’s using it or what is using it. So, it’s either in a dashboard, monitor, recording rule, drop, or aggregator. Is it in a dashboard? Which users have looked at this thing?
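The label breakdown just described (what share of series carry a label, and how many unique values it has) is easy to sketch. The Python below is a minimal illustration, not the traffic analyzer itself; the function name `label_breakdown` and the representation of each series as a plain dict of labels are assumptions made for the example.

```python
from collections import defaultdict


def label_breakdown(series):
    """For each label: percentage of series carrying it and its unique value count.

    `series` is a list of label dicts, one dict per time series (an assumed,
    simplified representation of incoming traffic).
    """
    values = defaultdict(set)    # label -> set of distinct values seen
    carriers = defaultdict(int)  # label -> number of series that carry it
    for labels in series:
        for key, value in labels.items():
            values[key].add(value)
            carriers[key] += 1
    total = len(series)
    return {
        key: {
            "coverage_pct": 100.0 * carriers[key] / total,
            "unique_values": len(values[key]),
        }
        for key in values
    }


if __name__ == "__main__":
    incoming = [
        {"__name__": "http_requests", "instance": "a"},
        {"__name__": "http_requests", "instance": "b"},
        {"__name__": "cpu_seconds", "instance": "a", "zone": "x"},
    ]
    for label, stats in label_breakdown(incoming).items():
        print(label, stats)
```

A label that sits on every series and has many unique values, like the instance label in the talk, is a natural candidate to aggregate away or drop.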
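The "least utilized first" view can be sketched as a simple ranking. The talk does not say how the real utility score is computed, so the `utility_score` formula below is purely a placeholder assumption, as are the `MetricUsage` fields and the metric names. The sketch only shows ordering metrics by low usage and, among ties, by high data volume.

```python
from dataclasses import dataclass


@dataclass
class MetricUsage:
    name: str
    dpps: int              # data points per second
    references: int        # dashboards, monitors, rules that mention the metric
    query_executions: int  # direct (ad hoc) query executions in the time range


def utility_score(m):
    # Placeholder formula (assumption): references count more than ad hoc queries.
    return m.references * 10 + m.query_executions


def least_utilized_first(metrics):
    # Lowest utility first; among equal scores, the most expensive metric first.
    return sorted(metrics, key=lambda m: (utility_score(m), -m.dpps))


if __name__ == "__main__":
    usage = [
        MetricUsage("queue_depth", dpps=500, references=3, query_executions=120),
        MetricUsage("legacy_histogram", dpps=24000, references=0, query_executions=0),
        MetricUsage("tmp_gauge", dpps=100, references=0, query_executions=0),
    ]
    for m in least_utilized_first(usage):
        print(m.name, utility_score(m), m.dpps)
```

A metric with a score of zero and 24,000 data points per second lands at the top of such a list, which is the kind of candidate the talk points at.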
You also have the ability to select a time range. You can see the last 30, 40 days. You’re able to put this a little bit into context. So, they have a utility score. For the data points per second, you can see utility scores based on utilization … Let’s take a scenario here where we dig down into one of these metrics and we see that it has a really low utility score so it’s not being utilized a lot. But it has a lot of data points – which is kind of off, right?
How is this being used? We click on the usage details up on the right there, take a closer look at it. You see here, it’s not really being used anywhere, it’s not in any dashboards, monitors, rules, it’s being ad hoc explored by two unique users. That’s weird.
Look at what those users are doing. Oh, we know these two guys. So, here are two of our top site reliability engineers (SREs). These guys must be using it for a reason, because they’ve looked at it more than once, obviously. We see 800 executions on the one by Joshua, or Joseph. Maybe others might want to be using that as well.
So, we found an underutilized metric that two guys think is super important. Maybe we should give them a call and find out what’s going on with that. See if maybe we don’t need that.
If we move to the refining section here, you can see that the refine phase has to do with being able to take action without touching the source code or redeploying. So, being able to analyze and find these various problems is fantastic, but we don’t want to have to go out and redeploy or reach out to a development team to take care of the problem.
That will happen eventually maybe, but initially we want your on-call teams to be able to take action right there and right now, because your budgets and costs are going through the roof when large cardinality spikes happen. You need the ability to take action on the ingested metrics right there, before they’re stored at the back end, without touching the source code and redeploying.
We give you the options here to aggregate or downsample your data, removing high cardinality labels or dropping non-valuable data. This is at the ingest again. So, this will all help you reduce costs and improve performance without any kind of alert query.
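Aggregating away a high cardinality label at ingest can be illustrated with a small Python sketch. This is not the product's shaping rule engine; the function name `drop_label_and_sum`, the sample layout, and the choice of summing are assumptions for the example. Summing suits counter-style metrics, and other metric types may need a different aggregation.

```python
from collections import defaultdict


def drop_label_and_sum(samples, label):
    """Sum values across all series that differ only in `label`.

    `samples` is a list of (labels_dict, value) pairs for a single timestamp
    (an assumed, simplified layout).
    """
    totals = defaultdict(float)
    for labels, value in samples:
        # Group by every remaining label; sorting makes the key order-independent.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != label))
        totals[key] += value
    return [(dict(key), total) for key, total in totals.items()]


if __name__ == "__main__":
    ingested = [
        ({"__name__": "http_requests", "service": "web", "instance": "a"}, 5.0),
        ({"__name__": "http_requests", "service": "web", "instance": "b"}, 7.0),
        ({"__name__": "http_requests", "service": "api", "instance": "a"}, 1.0),
    ]
    for labels, value in drop_label_and_sum(ingested, "instance"):
        print(labels, value)
```

Here three incoming series collapse to two stored series. With 62 instance values per service, the same rule would store one series per service instead of 62.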
Then, we get to the “operate” section here. It has a built in capability to ensure that your queries are performing – [and] require no user intervention. The ability to write queries and to query your data is fantastic. But that’s not always an easy thing to do, so there’s something called a query accelerator that optimizes that automatically to ensure that while you’re querying to fill dashboards, to show data and time series samples, or to investigate problems, you don’t have to manually optimize that stuff. It will keep your stuff performing automatically – a pretty neat feature. There’s also a query scheduler that ensures that you’re sharing the resources while you’re doing these queries, so that no single user can crowd out the rest.
Then, there’s a shaping rules UI. This is helping you understand some of the refining and shaping that you’ve been doing with your metrics as they came in, and you decided to aggregate or whatever.
And that brings us to the “why are we optimizing the observability spend? What are we looking at here?” In a recent study by ESG, 69 percent of these companies are concerned with the rate of their data growth. When you’re able to control and optimize your data, you’re expanding the visibility and coverage that you have across your infrastructure, and you’re also increasing the instrumentation of customer experience.
So, the focus goes from technology, to instrumenting your customer experience, which is what we’re all here for anyway, right? Trying to get the business outcomes that we want. And then, freeing up your observability time, the team’s time to tackle real projects and strategic stuff, instead of doing those 10 hours a week on troubleshooting issues is all just a good example of what you’re trying to get rid of. The need is real – the need to understand and have the ability to tackle some of this stuff with that big flood of data, just so much of it is just varying bits.
Again, as I said, we developed this whole framework with our customers, so here are some examples of how it’s impacted them directly. You can see that at Snap, they were able to reduce their data volume by 50%, and reduce on-call pages by 90%. Abnormal Security has that 80, or rather 98%, data volume reduction, eight times as fast.
And you see Robinhood, the financial trading application. They have a data volume reduction of 80%. Related improvements, eight times as fast. You see that a lot of this has to do with the fact that there’s so much data and so many systems flooding in. It’s hard to imagine those numbers, 50, 80, 98%, but these are the kind of things that are being automatically generated that mean nothing to your business.
In this case, in Robinhood, that’s pretty extreme. Not normal, that’s really extreme. But, it’s how it goes. And here’s what Robinhood said – one of the senior staff engineers at Chronosphere got all this data. Proof of liability and performance, but we’re also saving millions of dollars. I can imagine we can reduce your observability data by more than 80%.
I’d like to share a few links here. These slides will be online at schabell.org. They went live before this started. If you go over there, you should be able to find them – resources, case studies, and a link where you can see some of the other aspects of that life cycle.