Cloud native environments emit so much observability data that it’s hard for teams to figure out what to keep and what to ditch. Central observability teams are frequently battling costs, and developers aren’t sure what data to hold on to.
How can organizations find harmony between teams while meeting their business needs? In this Chronosphere demo, hear from our Senior Product Marketer Scott Kelly and Sales Engineer Simon Donahue on how Chronosphere’s Control Plane helps teams analyze and understand the value of their observability data, and shape that data to maximize the value it delivers.
If you don’t have time to sit down for the whole on-demand demo, catch up on the key points below.
Analyze and shape data
Scott: Chronosphere is a single-tenant SaaS observability platform that’s purpose-built for cloud native. It’s 100% open source compatible, and is built on a highly scalable data store, M3DB. Our founders came from Uber, where they built M3DB as the company’s monitoring platform.
At the center of our platform is the Control Plane, which allows you to do things like analyze and shape your data. That’s where we’re going to spend our time today: a deep dive into how the Control Plane helps you analyze and shape your data, and how it supports a data-driven approach to understanding the value your metrics deliver within your organization.
I think this is a long-standing problem, right? Everybody knows that there’s monitoring data that is not very useful. The problem is that it’s difficult to identify, and it’s time consuming. Costs keep rising, and cloud native makes data volumes grow exponentially, so it’s really hard to do that work. The central observability team is trying to control costs. The developers are trying to do other things. [Developers] think all their data is important, and they don’t really have the time to go look at it, or to spend time on reduction projects.
What we’ll show you in the demo today is how to start eliminating waste by identifying unused metrics. That’s the starting point of all this, right? It’s also about understanding the cost versus utility of each metric, and how to make smarter [data] shaping decisions.
There’s some nuance to this, and Chronosphere provides capabilities that help. We’re also going to show you how this reduces the need to make trade-offs. If you can reduce your volumes and optimize your data, then you can start to have deeper visibility into things.
We’ll wrap up with how this gives you a data-driven case for justifying your observability spend. You can quantify the cost and the value your metrics are delivering, and use that information to understand the value of the data within your observability tool.
Creating harmony between teams
Simon: Observability teams and developers have this kind of relationship where one side says: “I need to understand what my application is doing,” and the other side says: “I need to also have control over the data that I’m ingesting, and how much this is costing me.”
This is hard to maintain because there’s not a lot of transparency around where data is coming from, what data is valuable, and what changes should be made to keep people happy: giving developers all of their data while keeping costs low and performance high.
Chronosphere has tools in place that let everybody be in harmony without having to firefight, play traffic cop, and figure out: “Where can I cut things back, [and] who’s producing what?” The dashboard on the screen here is an example of this transparency. It’s an out-of-the-box dashboard that we call our usage dashboard.
What this does is break down metrics by whether they’re being ingested, dropped by a policy, actually persisted, or dropped by a rate limit. [It shows] what you start with and what you end up storing.
This is important because at Chronosphere, we care about the persisted part of the data; that’s what we charge for. Ingestion, we welcome. But because we give you all these tools, we want you to be able to justify the decisions you’re making, like Scott alluded to.
You can take information like this and break it down however you’d like on a per-label basis (we just took the application name here). You can also understand, per service, team, application, or whatever the case is, how much is being ingested, persisted, dropped, and limited. This is a good way to get a sense of where your spend is coming from in your observability solution.
On top of this, we have ways to be proactive about making sure that you’re not going past that limit, and that you’re somewhat future-proofing yourself against certain dimensions of cardinality, or against accepting metrics that aren’t within a policy you can put in place.
Using Quotas to drive accountability
Scott: A good example of this is what we call quotas. Quotas look at the same data, but they let you put a barrier up that says: “Hey, [for] metrics coming from a certain source, you get a certain amount of persisted metric writes that you’re allowed to use.”
It’s not meant to be a punishment. It’s meant to be proactive about understanding: “Am I getting close to the capacity that I have?” and to delegate some responsibility back to the developers. The tools we’re going to show in a second let you dig in and ask: “Why is a quota starting to be breached? Are we getting to our threshold?” They could be used by either team; it’s up to you to determine who might have some responsibility in this.
To get into that part of the demo, we’re looking at this quotas dashboard. Quota usage is itself a metric that we’re tracking, so we can alert on it. If somebody is starting to get close to their quota limit, you can send them information that says: “You’re getting close, here’s a runbook.” Or you can use some tools to analyze what you’re producing, and then use that information to justify putting rules, aggregation, and policies in place to tone down what you’re actually storing.
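The alerting flow described here can be sketched as a plain Prometheus-style alert rule. This is an illustration only: the metric names, the 85% threshold, and the runbook URL are all hypothetical, not Chronosphere’s actual quota metrics or rule syntax.

```yaml
groups:
  - name: quota-alerts
    rules:
      - alert: QuotaNearLimit
        # Hypothetical metric names; substitute whatever quota-usage
        # metrics your platform actually exposes.
        expr: quota_persisted_writes / quota_limit > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service }} is above 85% of its persisted-writes quota"
          runbook_url: https://example.internal/runbooks/quota  # placeholder
```

Because quota usage is just another metric, the same rule machinery you already use for service alerts can notify a team before their quota is breached.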
Real-time analysis using the Metrics Traffic Analyzer
Simon: We’ll take the ordering service as our example. I’m responsible for the ordering service as a developer, and I get an alert. Either I can go through and try to figure out what’s going on, or perhaps the responsibility is on the SRE team; it’s however your organization decides to do it. Within Chronosphere, there are tools to help us figure out why this quota is being crossed. This brings us to our Metrics Traffic Analyzer. You’ll notice there’s already a filter here for the ordering service, the one that crossed the threshold. If I push Play, this shows us a real-time streaming view of all metrics being ingested by Chronosphere but not yet persisted, which is key.
What this means is that you can use this to get insight into what’s going on, and then use that to write rules that are accurate. On the left-hand side are the labels or tags, whatever syntax you tend to use. We show the unique values of all of them as well, so you can get a sense of the cardinality and how often they appear in the metrics, which are listed with their names and the data points per second (both average and current) that we’re ingesting.
If you’re doing general cost accounting, this is really nice. It’s also a good way to get a hint about where some of this [data] ingest is coming from, like we see in this instance label. If you’re familiar with Prometheus, you’ll probably recognize this one. Then the questions are: “How do we figure out if this is actually being used?” [and] “Can we justify this label having such high cardinality and keep it, or can we simply make some policies to get rid of it?” Which is where our Metrics Usage Analyzer comes in.
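The per-label cardinality accounting described above can be sketched in a few lines of Python: given samples in the Prometheus exposition format, count the unique values seen for each label. The sample data and the parsing helper are invented for illustration; this is not how Chronosphere’s analyzer is implemented.

```python
import re
from collections import defaultdict

# Invented sample scrape output in Prometheus exposition format.
SAMPLES = """\
http_requests_total{service="ordering",instance="10.0.0.1:9090",code="200"} 42
http_requests_total{service="ordering",instance="10.0.0.2:9090",code="200"} 17
http_requests_total{service="ordering",instance="10.0.0.3:9090",code="500"} 3
"""

LINE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+\S+$')

def label_cardinality(text):
    """Return {label: set of unique values} across all samples."""
    values = defaultdict(set)
    for line in text.strip().splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        for pair in m.group(2).split(","):
            key, _, val = pair.partition("=")
            values[key].add(val.strip('"'))
    return values

card = label_cardinality(SAMPLES)
# "instance" is the high-cardinality label here: one value per target.
print({k: len(v) for k, v in sorted(card.items())})
```

Running this on the sample data shows `instance` contributing three unique values while `service` contributes one, which is exactly the kind of signal that points you at a label worth aggregating away.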
Understanding cost and value using the Metrics Usage Analyzer
The Metrics Usage Analyzer can look at the usage of your metrics in two different ways: by metric and by label. We’ll look at both, but to get into what’s going on here: we’re looking at everybody in the tenant who is querying the metrics or using them on dashboards, and breaking this down in a couple of ways to make it easier to surface this information.
I could come into [the Metrics Usage Analyzer], see my most utilized metrics, and use that as a starting point. On the other screen, I’m going to pick one of the metrics that belongs to that service to keep it easy. One of those was our service’s HTTP request time total; we’ll use this as our example metric here.
I can see some details in the summary at the top, including where this metric is being configured. [Then I can ask:] Are there already rules being applied to it? Are people executing queries against it directly in our query explorer? Then we score it based on those two variables, on top of how many people are actually using it. This is a really good high-level way to see: “Is this actually useful? Are people using this at all?”
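The scoring idea described here can be sketched as a toy utility score: weight dashboard usage and direct queries, scaled by how many people touch the metric. The weighting, the data class, and the example numbers are all invented for illustration; Chronosphere’s actual scoring model is its own.

```python
from dataclasses import dataclass

@dataclass
class MetricUsage:
    name: str
    dashboards: int      # dashboards the metric is configured on
    direct_queries: int  # ad-hoc queries in the query explorer
    users: int           # distinct people using it

def utility_score(m: MetricUsage) -> float:
    # Invented weighting: dashboards count double, scaled by audience size.
    return (2 * m.dashboards + m.direct_queries) * max(m.users, 1)

metrics = [
    MetricUsage("http_request_time_total", dashboards=4, direct_queries=10, users=6),
    MetricUsage("debug_cache_evictions", dashboards=0, direct_queries=0, users=0),
]
for m in sorted(metrics, key=utility_score, reverse=True):
    print(m.name, utility_score(m))
```

A metric that scores zero here is a candidate for a drop or aggregation policy; a high scorer is one whose cardinality you can justify keeping.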
Continue this part of the conversation at minute 11:57
Cutting costs with guardrails
Simon: I wanted to give an example of what [this feature] can do for performance. With Chronosphere, a lot of this is about cutting costs; like I mentioned earlier, we charge for persisted data. But this has a secondary benefit, which is performance. I’m going to [show] a cluster health dashboard here.
This is meant to be representative of how aggregation can impact your dashboard performance, if we consider these two columns. The left-hand column is looking at an aggregated version of the time series. The right-hand side is looking at the raw version, which still retains the instance label, among other things. You might have noticed that the left-hand side loads much faster than the right-hand side.
Just to demonstrate: if we look at two days’ worth of data, you’ll see the left-hand side took maybe half a second, while the right-hand side will keep spinning until it finally loads on the screen.
But we’re looking at the exact same time series and the exact same data on this page. This is one of the benefits of being able to delegate this responsibility while keeping full control over it on the server side. You don’t have to go back to developers and tell them to change things or downsample, making them less in control of what they’re sending you; they get to produce everything that they want. If they think all of these metrics are valuable, and maybe they are, you now have tools to figure out whether that’s the case, and thus you can cut back on costs and increase performance at the same time.
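The server-side aggregation idea behind the fast left-hand column can be sketched as a plain Prometheus recording rule that aggregates away the high-cardinality `instance` label. The metric and rule names are hypothetical, and Chronosphere’s own server-side aggregation rules are configured differently; this only illustrates the technique.

```yaml
groups:
  - name: aggregation
    rules:
      # Pre-aggregate away the high-cardinality "instance" label so
      # dashboards query one series per service and status code instead
      # of one series per pod or host.
      - record: service:http_requests:rate5m
        expr: sum by (service, code) (rate(http_requests_total[5m]))
```

Dashboards then query the pre-aggregated `service:http_requests:rate5m` series, which is why the aggregated column renders in a fraction of the time the raw, per-instance column takes.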
Scott: The thing that’s interesting about the quotas we talked about earlier is that they’re starting to solve that age-old problem we mentioned at the beginning. The developers and engineers who create these metrics now have some insight into how they’re impacting the system, and they’re starting to be more thoughtful about it.
That’s one of the important things we think about when controlling observability costs: [developers] know their data best. Especially at the rate data can grow, and how it can explode from a cardinality perspective, giving them some understanding and some guardrails helps get them thinking about what’s important to them and what’s not.
That makes the organization, together, start to focus on this growth problem instead of it being just a central observability team issue. It allows everybody to be more collaborative, act in the organization’s best interest, and be smart about how they shape their data, so that they can build space into the budget for more visibility, more custom metrics, [and] more experiments. It gives development teams, who have typically been just the creators and not the “understanders” or the “controllers,” that control.
Continue to the Q&A portion of this conversation at 17:24.