With modern cloud native architectures, data growth runs rampant. Remediating issues faster, while reducing engineering burnout and overall observability spend, is key in today’s economic climate of high customer expectations.
In this demo, hear from Chronosphere Senior Sales Engineer Vince Sarkisian on how to best analyze which data to keep and which to discard, improve performance, reduce costs, and make developers’ lives easier. He walks attendees through some of the top challenges developers face when it comes to productivity, and the Chronosphere features that can specifically address them to improve overall workflows.
If you don’t have time to sit down for the full chat, check out the transcript below.
How Chronosphere supports developer productivity
Vince Sarkisian: I just want to set the stage and talk about what Chronosphere is, and what makes us different. Chronosphere is a [cloud native] observability platform that ingests metrics, traces, and events, and stores them for your developers and engineers to dashboard, alert on, and really find value in that observability data.
The key thing that I want you to take away from this slide is our commitment to open source compatibility. Chronosphere was built on open source standards. You see the Prometheus icon here, as well as the OpenTelemetry icon: the strongest open source projects for metrics and traces.
We also support ingesting Graphite and StatsD, as well as Jaeger and Zipkin. So, we’re really built on those open source standards so that all of your [instrumentation] is vendor agnostic, and you are just sending us your data to reduce that vendor lock-in. The other thing that you would want to focus on in this info chart is our ability to manage and control costs. With most observability platforms, you are collecting the data and sending it to the datastore. From the datastore, you can analyze and process the data as needed.
Chronosphere’s Control Plane and PromQL
Vince: With Chronosphere, we have a Control Plane in front of our datastore. What this Control Plane allows you to do is analyze, shape, delegate, and optimize the data as it streams in, before the persistence layer. It really gives you a lot of great tools and capabilities to understand your data before you persist it. And the key thing about a streaming platform versus a persistence-first one is that you’re only charged for what you persist. So, send us all of your data, understand it, shape it, and then we only charge you for what you persist.
We really believe that in the observability world, the pricing models are backwards, in the sense that everyone is collecting so much data and being charged for all of that data, instead of flipping that: collecting all the data, understanding what’s valuable, and then only being charged for what is valuable.
Once the data hits the datastore, then we start to build all of our dashboarding, alerting, and our query engine. Again, all open source compatible PromQL, so we can extend those use cases if you want to just offer your customers a PromQL endpoint, or if you want to build on top of that. Everything can run through our query engine or, as we’ll show, through our user experience with our dashboarding and alerting capabilities, to really increase that developer productivity and get high value out of the observability data.
Top 3 hindrances to developer productivity
Vince: Let’s talk about the problems that we’re going to address in this demo. The first one is you have your observability team saying, “Hey developers, our metrics are costing us a lot of money and growing fast. What isn’t important that we could get rid of?”
This is going to decrease developer productivity because they’re now focused on what data they’re sending and how they’re using the data, instead of being focused on building your company’s products and services. It also decreases the productivity of the observability team, because now, instead of being able to provide a great observability platform, they’re spending their time chasing down data costs.
The second problem [is] you have your developers and engineers saying: “Our observability tools are too complex. When I log in, I’m overwhelmed with dashboards and alerts. When I get paged, I don’t know where to go next [for a] root cause analysis [of] the issue.”
This is another thing that takes away from developer productivity, right? You have engineers for whom it should take seconds or minutes, but it’s taking them hours to figure out what’s going on. They’re sifting through tons of dashboards. They’re sifting through tons of data, and it takes a lot of expertise to get the result. This also decreases developer productivity because you’re reliant on your expert-level engineers to solve problems, and you can’t get that same level of problem solving across all the different skill sets in your company.
The third problem is talking about configuration as code. So, you’ve got the observability team saying: “We need to standardize all configuration management via GitOps workflows using our Terraform provider. But, my developers complain about using config as code. How can I templatize everything and have a good developer experience?”
We really develop everything as configuration as code first, and support that workflow. But we also want to provide value in interacting with UI and connecting those two workflows. So, we’ll show you what we’re doing around our user experience in the UI, as well as connecting that to our configuration as code tools.
Let’s walk through what we’ll show in the demo to solve those problems. The first one aligns with that first problem: data growth is out of control, so how do we manage costs? We’ll talk about setting quotas per team, reducing alert fatigue, and how we use our monitors and signal grouping feature to reduce the number of alerts that we have. And then, how we analyze our metrics usage in seconds so that we can quickly identify opportunities to reduce our data footprint and not incur all those costs.
Next, we’ll talk about the root cause analysis workflow: “How do you go from an alert, to a dashboard, to our tracing experience without the burden of context switching?” So again, increasing that developer productivity to solve problems quicker, with fewer engineers.
And finally, talking about those GitOps workflows: how you manage GitOps workflows in Chronosphere via Terraform, the API, or Chronoctl, and what tools we’re offering in the UI to integrate those workflows.
Setting quotas with flexible workflows
Vince: Let’s dive into the product and see this first hand. The first thing that we mentioned was data growth and how we manage that from the observability team’s perspective, but also from the developer engineering perspective.
The first thing that we want to do is be able to set quotas per team. In this dashboard, we can see that we are setting our different quota pools. So, we have a default quota, and then each service has its own quota that it can manage. We also have a development pool quota.
So, we separate each of the resources within our platform to give each team the autonomy to manage their own resources. Setting quotas is pretty simple, right? You draw a line. But what happens when those quotas are exceeded?
That’s where Chronosphere comes in and gives you the tooling to set the quotas, but also flexibility around what happens when those quotas are exceeded. The first thing is, we have a default pool. Let’s say a team goes over their quota. They can dip into the default pool that we assign, and no data is lost or dropped until the total resources of the default pool are consumed.
Another option is to determine which metrics are higher or lower value. In the case that you do hit your limits, you have designated high value metrics that won’t be dropped or rate limited, and then you have lower value metrics. Those will be the first ones to be rate limited or dropped.
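The quota behavior described above can be sketched in a few lines of Python. This is purely illustrative: the class, field names, and admission logic are assumptions made for the sketch, not Chronosphere’s actual API or internals.

```python
# Illustrative sketch of per-team quotas with a shared default pool and
# high/low-value prioritization. All names and numbers are hypothetical.
from dataclasses import dataclass, field


@dataclass
class QuotaPools:
    team_quota: dict                   # team -> allotted capacity (e.g. active series)
    default_pool: int = 0              # shared overflow capacity
    team_used: dict = field(default_factory=dict)

    def admit(self, team: str, high_value: bool) -> bool:
        """Decide whether one incoming series is persisted."""
        used = self.team_used.get(team, 0)
        if used < self.team_quota.get(team, 0):
            self.team_used[team] = used + 1
            return True
        if self.default_pool > 0:      # over quota: dip into the shared default pool
            self.default_pool -= 1
            self.team_used[team] = used + 1
            return True
        # Both pools exhausted: designated high-value metrics are still kept;
        # low-value metrics are the first to be rate limited or dropped.
        return high_value


pools = QuotaPools(team_quota={"checkout": 2}, default_pool=1)
print([pools.admit("checkout", hv) for hv in (False, False, False, False, True)])
# The fourth (low-value, pools exhausted) is rejected; the high-value one is kept.
```

The point of the sketch is the ordering: team quota first, then the shared default pool, and only when both are exhausted does the high/low-value designation decide what gets dropped.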
Alerting and user experience
Vince: So, setting the quotas is pretty simple. But how do we notify each team, and how do we manage the alerts and the user experience so that we can set these quotas while also giving our engineers the tools to effectively manage them? I’m going to shift into our alerting platform. We’re going to look for our service metric quota breach.
Now, I’m brought into our alerting user interface, where we can set up alerts on all of our different metrics, notify the right people, and give them the tools they need to manage their quotas effectively. The first thing I want to call out is our signal grouping feature, which allows you to have far fewer alerts. We create monitors, each defined by a PromQL query. But, depending on how we’re grouping the results of that query, we can set up any number of unique alerts.
In this scenario, I have my PromQL query grouping by app name, and depending on the label value of that app name, I can set up a different notification policy per unique label set. So I have one monitor, but I actually have six different unique alerts, depending on what the service or application name is, and that will define who the alert goes to.
This is a great way to reduce that alert fatigue and manage your alerts without having thousands. We’ve seen customers go from tens of thousands of alerts, down to a couple hundred because they’re able to organize and effectively set up these alerts in a very organized fashion.
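The signal grouping idea can be sketched as a small routing function: one monitor fires on a set of label values, and each unique value of the grouping label is routed to its own notification policy. The function, label names, and team names below are assumptions for illustration, not Chronosphere’s API.

```python
# Hypothetical sketch: one monitor, many unique alerts, routed by a group label.
def route_alerts(firing_series, policies, group_label="app_name", default="oncall"):
    """Map each firing label set to a receiver based on its group label value."""
    routed = {}
    for series in firing_series:
        key = series.get(group_label, "unknown")
        receiver = policies.get(key, default)   # fall back to a default policy
        routed.setdefault(receiver, []).append(key)
    return routed


# One monitor's query fires for three apps -> three distinct alerts/receivers.
firing = [{"app_name": "checkout"}, {"app_name": "payments"}, {"app_name": "search"}]
policies = {"checkout": "team-store", "payments": "team-billing"}
print(route_alerts(firing, policies))
```

The design point is that the alert definitions stay compact: you maintain one monitor and one routing table, rather than one hand-written alert per service.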
Continue this part of the conversation at minute 12:18.
Saving on costs with aggregation rules
Vince: Dropping data is pretty straightforward. [Chronosphere] really increases developer productivity, because figuring out how I’m using the data, which may have taken hours or days, is now all in one place. So, it just takes seconds to open up the screen, see how my team’s using the data, and then create those drop rules.
The second way that we reduce the data footprint and save on costs is through aggregation of the data. So, if you’re in a highly containerized environment and you’re collecting data from hundreds or thousands of pods, that cardinality can grow out of control very quickly. If we take one of our more utilized metrics, with a utility score of 12, that means the metric is being queried: it’s in my dashboards, it’s in my alerts, it’s being queried by my team. But I can take a deeper dive into this metric by label, and look at the utility score of each label.
For shopping checkout status total, that’s more of an application metric, right? You probably don’t want the infrastructure labels, and you’re probably not finding value in them. This view shows that the instance label has a unique cardinality of 24, but a utility score of 0. So, this is an opportunity for us, as the data streams in, to drop the instance label because it’s not providing any value, which is going to reduce the number of active time series that we’re tracking for this metric, potentially by 24x.
Now that we have this data point, we know one of these labels is not valuable. We have the ability to create aggregation rules for each of the unused metrics or unused labels, and reduce the data volume that way. I’ll focus on these two aggregation rules here. Within each aggregation rule, you have your filter: you can match simply by name, or by metric type, or by any combination of label filters.
Once you have the metrics that you want to aggregate, you then decide whether to keep all of the current labels and discard a subset. In that scenario, we identified that instance was not valuable for that metric, so let’s discard instance. Or, do you want to discard everything and only keep a certain set of labels? In addition, we have the ability to drop the raw data afterwards; but if you wanted to keep the raw data, that’s an option as well. Aggregation rules can also be used for speeding up dashboards. So, if you have an executive dashboard that’s a rollup of a lot of data, you can pre-aggregate that data so the dashboard is super snappy and you’re not waiting seconds or minutes for it to load.
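The effect of discarding a zero-utility label can be sketched as follows: merging all series that differ only in the dropped label collapses the cardinality contributed by that label. The function and label names are illustrative assumptions, not how Chronosphere implements its aggregation rules.

```python
# Illustrative sketch: dropping a label merges series that become identical
# without it, summing their values (a sum-style aggregation).
from collections import defaultdict


def aggregate_drop_label(series, drop_label):
    """Merge (labels, value) pairs after removing `drop_label`, summing values."""
    merged = defaultdict(float)
    for labels, value in series:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k != drop_label))
        merged[kept] += value
    return dict(merged)


# 24 per-pod series of a hypothetical checkout metric collapse into one:
raw = [({"status": "ok", "instance": f"pod-{i}"}, 1.0) for i in range(24)]
agg = aggregate_drop_label(raw, "instance")
print(len(raw), "->", len(agg))  # prints: 24 -> 1
```

That 24-to-1 collapse is exactly the “potentially 24x” reduction in active time series described for the instance label above.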
Chronosphere’s Collection Home
Vince: What we just went over is: We have data growth, we want to set quotas so that we’re managing that data growth, and we also want to increase developer productivity by helping our teams identify ways to reduce their data footprint, and giving them the actual tools to reduce it through dropping or aggregating data.
Let’s shift gears to your developers, who are saying: “I’m overwhelmed when I log into the observability system. I see too much stuff, or I get a page and I know what the page is, but I don’t know where to go.”
Chronosphere offers two things for that: The first thing we have is our Collection Home. When our users log into Chronosphere, we want them to land on a page that is contextually important for them. If I were a developer who is managing the checkout service, when I log into Chronosphere, I’d log in and I’d see this page. This Collection Home is a template that is configurable, so we can add any important metrics, bring in the monitors and dashboards that are important for this service, as well as link to any other collections that are dependent or connected to my checkout service.
For this, I know the checkout service uses MySQL, so I can go to the MySQL collection; and the checkout service is part of the order service, which is why I’ve linked that here as well. We also have helpful links to runbooks, other notification policies, as well as the team members. So, you can think of this as a home wiki page where, when your engineers log in to their observability platform, they’re not intimidated by seeing hundreds of dashboards. They see their dashboards, and they can quickly explore what is important to their service.
From here, we again want this to be templatized, so you’re managing this out of the box. If you have a new service, it gets a Collection Home, it gets the monitors and dashboards, and that’s all managed as configuration as code to help you, as an observability team, increase your productivity as well as your developers’ productivity, so they know what’s standard and expected.
The second thing we wanted to walk through was: let’s say you get an alert. How do you perform root cause analysis on that alert without having to context switch, or think about where you’re going next?
Continue this portion of the conversation at minute 23:57.
Using configuration as code with Chronosphere
Vince: Finally, I want to talk about configuration as code. How do we manage all of our templates: our Collection Homes, our dashboards, our alert configurations? How do we do that using our configuration as code tools?
So, I’m on my alerting view, and if I click Edit for this monitor, I’m taken into the visual editor. Let’s say I’m in here, and I think: “I’d love to edit my monitors through the visual editor and through the UI.” So, I’m going to just add a tag here, and maybe we’ll change the sustain value to 5 minutes.
Now, I’m sandboxing my changes, right? I’m using the UI to create the changes I want. But I can’t just save this; that’s going to screw up my GitOps workflow. I want to make sure that I maintain that state. So, what Chronosphere offers is the ability to see my configuration for my Terraform provider, or for the Chronoctl command line tool or the API.
I can also highlight: “What work did I do? What changes did I make?” So, this is a great way to start to integrate that workflow from the user experience, as well as the configuration of these tools, and create a kind of synergy between the two.
Continue the rest of the talk and Q&A portion at minute 30:32.