Introducing Metrics Quotas: Protect yourself from cardinality explosions and budget overruns 

on January 25th 2023

TL;DR: Unexpected metrics growth can push new, noisy data into a system while pushing out well-known, high-signal data. Metrics Quotas prevent engineering teams from being unfairly penalized by other teams’ metrics growth, and help align incentives for central observability team (COT) managers and engineering teams to jointly monitor and manage metrics growth over time.

You’re a developer on a well-run team with a technical lead who helps ensure that the relevant metrics that your services produce get highlighted in your observability system dashboards and monitors. 

One day you’re looking at a dashboard. Suddenly, the data at the end of the graph starts to disappear. One data point comes in, then nothing, then three in a row. You ask your teammates if anything has changed, but they can’t help. What happened?

Or, depending on your observability system, instead of data disappearing, you and other tech leads in the company get a heated email from finance. An active metrics growth explosion is causing a budget overrun of thousands of dollars and they must know who is causing it and why. Everyone runs around looking to see if it was themselves or another teammate! It’s a whodunnit in the least fun way. 

These situations are frustrating, and occur with some regularity. And they are made worse because anyone and everyone that uses the system feels the pain. 

In response, we’ve built Metrics Quotas — the ability to allocate specific portions of your license to the internal team(s) or organization(s) that make the most sense for you.

Protect teams and enable experimentation 

At Chronosphere, our mission is to build a system that makes metrics traffic management simpler. But having a single, system-wide metrics limit (or no limit at all) can make attribution and management of metrics growth a frustrating experience for a central observability team (COT). It can also make individual teams feel less confident when they onboard new services or experiment with changes — what if their change causes metrics to increase in a way that gets the entire company in trouble? 

For the COT, Metrics Quotas solves the problem of one team’s metrics growth crowding out metrics that another team relies on. Instead, each team is guaranteed to be able to operate within their protected quota even when other teams’ metrics spike; no COT refereeing needed. 

Metrics Quotas also relieves the COT from having to run around playing emergency whack-a-mole to find and remedy metrics volume spikes. Instead, when a spike occurs, the COT can quickly identify the team causing it, know it’s contained so that other teams aren’t impacted, and start working directly with the implicated team to address it.

For individual teams, Metrics Quotas gives each service team a sense of security. They will know their specific license allocation, have confidence that they will never lose metrics data as long as they don’t exceed their quota, and be aware of the potential consequences if they do exceed it. 

It also gives them greater transparency into what metrics they are generating and how their traffic shape changes over time, and helps empower them to make changes where and when they need to. 

When to use Metrics Quotas 

There are several scenarios where Metrics Quotas can help the COT monitor and manage metrics growth while encouraging experimentation: 

  • Accident management: Someone adds a UID to a label, and suddenly the metric’s cardinality (and data points per second) grows significantly. Set up Metrics Quotas for a given team’s set of services so that when this type of accident happens, the “blast radius” of the metrics growth (and any license penalties) is isolated to the team that caused it. 
  • Experimentation: A developer that spins up a new service for their team may want to instrument a lot of metrics to get an idea of how the service is working. Set up Metrics Quotas for their team so that they can add as many metrics as they want and know that they’ll only be impacting their own team — nothing they do can hurt any other team’s metrics, nor the system as a whole. 
  • Slow metrics growth: Your company is seeing more users over time. Some microservices see this growth impact their metrics volume more than others. From one day to the next, the growth isn’t significant, but two months later, some services’ metrics volume has grown by 20%! Set up Metrics Quotas for all of your teams so that areas of growth are easy to identify and target with traffic shaping tools (rollup, drop, recording rules, and more). 
  • Cyclical traffic: Perhaps some of your service metrics are very critical in the daytime, but less important overnight, when you’ve scheduled a large set of batch jobs to run. Set up Metrics Quotas to ensure daytime service metrics are protected and that there is sufficient budget headroom for these critical metrics to be persisted. But don’t worry about the nighttime jobs — daytime-critical service quotas will have extra “headroom” overnight, which can and will be automatically utilized by those nighttime batch jobs. In fact, any time there is headroom in the system, other services will be able to take advantage of it, without any manual intervention from the COT.
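To make the first and last scenarios concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Chronosphere's implementation: the function names, the integer percentage shares, and the rule that spare headroom is split in proportion to unmet demand are all hypothetical. The first function shows why adding a high-cardinality label such as a user ID multiplies the number of time series. The second shows one way unused quota could be lent to pools that need more.

```python
from math import prod


def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on the number of time series for one metric.

    Each distinct combination of label values is its own series, so the
    bound is the product of the distinct-value counts of every label.
    """
    return prod(label_value_counts.values())


def allocate(limit: int, share_pct: dict[str, int], demand: dict[str, int]) -> dict[str, int]:
    """Grant each pool its demand up to its quota, then lend out spare headroom.

    ASSUMPTION: spare capacity is split in proportion to each pool's unmet
    demand. The real product may apply a different policy.
    """
    # Step 1: every pool gets its demand, capped at its protected quota.
    granted = {p: min(demand.get(p, 0), limit * pct // 100) for p, pct in share_pct.items()}
    # Step 2: whatever part of the license is still unused is headroom.
    headroom = limit - sum(granted.values())
    unmet = {p: demand.get(p, 0) - granted[p] for p in share_pct}
    total_unmet = sum(unmet.values())
    # Step 3: lend headroom to pools that want more than their quota.
    if headroom > 0 and total_unmet > 0:
        for p in share_pct:
            if total_unmet <= headroom:
                granted[p] += unmet[p]
            else:
                granted[p] += unmet[p] * headroom // total_unmet
    return granted


# Accident: a user_id label with 10,000 values lands on a 100-series metric.
base = {"endpoint": 20, "status": 5}
print(max_series(base))                         # 100
print(max_series({**base, "user_id": 10_000}))  # 1000000

# Cyclical traffic: hypothetical 70/30 split of a 1,000 data-points-per-second license.
shares = {"daytime": 70, "batch": 30}
print(allocate(1000, shares, {"daytime": 900, "batch": 600}))  # {'daytime': 700, 'batch': 300}
print(allocate(1000, shares, {"daytime": 200, "batch": 600}))  # {'daytime': 200, 'batch': 600}
```

In the daytime case both pools are held to their own quotas, so the batch jobs cannot crowd out the critical metrics. Overnight, the daytime pool leaves headroom unused and the batch pool absorbs it without anyone changing a setting.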

How to use Metrics Quotas  

In practice, a COT sets up Metrics Quota “Pools”: it gives each pool a name, allocates each pool a percentage of the license limit (its quota), and assigns metrics to each pool.

So far, we have observed that the best practices for setting up Metrics Quotas are to align pools with teams, or another ownership structure in your organization, and then to assign metrics to the pool/team that owns or manages them. 

Customers tend to use the unique values for a tag such as “Service” to do this assignment. For instance, a customer might set up pools such that Pool A is aligned with Team A, and contains metrics for services that Team A owns.
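As a rough illustration of the pool setup described above, the Python sketch below models pools as plain data and routes a metric to a pool by its service label. Everything here is hypothetical: the `Pool` class, the `service` label key, the team and service names, and the catch-all `default` pool are assumptions made for the example, not Chronosphere's configuration schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    name: str
    percent: int  # share of the license limit allocated to this pool
    services: set[str] = field(default_factory=set)  # services whose metrics belong here


def validate(pools: list[Pool]) -> None:
    """Reject a configuration that hands out more than the whole license."""
    total = sum(p.percent for p in pools)
    if total > 100:
        raise ValueError(f"pools allocate {total}% of the license, more than 100%")


def pool_for(labels: dict[str, str], pools: list[Pool], default: str = "default") -> str:
    """Return the name of the pool that owns a metric, based on its service label.

    ASSUMPTION: metrics that match no pool fall into a catch-all pool.
    """
    service = labels.get("service")
    for pool in pools:
        if service in pool.services:
            return pool.name
    return default


pools = [
    Pool("team-a", 40, {"checkout", "cart"}),
    Pool("team-b", 30, {"search"}),
]
validate(pools)
print(pool_for({"service": "cart", "status": "200"}, pools))  # team-a
print(pool_for({"service": "billing"}, pools))                # default
```

Keying the assignment on a label that every service already emits means new metrics from an existing service land in the right pool without any extra configuration.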

Metrics Quotas will be visible in a system dashboard, where you can track how much of its quota each pool is consuming at a given time. You’ll be able to track when a pool is getting penalized, and you’ll be able to navigate to the Profiler in order to see specifically what is being negatively affected. 

Finally, you’ll be able to alert when a pool approaches or exceeds its limit, so that you can work with the impacted team to reshape their traffic and get them back within their quota, or expand their quota if it is truly needed!
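The alerting idea can be sketched in a few lines of Python. This is only an illustration: the 90% warning threshold, the status names, and the sample usage numbers are assumptions made for the example, and the product's own alerting is configured differently.

```python
def pool_status(usage: int, quota: int, warn_pct: int = 90) -> str:
    """Classify a pool by how much of its quota it is consuming.

    ASSUMPTION: a pool is "near" its quota once usage reaches warn_pct percent.
    """
    if quota <= 0:
        raise ValueError("quota must be positive")
    if usage > quota:
        return "over_quota"
    if usage * 100 >= quota * warn_pct:
        return "near_quota"
    return "ok"


def pools_needing_attention(usage: dict[str, int], quota: dict[str, int]) -> dict[str, str]:
    """Return only the pools that are near or over their quota."""
    statuses = {name: pool_status(usage.get(name, 0), q) for name, q in quota.items()}
    return {name: s for name, s in statuses.items() if s != "ok"}


quota = {"team-a": 400, "team-b": 300}
usage = {"team-a": 380, "team-b": 120}
print(pools_needing_attention(usage, quota))  # {'team-a': 'near_quota'}
```

A warning tier below the hard limit gives the COT and the owning team time to apply rollup or drop rules before any data is penalized.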

Wrapping up 

Metrics Quotas eliminates the “noisy neighbor” problem of one team’s metrics growth crowding out another team’s metrics. It gives teams a safe environment to experiment with their metrics without worrying about affecting their colleagues.

Benefits for COTs

  • Less time firefighting overruns, more time working with teams to help them become better metrics producers and users. 
  • Less time calming ruffled feathers, whether it’s CFOs dealing with budget overruns, or teams who stayed within their allotted usage but were negatively impacted due to another team’s actions — because that won’t happen anymore. 

Benefits for developers 

  • More freedom to experiment with your kingdom of metrics. The only people impacted if something goes wrong will be you and your immediate team.
  • Stop getting spammed with “was this you?” messages from your COT manager. Investigating issues that aren’t your fault is a poor use of time. 

What’s Next

The Metrics Quotas feature is the first step in the journey to empower service teams and developers to manage their metrics and use their quotas as they see fit. 

We’ll continue to release new capabilities that allow service owners more metrics control and ownership. With these features, the COT can move on from firefighting overruns and instead provide strategic input and advice to service owners, and help get the most business value out of the metrics.

If you’re interested in this or other Chronosphere features, watch a webinar, read a blog, or set up time for a demo!
