Chronosphere Product Manager Julia Blase explores how metrics quotas protect teams and enable experimentation.
On: Jan 25, 2023
You’re a developer on a well-run team with a technical lead who helps ensure that the relevant metrics that your services produce get highlighted in your observability system dashboards and monitors.
One day you’re looking at a dashboard. Suddenly, the data at the end of the graph starts to disappear. One data point comes in, then nothing, then three in a row. You ask your teammates if anything has changed, but they can’t help. What happened?
Or, depending on your observability system, instead of data disappearing, you and other tech leads in the company get a heated email from finance. An active metrics growth explosion is causing a budget overrun of thousands of dollars and they must know who is causing it and why. Everyone runs around looking to see if it was themselves or another teammate! It’s a whodunnit in the least fun way.
These situations are frustrating and occur with some regularity. They are made worse because everyone who uses the system feels the pain.
In response, we’ve built Metrics Quotas — the ability to allocate specific portions of your license to the internal team(s) or organization(s) that make the most sense for you.
At Chronosphere, our mission is to build a system that makes metrics traffic management simpler. But having a single, system-wide metrics limit (or no limit at all) can make attribution and management of metrics growth a frustrating experience for a central observability team (COT). It can also make individual teams feel less confident when they onboard new services or experiment with changes — what if their change causes metrics to increase in a way that gets the entire company in trouble?
For the COT, Metrics Quotas solves the problem of one team’s metrics growth crowding out metrics that another team relies on. Instead, each team is guaranteed to be able to operate within their protected quota even when other teams’ metrics spike; no COT refereeing needed.
Metrics Quotas also relieves the COT from having to run around playing emergency whack-a-mole to find and remedy metrics volume spikes. Instead, when a spike occurs, the COT can quickly identify the team causing it, know it’s contained so that other teams aren’t impacted, and start working directly with the implicated team to address it.
For individual teams, Metrics Quotas gives each service team a sense of security. They will know their specific license allocation, have confidence that they will never lose metrics data as long as they don’t exceed their quota, and be aware of the potential consequences if they do exceed it.
It also gives them greater transparency into what metrics they are generating and how their traffic shape changes over time, and helps empower them to make changes where and when they need to.
There are several scenarios where Metrics Quotas can help the COT monitor and manage metrics growth while encouraging experimentation:
In practice, to use Metrics Quotas, a COT sets up Metrics Quota “Pools”: it gives each pool a name, allocates each pool a percentage of the license limit (its quota), and assigns metrics to each pool.
So far, we have observed that the best practices for setting up Metrics Quotas are to align pools with teams, or another ownership structure in your organization, and then to assign metrics to the pool/team that owns or manages them.
Customers tend to use the unique values for a tag such as “Service” to do this assignment. For instance, a customer might set up pools such that Pool A is aligned with Team A, and contains metrics for services that Team A owns.
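To make the pool setup concrete, here is a minimal Python sketch of pools that own services and of a lookup that assigns a metric to a pool by its “service” tag. It is an illustration only, not Chronosphere's configuration format or API: the `Pool` class, the `validate_pools` and `pool_for_metric` helpers, the team and service names, the catch-all `default` pool, and the rule that allocations may not exceed 100% are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Hypothetical quota pool: a name, a share of the license, and owned services."""
    name: str
    percent_of_license: float  # share of the license limit, 0-100
    services: set[str] = field(default_factory=set)


def validate_pools(pools: list[Pool]) -> None:
    """Assumed rule: pools may not hand out more than the whole license."""
    total = sum(p.percent_of_license for p in pools)
    if total > 100:
        raise ValueError(f"pools allocate {total}% of the license (max 100%)")


def pool_for_metric(tags: dict[str, str], pools: list[Pool],
                    default: str = "default") -> str:
    """Return the name of the pool that owns a metric, based on its 'service' tag."""
    service = tags.get("service")
    for pool in pools:
        if service in pool.services:
            return pool.name
    # Assumption: metrics that match no pool land in a catch-all pool.
    return default


# Pool A is aligned with Team A and holds the services Team A owns.
pools = [
    Pool("team-a", 40, {"checkout", "cart"}),
    Pool("team-b", 35, {"search"}),
]
validate_pools(pools)

print(pool_for_metric({"service": "checkout"}, pools))  # team-a
print(pool_for_metric({"service": "billing"}, pools))   # default
```

Keying the assignment on a single ownership tag keeps the mapping easy to audit: each service belongs to exactly one pool, so a spike can be traced to one owning team.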
Metrics Quotas will be visible in a system dashboard, where you can track how much of its quota each pool is consuming at a given time. You’ll be able to track when a pool is getting penalized, and you’ll be able to navigate to the Analyzer in order to see specifically what is being negatively affected.
Finally, you’ll be able to alert when pools are near to or exceed their limit, so that you can work with the impacted team to reshape their traffic and get them back within their quota — or expand their quota if it is truly needed!
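The alerting idea can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the product's alerting implementation: the `pool_status` function, the 80% warning threshold, the license limit, and the usage figures are all hypothetical.

```python
def pool_status(used: float, quota: float, warn_ratio: float = 0.8) -> str:
    """Classify a pool's consumption against its quota.

    warn_ratio is an assumed threshold for "near the limit".
    """
    if quota <= 0:
        raise ValueError("quota must be positive")
    ratio = used / quota
    if ratio > 1.0:
        return "exceeded"
    if ratio >= warn_ratio:
        return "near_limit"
    return "ok"


# Hypothetical license limit, in whatever unit the license is measured in.
LICENSE_LIMIT = 1_000_000

# Each pool's quota, as a percentage of the license limit.
quota_percent = {"team-a": 40, "team-b": 35, "team-c": 25}

# Hypothetical current consumption per pool, in the same unit as the limit.
usage = {"team-a": 250_000, "team-b": 360_000, "team-c": 210_000}

statuses = {
    name: pool_status(usage[name], LICENSE_LIMIT * pct / 100)
    for name, pct in quota_percent.items()
}

for name, status in statuses.items():
    if status != "ok":
        print(f"ALERT: pool {name} is {status}")
```

With these numbers, team-b has exceeded its quota and team-c is near its limit, while team-a is unaffected. The COT can go straight to the two teams involved rather than hunting across the whole system.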
Metrics Quotas eliminates the “noisy neighbor” problem of one team’s metrics growth crowding out another team’s metrics. Quotas give teams a safe environment to experiment with their metrics without worrying about affecting their colleagues.
The Metrics Quotas feature is the first step in the journey to empower service teams and developers to manage their metrics and use their quotas as they see fit.
We’ll continue to release new capabilities that allow service owners more metrics control and ownership. With these features, the COT can move on from firefighting overruns and instead provide strategic input and advice to service owners – and help get the most business value out of the metrics.
If you’re interested in this or other Chronosphere features, watch a webinar, read a blog, or set up time for a demo!
Request a demo for an in-depth walkthrough of the platform!