A discussion of cardinality in metrics, and how classifying dimensions into buckets helps us weigh the value we get against the cardinality we pay for.
February 15, 2022
A persistent topic in monitoring and observability is data cardinality: specifically, having access to high-cardinality data so that we can answer a wider range of questions and generally better understand the systems we build and operate.
In the realm of metric data specifically, cardinality is especially important because there is an explicit tradeoff being made when we add more dimensions to our metrics. As engineers, it’s important to weigh the dimensions we add to our metrics against the value they provide, to ensure that we are getting an acceptable ROI from the additional cardinality.
Metrics are really an optimization: we aggregate data (such as requests to a web service) along specific dimensions so that we can quickly and easily spot changes in behavior, or understand how things are trending over time, without having to store all of the raw data we’d otherwise need to derive what we are measuring. As we add more dimensions to our metrics, we increase the cardinality. This means we can answer more questions, but the tradeoff is that we have more data to store, and querying that data becomes slower as well. Because each new dimension multiplies the number of time series by its number of distinct values, it’s important to strike a balance in the dimensions we add, so as not to increase the cardinality so much that it dilutes the value of the metrics.
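To make that multiplication concrete, here is a small back-of-the-envelope sketch. The metric labels and value counts are hypothetical examples, not taken from any particular system:

```python
# A minimal sketch of how per-label value counts multiply into total
# series count. All label names and counts here are hypothetical.
label_values = {
    "method": 5,        # e.g. GET, POST, PUT, DELETE, PATCH
    "status_code": 10,  # distinct HTTP status codes observed
    "endpoint": 50,     # distinct routes served
    "pod": 200,         # distinct pod names in the deployment
}

total_series = 1
for label, count in label_values.items():
    total_series *= count

print(f"worst-case series for one metric: {total_series:,}")
# 5 * 10 * 50 * 200 = 500,000 time series. Adding the single
# 200-value "pod" label multiplies storage and query cost by 200x.
```

The point of the sketch is that cardinality grows multiplicatively, not additively: one extra dimension can dominate the cost of everything collected before it.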
In some cases high cardinality may be highly valuable, or even necessary, but it’s good to validate that the dimensions we add to our metrics are actually used in our dashboards and alerts, and not something we are collecting “just in case.” An unused dimension may be a sign that it isn’t providing enough value to be part of a metric, and that it belongs instead in other observability data such as distributed traces or logs.
When talking about cardinality in metrics, we can classify dimensions into three high-level buckets to help us think about the balance between value and cardinality.
When we look at the metrics we collect in our systems today, the best way to classify the dimensions we have is by looking at how the data is being consumed, as that tells us how valuable the cardinality is in practice:

- Dimensions that drive alerts: these have the clearest value, since they are what lets us detect and respond to changes in behavior.
- Dimensions used in dashboards and ad-hoc queries: these earn their cardinality through investigation and debugging, as long as they are actually being queried.
- Dimensions that are not queried at all: the “just in case” data described above, and the first candidates to aggregate away or move into traces or logs.
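As a rough illustration of this consumption-based audit, the sketch below compares the labels a metric is collected with against the labels referenced by alerts and dashboards. All of the label names and sets here are hypothetical examples; in practice they would come from your metric store and your dashboard/alert definitions:

```python
# Compare the labels a metric is collected with against the labels
# actually consumed by alerts and dashboards. Hypothetical data.
collected_labels = {"method", "status_code", "endpoint", "pod", "client_id"}

labels_used_in_alerts = {"status_code", "endpoint"}
labels_used_in_dashboards = {"method", "status_code", "endpoint"}

used = labels_used_in_alerts | labels_used_in_dashboards
unused = collected_labels - used

print("alert-driving labels: ", sorted(labels_used_in_alerts))
print("dashboard-only labels:", sorted(labels_used_in_dashboards - labels_used_in_alerts))
print("'just in case' labels:", sorted(unused))  # candidates for traces/logs
```

Here "pod" and "client_id" would surface as collected-but-unused, which is exactly the signal that their cardinality may not be paying for itself.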
Classifying the dimensions we capture for our metrics is a useful exercise to ensure that we are getting the value we want from them relative to the amount of data we are collecting. It’s difficult to do this holistically, though: we can audit an individual microservice easily enough, but in a system where hundreds or even thousands of services are running, going through them one at a time is too time-consuming.
To really take control of your observability data, you need tools that allow you to quickly identify and investigate high-cardinality data, so you can triage outliers as needed rather than systematically auditing all of your metrics.
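A minimal sketch of what that identification step could look like, assuming you can enumerate the label sets of your stored series (the series below are made-up examples):

```python
from collections import defaultdict

# Given the label sets of stored series, count distinct values per label
# to find the dimensions driving cardinality. Hypothetical example data.
series = [
    {"method": "GET",  "endpoint": "/users", "pod": "api-7f9c-abcde"},
    {"method": "GET",  "endpoint": "/users", "pod": "api-7f9c-fghij"},
    {"method": "POST", "endpoint": "/users", "pod": "api-7f9c-abcde"},
    {"method": "GET",  "endpoint": "/carts", "pod": "api-7f9c-klmno"},
]

values_per_label = defaultdict(set)
for labels in series:
    for name, value in labels.items():
        values_per_label[name].add(value)

# Labels sorted by distinct-value count; the outliers at the top are the
# first candidates to investigate.
for name, values in sorted(values_per_label.items(), key=lambda kv: -len(kv[1])):
    print(f"{name}: {len(values)} distinct values")
```

Even on this toy data, "pod" floats to the top, which matches the intuition that per-instance labels are frequent cardinality outliers.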
This is something Chronosphere helps our customers with, by allowing them to profile their metrics and create rules to aggregate away high-cardinality, low-value dimensions before they are stored. As a result, they get the best of both worlds: the value of high-cardinality metrics, and the ability to easily understand the ROI of their observability data, instead of spending tons of time trying to reduce how much metric data they are collecting.
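To show the shape of such an aggregation rule, here is a toy version that drops a high-cardinality, low-value label (here, "pod") and sums the per-pod counters into one series per remaining label combination. This is only an illustration of the idea, not how Chronosphere’s rule engine is implemented:

```python
from collections import defaultdict

# Toy aggregation rule: drop one label and sum series that become
# identical without it. Series keys and values are hypothetical.
raw = {
    (("endpoint", "/users"), ("pod", "api-abcde")): 120,
    (("endpoint", "/users"), ("pod", "api-fghij")): 95,
    (("endpoint", "/carts"), ("pod", "api-abcde")): 40,
}

def aggregate_away(series, label_to_drop):
    rolled_up = defaultdict(float)
    for labels, value in series.items():
        kept = tuple(l for l in labels if l[0] != label_to_drop)
        rolled_up[kept] += value
    return dict(rolled_up)

print(aggregate_away(raw, "pod"))
# {(('endpoint', '/users'),): 215.0, (('endpoint', '/carts'),): 40.0}
```

Applied before storage, a rule like this keeps the per-endpoint signal while shedding the per-pod cardinality that nothing downstream was querying.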
Request a demo for an in-depth walkthrough of the platform!