A persistent topic in monitoring and observability is data cardinality; more specifically, having access to high-cardinality data so that we can answer different questions and generally better understand the systems that we build and operate.
In the realm of metric data specifically, cardinality is especially important because there is an explicit tradeoff being made when we add more dimensions to our metrics. As engineers, it’s important for us to think about this tradeoff between the dimensions we add to our metrics and the value they provide, to ensure that we are getting an acceptable ROI from the additional cardinality.
Metrics and cardinality
Metrics are really an optimization – we aggregate data (such as requests from a web service) along specific dimensions so that we can quickly and easily know about changes in behavior, or understand how things are trending over time without having to store all of the raw data we’d otherwise need in order to derive what we are measuring. As we add more dimensions to our metrics, we increase the cardinality. This means we are able to answer more questions, but the tradeoff here is that we have more data to store, and querying that data will be slower as well. Thus, it’s important for us to strike a balance in the dimensions that we add to our metrics, so as not to increase the cardinality so much that it dilutes their value.
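The aggregation described above can be sketched in a few lines. This is a minimal illustration, not a real metrics pipeline, and all service and endpoint names are hypothetical:

```python
from collections import Counter

# A sketch of the optimization metrics make: instead of storing every raw
# request event, we keep one counter per (service, endpoint) combination.
raw_requests = [
    ("checkout", "/pay"),
    ("checkout", "/pay"),
    ("search", "/query"),
    ("checkout", "/cart"),
]

# Aggregating along two dimensions collapses the raw events into a small,
# fixed number of series, trading per-request detail for cheap storage
# and fast queries.
request_counts = Counter(raw_requests)

print(request_counts[("checkout", "/pay")])  # 2
print(len(request_counts))  # 3 series, however many raw events arrive
```

Each dimension we add to the key tuple multiplies the number of possible series, which is exactly the cardinality tradeoff discussed above.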
In some cases, high cardinality may be highly valuable or even necessary, but it’s good to validate that the dimensions we add to our metrics are actually being used in our dashboards/alerts, and not something that we are collecting “just in case” – that may be a sign that a dimension is not providing enough value to be part of a metric, and belongs instead in other observability data such as distributed traces or logs.
When talking about cardinality in metrics, we can classify dimensions into three high-level buckets to help us think about the balance between value and cardinality:
- High value: These are the dimensions we need to measure to understand our systems, and they are always/often preserved when consuming metrics in alerts/dashboards.
- A simple example here would be including service/endpoint as a dimension for a metric tracking request latency. There’s no question that this is essential for us to have the visibility that we need to make decisions about our system. In a microservices environment, even a simple example like this can end up adding quite a lot of cardinality. When you have dozens of services each with a handful of endpoints, you quickly end up with many thousands of series even before you add other sensible dimensions such as region, status code, etc.
- Low value/incidental: These are dimensions that are of more questionable value. They may not even be intentionally included, but rather come as a consequence of how metrics are collected from our systems.
- An example dimension that could apply here would be the instance label in Prometheus – it is automatically added to every metric we collect. In some cases we may be interested in per-instance metrics, but for a metric such as request latency on a stateless service running in Kubernetes, per-instance latency may never appear in the dashboards/alerts for your service, so having it as a dimension does not necessarily add much value in that case.
- Useless/harmful: These labels are essentially antipatterns to be avoided at all costs. Including them can result in serious consequences to our metric system’s health, by exploding the amount of data we collect and causing significant problems when we query metrics.
- A good example here would be tagging our request latency metric by request ID – every request would then generate a unique series, and the value of the metric is destroyed, as we are no longer aggregating requests together at all. In general, dimensions with unbounded cardinality are risky to include on metrics – they may not cause issues right away, and can be quite difficult to detect and resolve later on if you do not have good visibility into the metrics you are ingesting.
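The cardinality math behind these buckets is simple multiplication: the series count for a metric is roughly the product of each dimension’s distinct values. A quick sketch, with purely illustrative numbers:

```python
import math

# Rough series-count estimate for a request-latency metric, as the product
# of each dimension's distinct values (all counts are illustrative).
dimensions = {"service": 40, "endpoint": 8, "region": 3, "status_code": 5}
series = math.prod(dimensions.values())
print(series)  # 4800 series from four sensible dimensions

# An unbounded dimension such as request ID multiplies the count by the
# number of requests observed, creating one series per request.
requests_per_day = 1_000_000
print(series * requests_per_day)
```

Four well-chosen dimensions already yield thousands of series; one unbounded dimension turns that into billions, which is why the useless/harmful bucket is so dangerous.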
Look at how metrics are consumed
When we look at the metrics we collect in our systems today, the best way to classify the dimensions we have is by looking at how the data is being consumed, as that tells us how valuable the cardinality is in practice:
- High value dimensions will be regularly preserved in alerts and graphs, and may be a common variable when looking at key dashboards for a service.
- Low value/incidental dimensions might show up in a few graphs, but they will not be preserved in alerts, and in the case of Prometheus, they are probably being excluded in recording rules to help speed up more expensive queries.
- Useless/harmful dimensions ideally should not be seen anywhere in your metrics – if you see any, they should be blocked/removed as soon as possible, using the controls in your monitoring system or via a code change if necessary.
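In Prometheus specifically, both of the remedies above have concrete mechanisms: recording rules can aggregate a low-value dimension away before expensive dashboard queries run, and `metric_relabel_configs` can block a harmful label at ingestion time. A sketch of both, where the metric name, job name, and `request_id` label are hypothetical:

```yaml
# Recording rules file: pre-aggregate the per-instance dimension away so
# dashboards and alerts can query the cheaper aggregated series.
groups:
  - name: request_latency
    rules:
      - record: job:http_request_duration_seconds_count:rate5m
        expr: sum without (instance) (rate(http_request_duration_seconds_count[5m]))
---
# prometheus.yml scrape config: drop an unbounded label entirely before
# the samples are stored.
scrape_configs:
  - job_name: my-service
    metric_relabel_configs:
      - action: labeldrop
        regex: request_id
```

Note that `labeldrop` removes the label but keeps the samples; if that would merge otherwise-identical series, aggregating with a recording rule is the safer path.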
Classifying the dimensions that we capture for our metrics is a useful exercise to ensure that you are getting the value you want from them relative to the amount of data you are collecting. It’s difficult to do this holistically though; we can look at individual microservices easily enough, but in a system where hundreds or even thousands of services are running, it’s too time-consuming to go through them one at a time.
To really control your observability data here, you need tools that let you quickly identify and investigate high-cardinality data, so you can triage outliers as needed, rather than systematically auditing all of your metrics.
This is something that Chronosphere helps our customers with, by allowing them to profile their metrics, and create rules to aggregate away high-cardinality/low-value dimensions before they are stored. As a result, they get the best of both worlds: the value of high-cardinality metrics, and also the ability to easily understand the ROI of their observability data, instead of having to spend tons of time trying to reduce how much metric data they are collecting.