Importance of cardinality
We talk a lot about high cardinality data in the observability world: the importance of having access to it, and also the dangers of having too much of it. But what do we mean when we talk about data cardinality? Cardinality is defined as the number of elements in a set or other grouping. To make it a little clearer what that means, let’s walk through an example.
Why does data cardinality matter?
Understanding data cardinality is essential for effective data modeling and analysis. It enables database designers and analysts to create structures that accurately represent real-world scenarios. By defining relationships clearly, it becomes easier to retrieve, update, and analyze information without causing confusion or errors.
Cardinality growth: An example
Say we want to keep track of how many cars pass by on a street; to do so, we can simply keep a running count with each new one that goes by. If we express our count of passing cars as a metric, it might look something like this:
passed_cars_total{}: 10
We’re only tracking a single number at this point, the number of cars that have passed us – that means our metric has a cardinality of 1. What if we wanted a little more information about the cars going by though? We could break it down by the type of vehicle, such as sedan, van, SUV, or truck. Then our metric might look like this:
passed_cars_total{vehicle_type="sedan"}: 4
passed_cars_total{vehicle_type="van"}: 1
passed_cars_total{vehicle_type="suv"}: 3
passed_cars_total{vehicle_type="truck"}: 2
By splitting our count out by vehicle_type, we’ve added another dimension to our data, and increased the overall cardinality – now we’re tracking 4 values, so the cardinality of our metric is 4. The dimension we just added has 4 possible values, so the cardinality of vehicle_type is also 4.
We can continue to add more dimensions to our count of passing cars. Let’s also include whether we see a dog in the car! That would make our metric look something like this:
passed_cars_total{vehicle_type="sedan", has_dog="true"}: 2
passed_cars_total{vehicle_type="sedan", has_dog="false"}: 2
passed_cars_total{vehicle_type="van", has_dog="true"}: 1
passed_cars_total{vehicle_type="van", has_dog="false"}: 0
passed_cars_total{vehicle_type="suv", has_dog="true"}: 2
passed_cars_total{vehicle_type="suv", has_dog="false"}: 1
passed_cars_total{vehicle_type="truck", has_dog="true"}: 1
passed_cars_total{vehicle_type="truck", has_dog="false"}: 1
We can see above that now we have to track 2 values instead of 1 value for each car type. That brings our total to 8, since we have 4 different values for vehicle_type, and 2 possible values for has_dog. So our metric has a cardinality of 8, vehicle_type still has a cardinality of 4, and has_dog has a cardinality of 2. Easy so far, right?
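To make the counting concrete, here is a minimal Python sketch. It is only an illustration, not how any particular metrics backend stores data: each series is modeled as a dictionary of labels plus its count, and the helper names `metric_cardinality` and `label_cardinality` are made up for this example.

```python
# Series from the vehicle_type / has_dog breakdown above: (labels, count).
SERIES = [
    ({"vehicle_type": "sedan", "has_dog": "true"}, 2),
    ({"vehicle_type": "sedan", "has_dog": "false"}, 2),
    ({"vehicle_type": "van", "has_dog": "true"}, 1),
    ({"vehicle_type": "van", "has_dog": "false"}, 0),
    ({"vehicle_type": "suv", "has_dog": "true"}, 2),
    ({"vehicle_type": "suv", "has_dog": "false"}, 1),
    ({"vehicle_type": "truck", "has_dog": "true"}, 1),
    ({"vehicle_type": "truck", "has_dog": "false"}, 1),
]


def metric_cardinality(series):
    """Count the distinct label combinations, i.e. the unique series."""
    return len({frozenset(labels.items()) for labels, _ in series})


def label_cardinality(series, label):
    """Count the distinct values seen for a single label (hypothetical helper)."""
    return len({labels[label] for labels, _ in series if label in labels})


print(metric_cardinality(SERIES))                 # 8
print(label_cardinality(SERIES, "vehicle_type"))  # 4
print(label_cardinality(SERIES, "has_dog"))       # 2
```

Note that the series with a count of 0 still counts toward cardinality: what matters is that the label combination exists, not how large its value is.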
Let’s add one more dimension to our count of passing cars – we want to know how many of the cars are convertibles too. That’s another dimension that has two possible values, so updating our table should be similar to when we added the has_dog dimension, right? Not quite! This time our metric looks a little different when we add the new dimension:
passed_cars_total{vehicle_type="sedan", has_dog="true", is_convertible="true"}: 1
passed_cars_total{vehicle_type="sedan", has_dog="false", is_convertible="true"}: 1
passed_cars_total{vehicle_type="sedan", has_dog="true", is_convertible="false"}: 1
passed_cars_total{vehicle_type="sedan", has_dog="false", is_convertible="false"}: 1
passed_cars_total{vehicle_type="van", has_dog="true"}: 1
passed_cars_total{vehicle_type="van", has_dog="false"}: 0
passed_cars_total{vehicle_type="suv", has_dog="true"}: 2
passed_cars_total{vehicle_type="suv", has_dog="false"}: 1
passed_cars_total{vehicle_type="truck", has_dog="true"}: 1
passed_cars_total{vehicle_type="truck", has_dog="false"}: 1
What’s different from our last two dimensions? In this case, our new dimension is_convertible isn’t applicable to all of the values we have for vehicle_type: in our example only sedans can be convertibles. The sedan rows split into 4 series while the van, SUV, and truck rows stay at 2 each, so our metric’s cardinality has gone up to 4 + 2 + 2 + 2 = 10, instead of the 4 × 2 × 2 = 16 we might have initially expected. This is an important point to remember when you are looking at the cardinality of data: we’ll frequently have dimensions that only apply to a subset of our data. Because of this, you can’t just multiply together the cardinalities of your individual dimensions to know what the overall cardinality will be. The product is only an upper bound, so it’s better to measure the overall cardinality directly to make sure you get an accurate answer.
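The gap between the multiplied estimate and the measured value can be checked with a short Python sketch. The assumptions here are ours: each series is a tuple of label values, a label that does not apply is represented as `None`, and the function names are hypothetical.

```python
from math import prod

LABELS = ("vehicle_type", "has_dog", "is_convertible")

# Series from the is_convertible example above. A label that does not
# apply to a vehicle type is absent, modeled here as None.
SERIES = [
    ("sedan", "true", "true"),
    ("sedan", "false", "true"),
    ("sedan", "true", "false"),
    ("sedan", "false", "false"),
    ("van", "true", None),
    ("van", "false", None),
    ("suv", "true", None),
    ("suv", "false", None),
    ("truck", "true", None),
    ("truck", "false", None),
]


def per_label_cardinality(series):
    """Distinct values per label, ignoring series where the label is absent."""
    return {
        name: len({row[i] for row in series if row[i] is not None})
        for i, name in enumerate(LABELS)
    }


def naive_estimate(series):
    """Multiply the per-label cardinalities together (an upper bound)."""
    return prod(per_label_cardinality(series).values())


def measured_cardinality(series):
    """Count the label combinations that actually occur."""
    return len(set(series))


print(per_label_cardinality(SERIES))  # {'vehicle_type': 4, 'has_dog': 2, 'is_convertible': 2}
print(naive_estimate(SERIES))         # 16
print(measured_cardinality(SERIES))   # 10
```

The estimate overshoots because it assumes every combination of values can occur, while the measured count only includes the combinations we have actually seen.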
The tradeoffs of data cardinality
Now we’ve shown what happens to our data’s cardinality as we add dimensions. We were able to answer questions that we couldn’t when we started, such as how many convertibles with a dog have passed by. There are a couple of trade-offs here though. As we’ve seen, adding dimensions to our data has the potential to significantly increase the number of values we need to track. What if we also tracked what color each car was? Our table would get a lot bigger!
There’s another problem that comes with high cardinality as well. Originally we had a single number that gave us the number of cars that had passed by, but now if we want to know how many cars have gone by, we have to add 10 values together, so it’s a little more work. That’s a perfectly acceptable exchange for us here, but it’s something we have to be conscious of as we add dimensions to our data: the more cardinality we have, the more work we have to do to get answers to less specific questions. For example, a business measuring service KPIs might add a dimension that lets them break down key metrics by customer, but breaking those same metrics down by individual user would usually not give enough additional benefit to justify the explosion in cardinality it would cause.
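The extra work for less specific questions can be sketched as a roll-up over the series from our example. This is a hedged illustration in Python with hypothetical helper names (`total` and `sum_by`); a real metrics system would do the equivalent aggregation at query time.

```python
from collections import defaultdict

# The 10 series from the is_convertible example: (labels, count).
SERIES = [
    ({"vehicle_type": "sedan", "has_dog": "true", "is_convertible": "true"}, 1),
    ({"vehicle_type": "sedan", "has_dog": "false", "is_convertible": "true"}, 1),
    ({"vehicle_type": "sedan", "has_dog": "true", "is_convertible": "false"}, 1),
    ({"vehicle_type": "sedan", "has_dog": "false", "is_convertible": "false"}, 1),
    ({"vehicle_type": "van", "has_dog": "true"}, 1),
    ({"vehicle_type": "van", "has_dog": "false"}, 0),
    ({"vehicle_type": "suv", "has_dog": "true"}, 2),
    ({"vehicle_type": "suv", "has_dog": "false"}, 1),
    ({"vehicle_type": "truck", "has_dog": "true"}, 1),
    ({"vehicle_type": "truck", "has_dog": "false"}, 1),
]


def total(series):
    """Answer the original question: how many cars passed in total?"""
    return sum(count for _, count in series)


def sum_by(series, label):
    """Roll the series up to a single label, summing away the others."""
    totals = defaultdict(int)
    for labels, count in series:
        if label in labels:
            totals[labels[label]] += count
    return dict(totals)


print(total(SERIES))                   # 10
print(sum_by(SERIES, "vehicle_type"))  # {'sedan': 4, 'van': 1, 'suv': 3, 'truck': 2}
```

Every series has to be read to recover the single number we started with, so the cost of a broad question grows with the cardinality of the metric.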
Conclusion
Whether in databases or data analysis, data cardinality is a fundamental concept that clarifies how different sets of data relate to one another. Understanding it helps us design effective data models and handle information accurately and efficiently. As the volume of digital data keeps growing, keeping cardinality in mind leads to more coherent and insightful data management practices.