In this series exploring what to look for in a Prometheus-native monitoring solution, we guide you through four considerations to keep in mind as you evaluate different approaches to monitoring cloud-native apps. We break it all down, covering:
- Availability and Reliability
- Cost & Control (current blog)
- Security & Administration
- Performance & Scale
We kicked off our first installment by defining what it means for any monitoring solution to qualify as “Prometheus-native.” We then highlighted availability and reliability as the first of four capabilities your monitoring vendor should offer. (You can catch up by reading Prometheus-native monitoring: Availability and reliability.)
In today’s installment, we explore why cost & control is a must-have in your next Prometheus-native monitoring solution.
Combating metrics data growth & resulting costs
Cloud-native environments emit a massive amount of monitoring data, especially as developers add more labels to their metrics, which causes large spikes in cardinality (the number of unique time series). As monitoring data volumes grow, so do costs, and this growth often outpaces the growth of overall infrastructure costs.
To combat metrics data growth (and ultimately costs), organizations need the ability to decide, based on business value, what data is kept, for how long, and at what resolution. This not only helps keep costs under control, but can also make it easier and faster to find the data needed to solve problems.
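To see why one extra label can cause a spike, here is a minimal sketch of the arithmetic. The label names and value counts are hypothetical, and the product is an upper bound, since not every combination of label values necessarily occurs in practice.

```python
from __future__ import annotations

from math import prod


def series_count(label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series for one metric:
    the product of the number of distinct values of each label."""
    return prod(label_value_counts.values())


# Hypothetical label sets, for illustration only.
before = {"service": 20, "endpoint": 10, "status": 5}
after = {**before, "pod": 200}  # a developer adds a per-pod label

print(series_count(before))  # 1000
print(series_count(after))   # 200000
```

In this made-up case, a single new label with 200 values multiplies the potential series count by 200.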
When it comes to your monitoring solution’s ability to offer cost and control, there are four areas to scrutinize:
1) Evaluate the vendor’s pricing model
Unfortunately, there is no single standard for pricing Prometheus-native monitoring solutions, so it can be hard to compare apples to apples on price. Instead, focus on determining if the pricing is:
- Easy to understand
- Structured so that it grows only as your value from the tool grows
2) Downsampling & aggregation
One of the ways many organizations deal with increasing volumes of data is by downsampling and aggregation. Downsampling stores historical metric data at a coarser resolution, so it can be read with better performance while reducing storage costs over time. Aggregation combines many series into fewer ones, which improves query performance on high-cardinality metric data.
Many tools on the market today rely on the native Prometheus method of downsampling and aggregation, called recording rules. Recording rules work by first ingesting all the raw metric data into the time series database (TSDB), then reading it back and writing the aggregated and downsampled metric data into the TSDB as well.
One disadvantage of this approach is that it does not reduce the overall volume of data stored, since all the raw, unaggregated data must be kept in order to create the aggregated data. In fact, the overall volume of data increases.
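As a rough illustration of the two ideas, the sketch below downsamples a series by averaging samples into fixed time buckets, and aggregates series by summing away one label. The function names, the choice of averaging and summing, and the sample data are assumptions made for this example, not how any particular product implements them.

```python
from collections import defaultdict


def downsample(samples, step):
    """Average (timestamp_seconds, value) samples into buckets of `step` seconds."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % step].append(value)  # key is the bucket start time
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


def aggregate(series, drop_label):
    """Sum the values of series that differ only in `drop_label`."""
    totals = defaultdict(float)
    for labels, value in series:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k != drop_label))
        totals[kept] += value
    return dict(totals)


# Hypothetical data: one sample every 15 seconds, reduced to one per minute.
raw = [(0, 1.0), (15, 3.0), (30, 5.0), (45, 7.0), (60, 9.0)]
print(downsample(raw, 60))  # [(0, 4.0), (60, 9.0)]

# Hypothetical per-pod series collapsed into one per-service series.
per_pod = [
    ({"service": "api", "pod": "a"}, 2.0),
    ({"service": "api", "pod": "b"}, 3.0),
]
print(aggregate(per_pod, "pod"))  # {(('service', 'api'),): 5.0}
```

Downsampling reduces the number of samples per series, while aggregation reduces the number of series.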
You want to look for a SaaS vendor that gives you the ability to optionally discard the raw underlying data and only keep the downsampled or aggregated data. This will heavily reduce the volume of persisted data and correspondingly reduce costs.
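The difference between the two approaches comes down to simple arithmetic. The sketch below compares how many series end up persisted when the raw data is kept alongside the aggregates and when it is discarded. The series counts are hypothetical, and real savings depend on your own data.

```python
def persisted_series(raw_series: int, aggregated_series: int, keep_raw: bool) -> int:
    """Series stored after aggregation, with or without the raw inputs."""
    return aggregated_series + (raw_series if keep_raw else 0)


# Hypothetical counts: 200,000 raw series aggregated down to 1,000.
raw, aggregated = 200_000, 1_000

kept = persisted_series(raw, aggregated, keep_raw=True)        # raw plus aggregates
discarded = persisted_series(raw, aggregated, keep_raw=False)  # aggregates only

print(kept)       # 201000
print(discarded)  # 1000
print(f"reduction: {1 - discarded / kept:.1%}")  # reduction: 99.5%
```

Keeping the raw data means storing more than you started with, while discarding it stores only the aggregates.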
3) Pick and choose retention and resolution policies
There are many different use cases for monitoring in modern cloud-native environments, and each has its own set of requirements. For example, a user may want to view per-container metrics and break them down by pod UUID (universally unique identifier) because this is critical for deployments. This data will be extremely high cardinality and useful in real time, but it becomes less useful over longer periods of time. After a year, no user will care about the specifics of a particular pod, since it will most likely have stopped existing long ago. Alternatively, specific aggregated metrics are useful for trend analysis over longer periods of time. Since there are so many different use cases for monitoring data, it does not make sense to try to fit a single data retention and resolution policy across all of them.
For maximum efficiency, you want a solution that allows you to pick and choose different retention and resolution policies for different subsets of your monitoring data – one that is tailored for each use case.
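One way to picture this is as an ordered list of policies, each matching a subset of the data by its labels. The sketch below is a hypothetical model: the `Policy` fields, the `level` label, and the specific resolutions and retention periods are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Policy:
    name: str
    matchers: Dict[str, str]  # label -> required value; empty matches everything
    resolution_seconds: int
    retention_days: int


def pick_policy(labels: Dict[str, str], policies: List[Policy]) -> Optional[Policy]:
    """Return the first policy whose matchers all equal the series' labels."""
    for policy in policies:
        if all(labels.get(k) == v for k, v in policy.matchers.items()):
            return policy
    return None


# Hypothetical policies: fine-grained but short-lived per-pod data,
# coarse but long-lived aggregates, and a catch-all default.
POLICIES = [
    Policy("per-pod", {"level": "pod"}, resolution_seconds=10, retention_days=7),
    Policy("trends", {"level": "aggregate"}, resolution_seconds=300, retention_days=730),
    Policy("default", {}, resolution_seconds=60, retention_days=90),
]

print(pick_policy({"level": "pod", "service": "api"}, POLICIES).name)  # per-pod
print(pick_policy({"service": "api"}, POLICIES).name)                  # default
```

Ordering matters in this sketch: the first match wins, so the catch-all policy goes last.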
4) Visibility is key to controlling costs
Being able to gain visibility into how much data each team or user is putting into the monitoring system, as well as how much data is fetched, is key to controlling rising costs over time. Ideally, this visibility is broken down into logical units that make sense for each company, such as by team or by user.
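A minimal sketch of this kind of breakdown is to count active series per owning team. It assumes, hypothetically, that each series carries a `team` label; in practice the attribution could come from any label or tenant identifier that fits your organization.

```python
from collections import Counter


def series_by_team(series_labels, team_label="team"):
    """Count series per team; series without the label are 'unattributed'."""
    counts = Counter()
    for labels in series_labels:
        counts[labels.get(team_label, "unattributed")] += 1
    return counts


# Hypothetical label sets of currently active series.
active = [
    {"__name__": "http_requests_total", "team": "checkout", "pod": "a"},
    {"__name__": "http_requests_total", "team": "checkout", "pod": "b"},
    {"__name__": "queue_depth", "team": "search"},
    {"__name__": "up"},
]

for team, count in series_by_team(active).most_common():
    print(team, count)
```

The "unattributed" bucket is worth watching too, since data nobody owns is hard to make value decisions about.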
Visibility goes hand-in-hand with enforcement when it comes to control. Ideally you can take the new visibility you’ve gained and use it to limit data writes or reads by groups. The goal of putting these limitations in place is to ensure that one user cannot impact another’s experience.
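Enforcement can be pictured as a per-group quota checked at write time. The sketch below is a simplified, hypothetical model that caps the number of distinct series each group may write. The class name, the limits, and the choice to limit by series count are assumptions for illustration, not a description of any vendor's mechanism.

```python
class WriteQuota:
    """Caps the number of distinct series each group may write."""

    def __init__(self, limits, default_limit):
        self.limits = limits                # group -> max distinct series
        self.default_limit = default_limit  # used for groups without an entry
        self.active = {}                    # group -> set of series ids seen

    def admit(self, group, series_id):
        """Return True if the write is accepted, False if the group is over quota."""
        seen = self.active.setdefault(group, set())
        if series_id in seen:
            return True  # existing series can always be appended to
        if len(seen) >= self.limits.get(group, self.default_limit):
            return False  # a new series would exceed this group's limit
        seen.add(series_id)
        return True


quota = WriteQuota(limits={"checkout": 2}, default_limit=1)
print(quota.admit("checkout", "a"))  # True
print(quota.admit("checkout", "b"))  # True
print(quota.admit("checkout", "c"))  # False
print(quota.admit("search", "x"))    # True
```

Because each group is tracked separately, the "checkout" group hitting its limit has no effect on "search".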
You’ll want a solution that lets rate-limited end users resolve the issue themselves, without requiring them to redeploy or reconfigure their instrumentation. This leads to faster resolution.
The solution does not have to lead to data being dropped – downsampling, aggregation, and control over resolution and retention are all mechanisms that allow the end user to make choices around their data in order to optimize for both the use case and cost.
Next time, in part three of this series, we will explore considerations around performance & scale. Stay tuned.
Other blogs in this series:
Prometheus-native monitoring: Availability and reliability
Prometheus-native monitoring: Security & administration
Prometheus-native monitoring: Performance & scale