Know the hidden costs of DIY Prometheus

on April 4th 2023

When it arrived on the scene, the Prometheus open source system monitoring toolkit gave overworked observability teams a way to succeed in the modern business world. It offered a metrics-based observability system for verifying that their environments were working as needed.

Yet building out your own Prometheus instance often takes you only so far, and businesses are finding that running Prometheus in-house is neither scalable nor reliable enough for their rapidly growing cloud native environments.

Why start with Prometheus for metrics-based observability?

Do it yourself (DIY) Prometheus is a natural starting point for many companies as they begin their cloud native journey. It’s free, it’s open source, and there are great community contributions and support. 

However, as their cloud native environment grows and engineers demand more data to optimize their apps and infrastructure, Prometheus requires a more complex architecture, and more staff bandwidth, to scale. Eventually, nearly every organization reaches a point where managing a complex Prometheus implementation in-house is anything but free. Observability ends up costing more, and consuming more engineering resources, than the production environment it monitors.

What are the four common challenges of DIY Prometheus? 

1. Data becomes hard to find

You know you’re bumping up against the limits of Prometheus when engineers complain that they can’t locate observability data quickly. To scale Prometheus, you need to spin up separate instances and have each instance scrape and store data from specific services. This manually shards the load across your Prometheus instances, but it causes problems as you scale.
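A minimal sketch of what manual sharding implies, assuming a hypothetical table that maps each Prometheus instance to the services it scrapes. The host names and service names are made up for illustration. Every dashboard, alert and engineer has to consult something like this before running a query:

```python
# Hypothetical shard map: which Prometheus instance scrapes which services.
# All host names and service names below are illustrative assumptions.
SHARD_MAP = {
    "http://prom-a.example.internal:9090": {"checkout", "cart"},
    "http://prom-b.example.internal:9090": {"search", "catalog"},
    "http://prom-c.example.internal:9090": {"auth"},
}


def instance_for(service: str) -> str:
    """Return the Prometheus instance that holds data for one service."""
    for instance, services in SHARD_MAP.items():
        if service in services:
            return instance
    raise KeyError(f"no Prometheus instance scrapes {service!r}")


def instances_for(services: list[str]) -> set[str]:
    """A dashboard that spans several services may need several instances."""
    return {instance_for(service) for service in services}


print(instance_for("search"))
print(sorted(instances_for(["checkout", "auth"])))
```

The table itself is the hidden cost: it has to be kept up to date by hand each time a service moves or a new instance is added.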

Dashboards and alerts for Prometheus instances

From a dashboarding and alerting perspective, you need to tell each dashboard or alert which Prometheus instance to query for its data. You may also have a single dashboard or alert that needs data from multiple Prometheus instances, so you federate instances, copying a subset of data from the original instances into a federating one.

The bottom line is that scaling Prometheus leads to more federated nodes, which leaves you with a much more complicated Prometheus topology. And once you do this across zones or regions, you need yet another Prometheus instance to federate and combine the data from each zone or region. Engineers need to remember which Prometheus instance contains the data they are looking for. You’ll likely hear from them that it simply takes too long to find data, run queries and fix issues.
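To make federation a little more concrete: Prometheus exposes a /federate endpoint that accepts one or more match[] selectors, and a federating instance scrapes that endpoint to pull the selected series. The sketch below only builds such URLs. The host names and selectors are hypothetical, and a real setup would put the equivalent settings in the federating instance's scrape configuration rather than in Python:

```python
from urllib.parse import urlencode


def federate_url(base_url: str, selectors: list[str]) -> str:
    """Build the /federate URL a federating instance would scrape."""
    # Each selector becomes its own match[] query parameter.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical per-zone instances; one URL per instance to federate from.
ZONE_INSTANCES = [
    "http://prom-zone-a.example.internal:9090",
    "http://prom-zone-b.example.internal:9090",
]
SELECTORS = ['{job="checkout"}', '{__name__=~"job:.*"}']

for base in ZONE_INSTANCES:
    print(federate_url(base, SELECTORS))
```

Each additional zone or region adds another entry to this list, and another place where a dashboard's data might live.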

2. Poor reliability results in data loss 

Out-of-the-box Prometheus is a single point of failure: if it goes down, you lose active data and access to historical data. The usual recommendation is therefore to run two instances that both scrape the same endpoints. This way, if one goes down you still have a copy of your metrics.

Relying on dashboards

Another best practice is to put a load balancer in front of those instances and point your dashboards at the load balancer. This generally works for reliability in the sense that you always get one copy of the data. The problem is that if you are doing rolling restarts of your Prometheus instances, each instance has a gap in its data for the time it was down and restarting, and that gap appears whenever a query lands on that instance. Again, the bottom line is that you may need a longer-term or remote storage solution, or may need to distribute instances across multiple cloud regions and cloud providers for fault tolerance. This adds further complexity and operational burden for your engineering teams.
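A small simulation shows why a rolling restart appears as a gap. It assumes two replicas scraping every 30 seconds and restart windows chosen purely for illustration. Each replica on its own is missing samples, while a merged view has none missing. The merge step here stands in for what a remote storage layer might do, which is an assumption and not a description of any particular product:

```python
SCRAPE_INTERVAL = 30  # seconds; assumed scrape interval


def scrape_times(start, end, down_from, down_until):
    """Timestamps a replica scraped, skipping the window it was restarting."""
    return [
        t for t in range(start, end, SCRAPE_INTERVAL)
        if not (down_from <= t < down_until)
    ]


def gaps(timestamps, interval=SCRAPE_INTERVAL):
    """Return (previous, next) pairs separated by more than one interval."""
    return [
        (a, b) for a, b in zip(timestamps, timestamps[1:]) if b - a > interval
    ]


# Rolling restart: replica A restarts first, replica B afterwards.
replica_a = scrape_times(0, 600, down_from=120, down_until=240)
replica_b = scrape_times(0, 600, down_from=300, down_until=420)

# A load balancer sends each query to one replica, so either gap can surface.
print("replica A gaps:", gaps(replica_a))
print("replica B gaps:", gaps(replica_b))

# Merging both replicas' samples removes the gaps in this scenario.
merged = sorted(set(replica_a) | set(replica_b))
print("merged gaps:", gaps(merged))
```

The merged view is only complete because the two restart windows do not overlap; if both replicas are down at once, no amount of merging recovers the missing samples.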

3. Longer data retention gets expensive

Teams often ask to retain more data for longer so they can troubleshoot more effectively. However, Prometheus is not efficient at storing long-term data: it has no built-in downsampling capabilities.

As an example, storing one instance for six months at a scrape interval of 30 seconds takes approximately 8,100 KB. Downsampled to a one-hour resolution, the same six months would take approximately 67.5 KB, which is 120 times less. So as you store more and more long-term data, downsampling becomes very valuable for efficiency. There are some workarounds, but they add complexity and take engineering time to manage.
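The two figures above can be reproduced with simple arithmetic under a few assumptions that the text does not state: a single time series, six months counted as 180 days, 16 uncompressed bytes per sample (an 8-byte timestamp plus an 8-byte value), and kilobytes of 1,024 bytes. A rough sketch:

```python
BYTES_PER_SAMPLE = 16  # assumption: 8-byte timestamp + 8-byte value, uncompressed
DAYS = 180             # assumption: six months counted as 180 days


def stored_kb(interval_seconds: int, days: int = DAYS) -> float:
    """Approximate storage in KB (1,024 bytes) for one series."""
    samples = days * 24 * 3600 // interval_seconds
    return samples * BYTES_PER_SAMPLE / 1024


raw = stored_kb(30)            # one sample every 30 seconds
downsampled = stored_kb(3600)  # one sample every hour

print(raw)                # 8100.0
print(downsampled)        # 67.5
print(raw / downsampled)  # 120.0
```

Real on-disk sizes will differ because Prometheus compresses samples, but the 120x ratio between the two resolutions holds regardless of the per-sample size, since it depends only on the number of samples kept.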

4. Data growth forces tough trade-offs

A clear sign you’re bumping up against the limits of DIY Prometheus is being forced to make difficult trade-offs between data collection and cost. In a perfect world, we capture everything so we always have the data we need. But in practical terms, the sheer volume of observability data grows faster than your production environment as you transition from cloud to cloud native.

If you were running on VMs and are now running on containers, your infrastructure and cloud bills are pretty much the same, with the same cluster size. But instead of tens of VMs, you’re now running hundreds or thousands of containers, each of which generates the same amount of telemetry data as a VM did. Your observability costs can end up higher than the cost of the infrastructure supporting your apps. If you find yourself cutting back on monitoring as you move to containerized applications, it’s likely time for a more scalable solution.
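A back-of-the-envelope sketch of that growth, using made-up numbers: 20 VMs replaced by 2,000 containers, each emitting a hypothetical 500 time series. None of these figures come from a real environment; they only illustrate how telemetry volume scales with the number of monitored entities while the cluster size stays flat:

```python
SERIES_PER_ENTITY = 500  # hypothetical: series emitted per VM or container


def active_series(entities: int, series_per_entity: int = SERIES_PER_ENTITY) -> int:
    """Total series when every entity emits the same amount of telemetry."""
    return entities * series_per_entity


vm_series = active_series(20)            # tens of VMs (hypothetical count)
container_series = active_series(2000)   # thousands of containers (hypothetical)

print(vm_series)                      # 10000
print(container_series)               # 1000000
print(container_series / vm_series)   # 100.0
```

In this example the infrastructure footprint is unchanged, yet the metrics platform has to ingest, store and query one hundred times as many series.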

So if Prometheus can’t keep up, what’s to be done? When you’ve gotten as much as you can out of your DIY Prometheus implementation, it’s time to consider a Prometheus alternative. When evaluating solutions, an important consideration is how that solution can leverage the investment you’ve made in your existing Prometheus environment, specifically: 

  • Instrumentation
  • Data collection
  • Data presentation (dashboards and alerts)

A managed solution should leverage your instrumentation and data presentation, but alleviate the increasing cost and operational burden of managing an observability platform in-house.

How Chronosphere can help

Chronosphere was built from the ground up for cloud native scale, complexity and reliability. Chronosphere helps engineers be more productive by giving them faster and more actionable alerts that they can triage rapidly. Plus, it allows them to spend less time on monitoring instrumentation and more time delivering innovation that grows your business. 

The data

According to Forrester Research, a typical Chronosphere customer sees a 165% return on investment and $7.75M in benefits over three years. The average customer reduces their observability data volumes by 48% after transformation, while improving their observability metrics.

To learn more, read the Forrester Total Economic Impact study.
