In an increasingly cloud-native world, companies are constantly searching for ways to monitor and manage their suite of microservices. That sounds straightforward, but the exponential rise in data production has complicated the situation. Today, cloud-native companies are operating at ever larger scales and developing more demanding, complex use cases. As a result, having a highly reliable, scalable, and efficient monitoring solution has become critical to these companies’ success.
Prometheus was developed back in 2012 in response to this growing need for cloud-native monitoring services. It has since become the de facto open-source metrics monitoring solution, and the one recommended by the Cloud Native Computing Foundation (CNCF). Prometheus’ ease of use (a single binary with out-of-the-box setup), text exposition format, query language (PromQL), and large ecosystem of exporters (custom integrations) have led to its widespread adoption. Today, companies around the world have integrated Prometheus into their existing cloud-native architectures to solve various monitoring use cases.
Prometheus was designed with an efficient data store that is optimized for quick queries and alerts. Because it stores data locally on disk, Prometheus is great for short-term use cases. However, when storing and querying longer-term (and larger-scale) data, it can easily become overwhelmed.
If you’ve reached the stage where you need to look for another monitoring solution, you’ve probably already started to feel the limitations of Prometheus at scale. Here are some of the top signs that your engineering organization may be outgrowing Prometheus.
Note: To learn more about the design and storage components of Prometheus that have led to these pain points or limitations, read our more detailed write-up on outgrowing Prometheus.
1. Your engineers are having trouble quickly locating data
In most high-volume use cases, users run multiple Prometheus instances to scrape metrics across their services, which avoids out-of-memory (OOM) failures or overwhelming a single instance. With this setup, federation across instances is needed to achieve a centralized view of data. At a large scale, this can quickly become unmanageable, because engineers need to know where each data point lives in order to query or alert against it (note: each federated instance only contains a subset of the data from the instances it federates). As a Prometheus cluster scales, more federation is needed to achieve a single point of query. And the more federated instances there are, the harder it is to track where specific data points live, which increases the management overhead for engineering teams.
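To make the "subset of data" point concrete, here is a minimal Python sketch of how a federating instance decides what to pull from each downstream instance. The `/federate` endpoint and its `match[]` parameter are part of Prometheus; the host names, job names, and the `FEDERATION_PLAN` mapping are hypothetical and exist only for illustration.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate scrape URL that pulls only series matching the selectors."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical per-service instances and the subset of series each one
# exposes to the federating instance. Anything not matched here stays
# behind on the per-service instance.
FEDERATION_PLAN = {
    "http://prom-checkout:9090": ['{job="checkout"}'],
    "http://prom-search:9090": ['{job="search"}', '{__name__=~"job:.*"}'],
}


def federation_targets(plan):
    """Return one federation URL per downstream instance in the plan."""
    return [federate_url(base, selectors) for base, selectors in plan.items()]


for url in federation_targets(FEDERATION_PLAN):
    print(url)
```

Every series that is not covered by a selector is only queryable on its original instance, so a mapping like this one is exactly what engineers end up having to maintain and remember as the number of instances grows.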
If you’re struggling to get a global view of data across a service’s multiple Prometheus instances, and/or your engineers are struggling to keep track of where metrics are located after federation, you’re outgrowing Prometheus.
2. You’re working on mission-critical services and can’t afford data loss
By default, Prometheus is not highly available (HA). This means that if the Prometheus instance scraping Service A goes down, you not only lose real-time monitoring of Service A, but you also temporarily lose access to its historical data. As a workaround, many users spin up a second instance for a given service (creating an HA pair) to ensure a full copy of the data remains if the other instance goes down. While an HA Prometheus setup can work for many use cases, it exposes various limitations at scale. One example is inconsistent query results: during rolling restarts of your instances, each instance has a gap in its data, and the load balancer returns data from either one instance or the other, so graphs show gaps depending on which instance answers. Unfortunately, Prometheus has no way of merging datasets to fill in these gaps.
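The following Python sketch illustrates what "merging datasets" would mean for an HA pair. It is not something Prometheus does; it is the kind of step a layer in front of, or in place of, the two instances would have to perform. It assumes, as a simplification, that both replicas record samples at identical timestamps, which real replicas generally do not. The function name and the sample data are hypothetical.

```python
def merge_replicas(primary, secondary):
    """Union two replicas' (timestamp, value) samples.

    Where both replicas have a sample for the same timestamp, the value
    from `primary` wins.
    """
    merged = dict(secondary)
    merged.update(primary)
    return sorted(merged.items())


# Each replica missed one scrape while it was being restarted.
replica_a = [(0, 1.0), (15, 1.2), (45, 1.9)]  # gap at t=30
replica_b = [(0, 1.0), (30, 1.5), (45, 1.9)]  # gap at t=15

print(merge_replicas(replica_a, replica_b))
# [(0, 1.0), (15, 1.2), (30, 1.5), (45, 1.9)]
```

Queried on its own, either replica returns a series with a hole in it; only the merged view is complete.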
If you can’t afford to lose any of your monitoring data, out-of-the-box Prometheus may not provide the support and durability you need. Any time you’re working on mission-critical applications that demand ultra-fast debugging, you don’t want to worry about data gaps or node failures.
3. You need to store and query your data at longer-term retention periods
Because it is optimized for storing short-term data, Prometheus does not have built-in downsampling capabilities. This means that if you want to view or store data over longer retention periods, you need to spin up a second instance for a given service (through federation) to store a subset of the data at a longer retention. The problem with this, however, is that by creating multiple data sources or dashboards for your service, you’re no longer able to view and query your metrics in a single place. Downsampling would allow you to store lower-resolution samples of a service’s metrics at longer retention periods within a single instance, freeing up capacity for more granular, short-term metrics.
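As a rough illustration of what downsampling involves, the Python sketch below collapses raw samples into fixed-width time buckets and keeps one averaged value per bucket. This is an assumption-laden example rather than how any particular system implements it: averaging is only one possible aggregation and suits gauge-style metrics, while counters would need different handling. The function name, bucket width, and sample data are hypothetical.

```python
from collections import defaultdict


def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns one (bucket_start, mean_value) pair per non-empty bucket,
    ordered by bucket start time.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Six raw samples at a 15-second scrape interval...
raw = [(0, 10.0), (15, 20.0), (30, 30.0), (45, 40.0), (60, 50.0), (75, 70.0)]

# ...reduced to one sample per minute for long-term storage.
print(downsample(raw, 60))
# [(0, 25.0), (60, 60.0)]
```

The long-term copy holds a fraction of the original samples, which is where the storage savings for extended retention come from.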
If you need a highly granular view of your data while maintaining metrics at longer retention periods for querying or alerting purposes, then you need a solution with built-in downsampling and efficient compression algorithms to optimize your data store and increase the utility of your metrics.
4. Your monitoring costs are out of control
In order to scale out your Prometheus operation (or to make it HA), more and more instances are needed. And the more instances you have, the more overhead it takes to manage the nodes, including maintaining awareness of the data within each node. Not only does this increase your spend through growing storage and compute costs, but it also means that engineers are forced to spend time managing the Prometheus cluster instead of focusing on higher-value work. The labor cost of maintaining Prometheus as an open-source monitoring system is actually one of its highest costs.
If your monitoring costs are growing and teams are spending more and more time maintaining the system, and you don’t know how to get this back under control, you’re probably ready for a more scalable cloud-native monitoring solution. Huge monitoring costs force engineering teams to make uncomfortable trade-offs between cost and performance.
While outcomes can vary by use case and by the amount of data being produced or consumed, the above four points are some of the leading indicators for outgrowing Prometheus that we have observed. It is possible to self-run and manage Prometheus at a large scale, but it’s important to understand the trade-offs when doing so, especially when it comes to the engineering overhead and costs needed to maintain a highly reliable and efficient infrastructure.
Many users who need a longer-term metrics store do not want to dedicate more and more time and resources to managing their Prometheus infrastructure. To help with the complex operation and management of Prometheus at scale, they have turned either to a hosted metrics monitoring solution, like Chronosphere, or to an open-source solution that is compatible with Prometheus remote storage, like M3.
Built on top of M3, Chronosphere is building a next-level cloud-native monitoring platform that is:
If your team is suffering from monitoring data that’s hard to find – because the data is completely lost or not collected for cost control reasons – it might be time to consider a cloud-native monitoring tool that’s built for massive scale. Reach out to contact@chronosphere.io or request a demo to see if Chronosphere may be a good fit for your monitoring needs.
Request a demo for an in-depth walkthrough of the platform!