When it arrived on the scene, the Prometheus open source system monitoring toolkit gave overworked observability teams a metrics-based way to confirm that their environments were working as needed.

Yet building out your own Prometheus instance often can only take you so far, and businesses are finding out that running Prometheus in-house is neither scalable nor reliable enough to handle their rapidly growing cloud native environments. 

Why start with Prometheus for metrics-based observability?

Do it yourself (DIY) Prometheus is a natural starting point for many companies as they begin their cloud native journey. It’s free, it’s open source, and there are great community contributions and support. 

However, as their cloud native environment grows and engineers demand more data to optimize their apps and infrastructure, Prometheus requires a more complex architecture, and more staff bandwidth, to scale. Nearly every organization eventually reaches a point where managing a complex Prometheus implementation in-house is anything but free: the monitoring setup can end up costing more, and consuming more engineering resources, than the production environment it monitors.

What are the four common challenges of DIY Prometheus? 

1. Data becomes hard to find

You know you’re bumping up against the limits of Prometheus when engineers complain that they can’t locate observability data quickly. To scale Prometheus, you need to spin up separate instances and have each instance scrape and store data from specific services. This manually shards the load across your Prometheus instances, but it can cause problems as you scale.

Dashboards and alerts for Prometheus instances

From a dashboarding and alerting perspective, you need to tell each dashboard or alert which Prometheus instance to query for its data. A single dashboard or alert may also need data from multiple Prometheus instances, in which case you federate those instances and pull a subset of data from the original instances into another one.

The bottom line is that scaling Prometheus leads to more federated nodes, which leads you to having a much more complicated Prometheus structure. And as you do this across zones or regions, you need to federate the data in another Prometheus instance and combine that across both zones or regions. Engineers need to remember which Prometheus instance contains the data they are looking for. You’ll likely hear from engineers that it just takes too long to find data, run queries and fix issues.
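To make the lookup problem concrete, here is a minimal Python sketch of manual sharding. The instance names, service names, and helper functions are hypothetical and exist only for illustration. The sketch shows how a dashboard that spans services on different instances ends up needing data from more than one place.

```python
# Hypothetical manual sharding: each Prometheus instance scrapes a fixed set of services.
SHARDS = {
    "prometheus-a": ["checkout", "payments"],
    "prometheus-b": ["search", "catalog"],
    "prometheus-c": ["auth"],
}


def instance_for(service):
    """Return the name of the instance that holds data for a service."""
    for instance, services in SHARDS.items():
        if service in services:
            return instance
    raise KeyError(f"no Prometheus instance scrapes {service!r}")


def instances_for_dashboard(services):
    """Return every instance a dashboard has to query to cover its services."""
    return {instance_for(service) for service in services}


def needs_federation(services):
    """A dashboard that spans more than one shard needs combined (federated) data."""
    return len(instances_for_dashboard(services)) > 1
```

Every new service or instance means updating a mapping like this one, which is exactly the knowledge engineers otherwise have to carry in their heads.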

2. Poor reliability results in data loss 

Out-of-the-box Prometheus is a single point of failure: if it goes down, you lose active data and access to historical data. That’s why it’s always recommended to run multiple instances that scrape the same endpoints. This way, if one goes down, you still have a copy of your metrics.

Relying on dashboards

Another best practice is to run load balancers and point your dashboard instance to the load balancer. This generally works for reliability in the sense that you get one copy of the data. The problem is that if you are doing rolling restarts of your Prometheus instances, then you’ll come across a gap in your data as the Prometheus instance is down and restarting. Again, the bottom line is you may need a longer-term storage solution or remote storage solution or perhaps distribute across multiple cloud regions and cloud providers for fault tolerance. This again adds complexity and an operational burden on your engineering teams.
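The gap problem can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not how any particular storage backend works: the sample values are made up, and both replicas are assumed to scrape at the same aligned timestamps, which real replicas generally do not. It shows that each replica on its own has a hole from its restart, while a merged view of both does not.

```python
def merge_replicas(primary, secondary):
    """Combine samples from two replicas, keyed by scrape timestamp in seconds.

    Where both replicas have a sample for a timestamp, the primary's value wins.
    """
    merged = dict(secondary)
    merged.update(primary)
    return dict(sorted(merged.items()))


def gaps(samples, start, end, step):
    """List the expected scrape timestamps that have no sample."""
    return [ts for ts in range(start, end + 1, step) if ts not in samples]


# Hypothetical rolling restart: replica A misses t=60 and t=90, replica B misses t=150.
replica_a = {0: 1.0, 30: 2.0, 120: 5.0, 150: 6.0, 180: 7.0}
replica_b = {0: 1.0, 30: 2.0, 60: 3.0, 90: 4.0, 120: 5.0, 180: 7.0}

merged = merge_replicas(replica_a, replica_b)
```

A load balancer that sends each query to just one replica returns either `replica_a` or `replica_b`, gaps included. Combining the copies is the kind of work that remote or long-term storage layers take on.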

3. Longer data retention gets expensive

Teams often ask to retain more data for longer so they can troubleshoot more effectively. However, Prometheus is not efficient at storing long-term data, and it has no built-in downsampling capabilities.

As an example, storing a single time series for six months at a scrape interval of 30 seconds takes approximately 8,100 KB. If you were able to downsample to a one-hour resolution for those six months, it would take approximately 67.5 KB. So as you store more and more long-term data, downsampling becomes very valuable for efficiency. There are some workarounds, but they add complexity and take engineering time to manage.
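The arithmetic behind those two figures can be reproduced with a short Python sketch. The inputs are assumptions chosen because they match the numbers quoted above, not measurements of Prometheus's on-disk format: 16 bytes per sample, six months taken as 180 days, and 1 KB taken as 1,024 bytes.

```python
BYTES_PER_SAMPLE = 16    # assumption: chosen because it reproduces the quoted figures
RETENTION_DAYS = 180     # assumption: "six months" taken as 180 days
SECONDS_PER_DAY = 24 * 60 * 60


def series_storage_kb(interval_seconds):
    """Approximate storage for one series over the retention period, in KB (1,024 bytes)."""
    samples = RETENTION_DAYS * SECONDS_PER_DAY // interval_seconds
    return samples * BYTES_PER_SAMPLE / 1024


raw = series_storage_kb(30)            # 30-second scrape interval
downsampled = series_storage_kb(3600)  # one-hour resolution

print(raw, downsampled, raw / downsampled)  # prints: 8100.0 67.5 120.0
```

Under these assumptions, the one-hour resolution keeps 120 times fewer samples, which is where the savings come from.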

4. Data growth forces tough trade-offs

A clear sign you’re bumping up against the limits of DIY Prometheus is you’re being forced to make difficult data collection vs. cost trade-offs. In a perfect world, we capture everything so we always have the data we need. But in practical terms, the sheer volume of observability data as you transition from cloud to cloud native is increasing at a faster rate than your production environment. 

If you were running on a VM and now you’re running on containers, your infrastructure and cloud bills are pretty much the same, with the same cluster size. But instead of tens of VMs, you’re now running hundreds or thousands of containers, each of which is generating the same amount of telemetry data as the VMs. Your observability costs are higher than the infrastructure supporting your apps. If you’re reducing monitoring as you move to containerized applications, it’s likely time for a more scalable solution.

So if Prometheus can’t keep up, what’s to be done? When you’ve gotten as much as you can out of your DIY Prometheus implementation, it’s time to consider a Prometheus alternative. When evaluating solutions, an important consideration is how that solution can leverage the investment you’ve made in your existing Prometheus environment, specifically: 

A managed solution should leverage your instrumentation and data presentation, but alleviate the increasing cost and operational burden of managing an observability platform in-house.

How Chronosphere can help

Chronosphere was built from the ground up for cloud native scale, complexity and reliability. Chronosphere helps engineers be more productive by giving them faster and more actionable alerts that they can triage rapidly. Plus, it allows them to spend less time on monitoring instrumentation and more time delivering innovation that grows your business. 

The data

According to Forrester Research, a typical Chronosphere customer sees a 165% return on investment and $7.75M in benefits over three years. The average customer reduces their observability data volumes by 48% after transformation, while improving their observability metrics.

To learn more, read the Forrester Total Economic Impact study.

Recently we wrote about why the evolution of observability is naturally migrating toward open source. In that post, we also mentioned that there has been an explosion of open source tools that solve various problems of the complex puzzle that is observability.

In this post, we’ll help you sort through some of the noise as we discuss the major open source observability tools.

CNCF projects

We give particular emphasis to those projects that are part of the Cloud Native Computing Foundation (CNCF) or at least where the project’s sponsoring company is a CNCF member. The CNCF has established itself as the standard bearer of open source and is home to the Kubernetes project, which has transformed cloud native computing.

The CNCF categorizes its projects based on maturity level, with three progressive levels: sandbox, incubating, and graduated.

We limit this list to projects that have achieved incubating or graduated status, i.e., those known to be used successfully in production. Users should feel completely confident in any project that has achieved graduated status. Your risk tolerance should inform your adoption of incubating projects. Learn more about CNCF project maturity levels.

Prometheus

Prometheus is a monitoring and alerting system written in Go that collects metrics data and stores it in a time series database. It includes a powerful query language called PromQL (Prometheus Query Language) that lets users select and aggregate time series data in real time. Prometheus is commonly used in conjunction with Grafana (see below) for visualizing the data. Alerting is handled through Prometheus Alertmanager.

Prometheus was the second CNCF-hosted project after Kubernetes, so it has been battle-tested in production environments for years. It is a standalone open source project and is maintained independently of any company.

CNCF Status: Graduated
License: Apache 2.0
GitHub: https://github.com/prometheus/prometheus

Fluentd

Fluentd is a data collector written in a combination of Ruby and C that is used primarily for collecting logs from sources and sending them to desired destinations. It utilizes a plugin architecture for integrations with sources and destinations and currently has over 800 plugins available. No matter how obscure your endpoint, there is probably a Fluentd plugin for it. It also offers plugins for filtering or parsing the data before delivery.

Fluentd is another longtime project, originally released in 2011 and joining CNCF shortly after Prometheus in 2016.

CNCF Status: Graduated
License: Apache 2.0
GitHub: https://github.com/fluent/fluentd

Fluent Bit

Fluent Bit was originally intended to be an alternative to Fluentd. Written in C with a much smaller footprint and CPU utilization, it was created for use in containerized and embedded environments. However, in the last few years, its scope has expanded significantly. In addition to logs, it can also handle metrics and trace data, making it a single telemetry pipeline agent capable of handling all the traditional three pillars of observability. It also recently added eBPF capability through an integration with Aquasec’s Tracee tool, making it compatible with the next generation of observability data to be mined for insights.

Like Fluentd, Fluent Bit utilizes a plugin architecture for integrations with sources and destinations as well as parsing and filtering. It lacks the breadth of its older sibling’s integrations, but the current plugins cover the vast majority of production use cases. It also supports creating plugins in Go and WASM, making it much easier to develop custom plugins should the need arise. Fluent Bit’s ability to filter and parse data mid-stream exceeds Fluentd’s capabilities, and custom filters can also be written in Lua. As of Fluent Bit v2.0, it provides native support for the OpenTelemetry Protocol.

Note: Fluent Bit is sponsored and maintained by Chronosphere.

CNCF Status: Graduated (under the umbrella of Fluentd)
License: Apache 2.0
GitHub: https://github.com/fluent/fluent-bit

Jaeger

Jaeger is the final CNCF graduated project on our list. It allows developers to monitor and troubleshoot transactions in distributed systems by visualizing the chain of events in these microservice interactions. Jaeger connects data from different components to create a complete end-to-end trace.

Jaeger was created in 2015 by engineers at Uber to meet their needs for tracing on Uber microservices and was donated to CNCF in 2017. It achieved graduated status in 2019, the seventh project to do so. As of May 2022, it provides native support for the OpenTelemetry Protocol.

CNCF Status: Graduated
License: Apache 2.0
GitHub: https://github.com/jaegertracing/jaeger

OpenTelemetry

OpenTelemetry has taken the open source world by storm. Formed from the merger of the OpenCensus and OpenTracing projects in 2019, it is now the second-highest velocity project in the CNCF ecosystem, closely trailing Kubernetes. Its popularity is understandable given its goal of unifying tracing, metrics, and logging telemetry standards. It is, essentially, bringing law and order to what has been a wild west.

More than just standards, though, OpenTelemetry has evolved to become a collection of tools, APIs, and SDKs. Although it is still considered a CNCF incubating project, it has become so widely embraced that even commercial applications that have traditionally benefited from closed architectures are now loudly proclaiming their integrations with OTel. At this juncture, it seems safe to say that regardless of what your observability stack entails—open source, commercial products, or a mixture—it should be compatible with OpenTelemetry.

Note: OpenTelemetry applied for CNCF graduated status in March 2024. It is widely expected that the application will be approved.

CNCF Status: Incubating
License: Apache 2.0
GitHub: https://github.com/open-telemetry

Chaos Mesh and Litmus

Chaos Mesh and Litmus are both CNCF projects at the incubating stage that provide chaos engineering platforms. A relatively new discipline, chaos engineering tries to break systems through controlled experiments using random and unpredictable behavior in order to collect information about the failure. Both Chaos Mesh and Litmus were admitted to the CNCF in 2020.

CNCF Status: Incubating
License: Apache 2.0
GitHub (Chaos Mesh): https://github.com/chaos-mesh/chaos-mesh
GitHub (Litmus): https://github.com/litmuschaos/litmus

Thanos and Cortex

Thanos and Cortex both seek to make Prometheus highly available and horizontally scalable with long-term storage. The two projects clearly share many components with Prometheus, but they take a fundamentally different approach to how these pieces are joined together. Both are CNCF projects at the incubation stage.

CNCF Status: Incubating
License: Apache 2.0
GitHub (Thanos): https://github.com/thanos-io/thanos
GitHub (Cortex): https://github.com/cortexproject/cortex

OpenSearch

OpenSearch is a fork of the very popular Elasticsearch search and analytic suite. It was created in 2021 after Elastic (the parent company of Elasticsearch) changed the project’s license from Apache 2.0 to be dual licensed under the Elastic License (their own creation) and Server Side Public License (SSPL) in a move to make it difficult for cloud companies to sell managed versions of Elasticsearch. It was a move directly targeted at AWS and followed a similar move by MongoDB a few years earlier. AWS fired back that the change meant that Elasticsearch was no longer truly open source and announced the fork that eventually became OpenSearch.

Update: On August 29, 2024, Elastic announced they would be adding AGPL as a licensing option, making Elasticsearch truly open source again.

Like Elasticsearch, enterprises often already use OpenSearch to store and analyze their business, operational, and security data, so when they adopt an observability program many of the tools needed are already in place.

OpenSearch is not a CNCF project, although AWS is a platinum member of the CNCF.

CNCF Status: N/A
License: Apache 2.0
GitHub: https://github.com/opensearch-project/OpenSearch

Grafana

Created in 2014, Grafana is a powerful data visualization platform. It is commonly paired with Prometheus and allows users to create dashboards from which to monitor system performance. It utilizes a plugin system to integrate with data sources such as Prometheus or dozens of other options, open source and commercial. It began as a solution specifically for visualizing metrics, but as with so many other projects that originally focused on one pillar of the observability trio, it now supports logs, metrics, and traces.

Grafana is not a CNCF project; however, its parent company Grafana Labs is also a platinum member of the CNCF, like AWS. Grafana is also the only project to make our list that is not available under the Apache 2.0 license, instead utilizing the more restrictive Affero General Public License (AGPL) v3. The move came in 2021, shortly after Elastic announced its decision to abandon Apache 2.0. In a statement announcing the move, Grafana Labs said that while they wanted more protection than Apache 2.0 offered, they wanted to remain with an Open Source Initiative (OSI) approved license and felt that AGPLv3 offered a good compromise.

CNCF Status: N/A
License: AGPL 3.0
GitHub: https://github.com/grafana/grafana

Building your observability stack

Whether you are going with only open source solutions, commercial ones, or a hybrid approach, creating an observability program is a difficult task, and there is no one right solution for everyone. Hopefully, this guide has helped you to identify the major open source solutions available to you.

How Chronosphere can help

With Chronosphere’s acquisition of Calyptia in 2024, Chronosphere became the primary corporate sponsor of Fluent Bit. Eduardo Silva — the original creator of Fluent Bit and co-founder of Calyptia — leads a team of Chronosphere engineers dedicated full-time to the project, ensuring its continuous development and improvement.

Chronosphere Telemetry Pipeline streamlines log collection, aggregation, transformation, and routing from any source to any destination and also provides the ability to manage Fluent Bit agents as fleets. This gives companies that are dealing with high costs and complexity the ability to control their data and scale with their growing business needs.

Chronosphere Observability Platform is the only observability platform built for control. Recognized as a leader by major analyst firms, Chronosphere empowers customers to focus on the data and insights that matter by reducing data complexity, optimizing costs, and remediating faster.

Talk to us today to learn how Chronosphere can help you control your observability efforts.

by Julius Volz


This is a guest article written by Prometheus co-founder Julius Volz of PromLabs in partnership with Chronosphere. Prometheus is an open source project hosted by the Cloud Native Computing Foundation (CNCF) under an open governance. PromLabs is an independent company created by Julius Volz with a focus on Prometheus training and other Prometheus-related services.

If you’re starting out with cloud native technologies like Kubernetes, you are going to need a monitoring system that is capable of monitoring both your infrastructure and the services that run on top of it. Prometheus is an open source metrics-based monitoring and alerting stack that has become the de facto standard tool of choice for monitoring cloud native infrastructure and environments.

This article dives into five reasons that make it a no-brainer to choose Prometheus as part of your monitoring strategy when adopting cloud native technologies:

  1. Kubernetes and Prometheus work together seamlessly since they have excellent cross-support for each other.
  2. Prometheus comes with a powerful query language that allows you to flexibly select and transform your metrics data for dashboarding, alerting, debugging, and other use cases, in a unified way.
  3. Prometheus’ large adoption and mature ecosystem make it easy to find existing solutions for a variety of use cases or to get help from the community.
  4. The de facto standardization and portability of Prometheus make it possible to build your monitoring pipeline around a common standard while still being able to switch and migrate between compatible implementations with different tradeoffs as needed.
  5. Open source and open governance mean that you can trust the long-term health of the project, inspect and modify its source code, or even get involved with its development community, when needed.

Let’s explore each of these points in more detail!

Seamless integration with Kubernetes

Whether your infrastructure is running on-premises or in the cloud, most cloud native environments use Kubernetes at their core to run services and other compute workloads. This makes it critical to have a monitoring system that integrates well with Kubernetes – both to monitor Kubernetes itself, as well as the services that run on top of it.

Luckily, Prometheus and Kubernetes have excellent native cross-support for each other that meets this need:

This seamless integration between the two systems makes Prometheus an ideal choice for monitoring Kubernetes environments. Most organizations using Kubernetes will integrate with its Prometheus interfaces in one way or another, even if they do not directly use Prometheus as a monitoring server.

Thus, choosing Prometheus produces the least amount of friction when getting started with cloud native environments: It just works out of the box, and you don’t need any translation layers in your metrics collection pipeline that might reduce the fidelity of the collected data.

Powerful querying and precise alerting

After collecting and storing your metrics, Prometheus allows you to query this data in powerful ways using the Prometheus Query Language (PromQL). PromQL is a flexible query language built specifically for selecting and transforming metrics data. It allows you to evaluate operations such as:

Below is an example of a PromQL query that selects the percentage of HTTP requests that a service has handled over the last five minutes (grouped by the request’s path and status code) that have resulted in a 5xx error status code, limited to results for which that percentage is greater than 5%:

  sum by(path, status) (
    rate(
      http_requests_total{job="demo",status=~"5.."}[5m]
    )
  )
/ on(path) group_left
  sum by(path) (
    rate(
      http_requests_total{job="demo"}[5m]
    )
  )
* 100
> 5

Given its capabilities, PromQL can unify many different use cases under a single query language. For example, PromQL is a great fit for:

The flexibility and generality of PromQL make Prometheus a great tool for getting the most use out of your collected metrics data. If you want to learn more about PromQL, you can read my earlier blog post about the top 3 PromQL queries to get you started!
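For readers who find it easier to follow the error-percentage query from earlier in this section as ordinary code, here is a rough Python sketch of the same arithmetic. It is not how Prometheus evaluates PromQL. The input dictionary stands in for the per-second rates that `rate(http_requests_total{job="demo"}[5m])` would return, and the paths and numbers are made up.

```python
# Hypothetical per-second request rates over the last five minutes,
# keyed by (path, status).
rates = {
    ("/api/foo", "200"): 9.0,
    ("/api/foo", "500"): 1.0,
    ("/api/bar", "200"): 49.0,
    ("/api/bar", "503"): 1.0,
}


def error_percentages(rates, threshold=5.0):
    """Percentage of a path's requests per 5xx status, kept only above the threshold."""
    # Denominator: total rate per path, like sum by(path) over all statuses.
    totals = {}
    for (path, _status), rate in rates.items():
        totals[path] = totals.get(path, 0.0) + rate

    result = {}
    for (path, status), rate in rates.items():
        # Same selection as status=~"5..": a "5" followed by exactly two characters.
        if len(status) != 3 or not status.startswith("5"):
            continue
        percent = rate * 100 / totals[path]
        if percent > threshold:  # the trailing "> 5" filter
            result[(path, status)] = percent
    return result


print(error_percentages(rates))
```

With these inputs, only `/api/foo` with status `500` passes the filter, at 10% of that path's traffic. `/api/bar` sits at 2% and is dropped.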

Large adoption and mature ecosystem

We initially created Prometheus in 2012 and fully published it in 2015. This has given the project more than 10 years to become mature and stable enough to be adopted as a mission-critical monitoring system in many enterprises. Whether it’s large tech companies, smaller startups, or even traditional institutions and banks, you can find Prometheus everywhere by now.

This large adoption has led to a large community of users and builders springing up who have contributed their own Prometheus integrations. As of December 2023, the Prometheus ecosystem officially includes:

The main Prometheus server open source repository has also accumulated 50K+ stars on GitHub over the years, further showing the project’s massive popularity.

This large adoption and vibrant ecosystem make Prometheus both trustworthy and easy to integrate with. Whether you want to monitor a specific software component or integrate your monitoring stack with a third-party system, you are more likely than not to find an existing solution out there to help you out.

De facto standard and portability

Prometheus initially started out as a set of components that talked to each other to collect metrics, store them, query them, send alert notifications, or to write samples into a remote storage system. These components looked like:

With the growing adoption of Prometheus, users and vendors became interested in integrating with Prometheus in different ways to offer:

The result is a shift from viewing Prometheus as a set of specific component implementations, toward a set of interfaces that can be standardized and reimplemented in other systems.

The Prometheus team has done a lot of work toward standardizing these individual interfaces, along with building a suite of technical compliance tests and initiating a formal compatibility certification process that vendors will be able to go through to clarify their level of conformance to potential users.

As a result of this work, the various Prometheus interfaces have become de facto open standards within the monitoring landscape, and many players integrate with them. For example, Chronosphere makes use of Prometheus’ remote write interface to receive data from a Prometheus server, and it also offers a 100% PromQL-compatible querying interface in its hosted observability platform.

This kind of interoperability is great for the Prometheus ecosystem, as you can now find alternative, but compatible, implementations that offer different tradeoffs. For example, one vendor’s solution may be cheaper and simpler than another, while another might be more expensive and complex, but also more scalable. This leads to a larger ecosystem and a healthy competition that makes Prometheus better for everyone involved.

Using an open source monitoring standard is also great for architectural portability: You might start out with a native on-premises Prometheus setup and later decide to move parts of your setup to a compatible cloud service like Chronosphere, without having to re-architect, migrate, or relearn everything. It enables you to standardize on Prometheus-compatible monitoring while being able to freely switch out architectural components as needed.

Open source and open governance

In contrast to many other projects, Prometheus is not only open source, but also openly governed. Here’s what this means and why it matters.

Open source

First of all, Prometheus is open source and available under the permissive Apache 2.0 license.

This means that:

While many organizations have realized the benefits of basing their infrastructure on open source software over the last decade or two, Prometheus takes this one step further by adopting an open governance model.

Open governance

Despite being open source, many projects are still controlled by a single company, which frequently leads to conflicts of interest between the health of the project and that of the company. Here are some examples:

In contrast to this model, Prometheus is hosted under an open governance framework within the CNCF. This enables many companies and individuals to come together on neutral ground and to develop and maintain Prometheus, as well as care for its community collaboratively. While the CNCF owns the DNS domains, trademarks, and other primary assets of the project, the individual Prometheus team members make the technical and governance decisions around the project, working together in the open.

This structure ensures that Prometheus is not owned and controlled by a single company, and it also enables outsiders to participate and eventually even become Prometheus team members. Overall, this means that you can trust Prometheus to be around for a long time, with many stakeholders caring for the project’s health.

Conclusion

In this article, I explained why Prometheus is the natural choice for monitoring cloud native environments. On one hand, Prometheus excels on a technical level, due to its native cross-support with Kubernetes and its flexible PromQL query language. On the other hand, the project’s maturity, wide adoption, and open governance model have created long-term trust in the project and have built a large and growing community around the project.

Finally, with many of the project’s interfaces becoming a de facto industry standard, Prometheus-compatible implementations with different tradeoffs are springing up across the ecosystem. This gives users more choice than ever to choose a solution that suits their needs. 

While many organizations begin their cloud native observability journey with open source tools like Prometheus, they quickly run into major hurdles. The main challenges of running open source observability in-house are significant management overhead and tooling that is unreliable and slow. Chronosphere gives you the best of both worlds: a fully open source compatible solution that relieves the management overhead and delivers best-in-class availability and performance.


by Jess Lulka

It’s no secret that cloud native architectures and environments are complex. They’re composed of thousands of containers, microservices, applications, and infrastructure layers that all run simultaneously and depend on each other so that systems (both internal and external) stay online. These components also produce high cardinality data, where a metric’s labels can take a very large number of distinct value combinations, which makes data management challenging.

Cloud native’s complicated nature means that monitoring and observability solutions are a requirement to help effectively collect data, set alerts for issues, and provide insights to developers about overall system health. Prometheus was created with these goals in mind.

Prometheus is a cloud native, open source monitoring and alerting system, created by Matt T. Proud and Julius Volz with support from SoundCloud. It is directly supported by Kubernetes and was the second project hosted by the Cloud Native Computing Foundation (CNCF). The server is primarily written in Go, client libraries are available for other languages such as Java, Python, and Ruby, and the project is released under the Apache 2.0 license.

Read further to learn about Prometheus’ main features, what it can do, its main use cases, and best practices. 

Prometheus monitoring components 

The primary Prometheus server is a standalone server that runs as a monolithic binary with no dependencies. Its main features include: 

All of these components come together to provide organizations an open source monitoring option that is reliable, has quick alerting, and is straightforward to set up and deploy.

What Prometheus can monitor and metrics types

Cloud native is more distributed and fragmented compared to monolithic environments, so developers need monitoring solutions that can reliably support high cardinality at scale.

As a CNCF project, Prometheus provides a unified solution for metrics collection and storage and alert generation, which makes it easier to monitor and understand complex systems. It is designed to monitor infrastructure, application, Kubernetes, and business metrics within a cloud native environment, in a way that is easy for developers to install and start using within a few hours.

It uses metrics (or numerical measurements) to help developers understand how well their application or service is working. There are four main metrics types that developers can collect: counter, gauge, summary, and histogram. 
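The differences between the four types are easier to see in code. The following Python sketch is a toy model written for this article, not the API of any Prometheus client library. All class and method names here are assumptions, and the summary in particular is far simpler than real implementations, which typically track quantiles over a sliding time window.

```python
import bisect


class Counter:
    """A value that only ever goes up (for example, total requests served)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """A value that can go up and down (for example, requests currently in flight)."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations into cumulative buckets, plus a running sum and count."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # Buckets are cumulative: an observation counts toward every bucket
        # whose upper bound is greater than or equal to the value.
        first = bisect.bisect_left(self.upper_bounds, value)
        for i in range(first, len(self.upper_bounds)):
            self.bucket_counts[i] += 1
        self.sum += value
        self.count += 1


class Summary:
    """Keeps observations and computes quantiles on the client side."""

    def __init__(self):
        self.observations = []

    def observe(self, value):
        bisect.insort(self.observations, value)

    def quantile(self, q):
        # Simple index-based quantile over everything observed so far.
        if not self.observations:
            return float("nan")
        index = min(int(q * len(self.observations)), len(self.observations) - 1)
        return self.observations[index]
```

The practical distinction is where the quantile math happens. A histogram ships bucket counts that can be aggregated and turned into percentiles at query time, while a summary reports quantiles it has already calculated inside the application.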

How Prometheus works and its architecture 

The Prometheus architecture includes a main server to scrape and store time series data, client libraries for application code instrumentation, a push gateway for short-lived jobs, special purpose exporters, and an alertmanager.

Prometheus architecture. Diagram adapted from https://prometheus.io/docs/introduction/overview/.

Prometheus uses multiple exporters to collect data across different sources. Developers can use exporters to get system, application, service, or custom metrics within their cloud native environment. Exporters are available for databases, hardware, logging, issue tracking, messaging systems, storage, HTTP, FinOps, and APIs.

Once the server scrapes metrics from exporters and instrumented jobs, it stores all collected data samples locally. It then applies rules to either aggregate and record new time series from existing data or generate alerts.

What is PromQL?

PromQL is how Prometheus users can query their collected infrastructure and deployment data. It is designed to work with time series databases and provides time-related queries. 

Developers also use PromQL to manage Prometheus alerting. They write alerting rules as PromQL expressions; the Prometheus server evaluates them and sends firing alerts to the Alertmanager, which handles the notifications.

Three basic queries developers can use to get started with Prometheus are: 

  1. Request rates: How many requests a service gets.
  2. Error rate percentages: Amount of errors a service encounters (relative to total number of requests).
  3. Service latency percentiles: How fast services respond to user requests.

With queries, users can display results either as a graph or tabular data to get a look at how specific services are performing and compare data against internal service-level objectives (SLOs).
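The three starter queries boil down to simple calculations, sketched here in Python. These helpers are simplified illustrations of the ideas behind PromQL's `rate()` and `histogram_quantile()` functions, not reimplementations of them: they ignore counter resets and other edge cases. The function names and sample numbers are made up.

```python
def per_second_rate(samples):
    """Request rate: average per-second increase of a counter over a window.

    `samples` is a list of (timestamp_seconds, counter_value) pairs in time order.
    """
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def error_percentage(error_rate, total_rate):
    """Error rate: errors as a percentage of all requests."""
    return error_rate * 100 / total_rate


def bucket_quantile(q, buckets):
    """Latency percentile estimated from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by bound.
    Assumes observations are spread evenly within each bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if in_bucket == 0:
                return upper_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]


# A counter that grew from 100 to 400 requests in 300 seconds.
request_rate = per_second_rate([(0, 100), (300, 400)])

# 100 requests: 50 took at most 0.1s, 90 at most 0.5s, all at most 1.0s.
p90 = bucket_quantile(0.9, [(0.1, 50), (0.5, 90), (1.0, 100)])
```

Here `request_rate` works out to one request per second and `p90` to roughly 0.5 seconds, the kind of numbers you would then compare against an SLO.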

When to use Prometheus 

Designed for cloud native environments, which have dynamic service-oriented architectures, Prometheus works best for organizations that collect purely numeric time series data, and it supports multi-dimensional metrics data collection.

It’s also designed to be reliable. Each Prometheus server is standalone, with no network storage or remote service dependencies, so it runs even if other infrastructure is down. It uses pull-based metrics collection by design, so where an exporter already exists for a system, you can often collect its metrics without modifying or redeploying existing code or applications. It also doesn’t require extensive infrastructure to run in your cloud native environment.
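To make the pull model concrete: a target only has to publish its current metric values over HTTP, and the Prometheus server decides when to scrape them. The sketch below, using only the Python standard library, renders one metric in the Prometheus text exposition format and serves it at /metrics. The metric name, label, and value are hypothetical, and a real service would normally use an official client library rather than hand-rolled formatting.

```python
from http.server import BaseHTTPRequestHandler


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs. Label values are not
    escaped here, so this sketch assumes they contain no quotes or backslashes.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current metric values whenever Prometheus scrapes /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Hypothetical metric; a real service would report live values.
        body = render_metric(
            "jobs_processed_total", "Jobs processed.", "counter",
            [({"queue": "email"}, 3)],
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Passing MetricsHandler to http.server.HTTPServer and calling serve_forever() would expose the endpoint; the target never needs to know where the Prometheus server lives.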

This makes Prometheus ideal for when you need a metrics/time series based alerting and monitoring system for cloud native infrastructure that doesn’t require a lot of hardware investment. For one server, developers need (at minimum) 2 CPU cores, 4 GB of memory, and 20 GB of free disk space.

When not to use Prometheus

Prometheus is designed for reliability first and foremost. It lets you gather statistics and time series about your systems, but the data may not be as detailed or complete as you need, depending on the metrics that you collect. For example, for a use case that demands full accuracy, like per-request billing, Prometheus data might not be specific enough.

Prometheus is also not designed to ingest logs or act as a primary dashboarding/visualization tool. For both of these use cases, developers will need different tools. Additionally, while it can run smoothly with up to 10 million active time series, it is not designed to be a long-term data storage solution and does not automatically scale with environments.

What are some Prometheus best practices?

As with any open source tool, organizations should have best practices in place to ensure smooth operation. For Prometheus specifically, make sure to:

Select the best standard exporter

Because Prometheus relies on exporters to expose metrics from third-party systems, developer teams should research which exporter will work best for their foundational data collection and reporting needs, as the choice can play a large role in quality of data and overall monitoring strategy. Use feature overviews, recent updates, user reviews, and security alerts to identify the best fit. For custom metrics, you’ll have to instrument your code manually to generate the business metrics you want to collect.

Label carefully to avoid confusion and extra storage

Use the exporter documentation to ensure that any collected data has all the necessary context, and strive for consistent labeling across monitoring targets. Every unique combination of label values creates a separate time series that consumes resources, so keep the labels you actually need and use to oversee your cloud native environment, and drop the ones that only take up storage.
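The storage cost of labels comes from simple multiplication: in the worst case, one metric produces a time series for every combination of its label values. The short sketch below works through that arithmetic; the label names and counts are made up for illustration.

```python
from math import prod


def worst_case_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series for one metric.

    Each key is a label name and each value is the number of distinct
    values that label can take. The bound is the product of those counts.
    """
    return prod(label_value_counts.values())


# Hypothetical labels on a single request counter.
baseline = worst_case_series({"method": 4, "status": 5, "instance": 10})

# Adding one high-cardinality label multiplies the total.
with_user_id = worst_case_series(
    {"method": 4, "status": 5, "instance": 10, "user_id": 10_000}
)
```

Here baseline is 200 series, while with_user_id is 2,000,000, which is why unbounded values such as user IDs are usually kept out of labels.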

Set actionable alerts to reduce troubleshooting time

Monitoring strategies require planning and documentation so that developers and engineers know what’s happening when they get an alert. Before implementing Prometheus, determine which events and services are critical to monitor, what thresholds should trigger an alert, and what relevant information each alert should include.

Know when you need to scale

Prometheus is an easy way to start open source monitoring for cloud native and Kubernetes environments, but it does require technical and developer resources to manage over time and has scalability limitations. Additionally, it can get complex for large-scale enterprise environments that require data duplication or have spread-out infrastructure. Be sure to have indicators in place that tell you when it is time to implement Prometheus’ Federation feature, use functional sharding, or work with a vendor.
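As a small sketch of what functional sharding implies in practice, the example below assumes each Prometheus instance scrapes a fixed set of services, so any dashboard or alert first has to work out which instance holds the data it needs. The instance URLs and service names are hypothetical.

```python
# Hypothetical shard map: each Prometheus instance scrapes specific services.
SHARDS = {
    "http://prometheus-payments:9090": {"checkout", "billing"},
    "http://prometheus-platform:9090": {"auth", "gateway"},
}


def instance_for(service: str) -> str:
    """Return the Prometheus instance that scrapes the given service."""
    for url, services in SHARDS.items():
        if service in services:
            return url
    raise KeyError(f"no Prometheus instance scrapes {service!r}")


def instances_for(services) -> list[str]:
    """Return every instance a dashboard must query to cover these services."""
    return sorted({instance_for(s) for s in services})
```

A dashboard covering both checkout and auth has to query two instances, which is the point where teams typically reach for federation or a hosted backend.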

Before Chronosphere, Abnormal Security had trouble with metrics availability and scalability; their 10-12 million active metrics were on track to hit 50 million. Anticipating this rapid growth and needing a service that could handle so many metrics, they reached out to Chronosphere to help manage their Prometheus instances. In doing so, they saw increased stability and reliability, along with reduced costs.

Interested in more about the connection between Prometheus and cloud native? Learn five reasons why the two work best together.