Open source monitoring and metrics landscape

on July 22nd 2021

Collecting, managing, and understanding metrics is an essential part of running any modern, complex application. As with any active technical ecosystem, there is a proliferation of competing open source monitoring standards. A handful emerge as the most popular solutions and, slowly, the community converges on a standard that most projects follow in some way.

This blog post aims to help you navigate what’s available, what piece of the ecosystem puzzle an option fills, and crucially, how interoperable they are with each other.

There are three main components of an open source metrics solution:

  1. Instrumentation libraries and/or standalone clients that codify and standardize exposition and publishing of metrics.
  2. Metrics collection applications that discover and collect telemetry from monitored systems via pull or push.
  3. Metrics storage and aggregation for long-term storage of telemetry, typically using a time series database (TSDB) and aggregation service or batch job.

Setting the stage

Before beginning, it’s useful to set some context on recent developments in the ecosystem. That’s not to say that there haven’t been decades of development in metrics tooling, but the past ten years were particularly active.

The rise of Prometheus as the open source monitoring standard

Created in 2012, Prometheus has been the most popular option for cloud-native observability since 2015. A central part of Prometheus’ design is its text metric exposition format, called the “Prometheus exposition format,” stable since 2014. Its creators took special care to make the format easy to generate and ingest by machines, but also understandable by humans. As of 2020, there are more than 700 publicly listed exporters, an unknown number of unlisted exporters, and thousands of native library integrations using the format. There are dozens of compatible backends alongside various open source projects and companies that support the format in some way. 
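To make the format concrete, here is a minimal sketch of how a counter could be rendered as text. It uses only the Python standard library rather than an official client library, and the metric name, help text, and label values are invented for illustration. It covers label-value escaping only, so treat it as a reading aid, not a complete implementation of the format.

```python
def escape_label_value(value: str) -> str:
    # Label values escape backslash, double quote, and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: list) -> str:
    """Render one counter family; samples is a list of (labels dict, value)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Labels are sorted here only to keep the output deterministic.
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric and labels, chosen only for the example.
text = render_counter(
    "http_requests_total",
    "Total HTTP requests handled.",
    [
        ({"method": "get", "code": "200"}, 1027),
        ({"method": "post", "code": "500"}, 3),
    ],
)
print(text)
```

Each sample ends up as one line of the form name{label="value"} value, which is what makes the format easy to read by eye and to parse by machine.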

Before Prometheus emerged, there was Graphite, and before Graphite, tools like Nagios were popular (and still are). These tools work by periodically sending probes to specific endpoints. This approach generally required custom code to make the probe requests and returned only yes/no answers, often without the context needed for debugging.

Open source monitoring client and instrumentation libraries

OpenMetrics

OpenMetrics is a Cloud Native Computing Foundation (CNCF) sandbox project that works on a standard metrics format to bring all metrics and metrics types to the Prometheus exposition format. Any tool that wants to be considered Prometheus compatible must now pass the OpenMetrics test suite. The project hopes that standardizing the metric format across ingest, storage, and query will help the metrics community achieve backward compatibility and greater overall metrics adoption. As of 2020, dozens of exporters, integrations, and ingesters both use and preferentially negotiate OpenMetrics.

Similar to OpenTelemetry, the CNCF Observability technical advisory group (TAG) recommends that projects using non-standard exposition formats adopt OpenMetrics and Prometheus’ text exposition format.

Micrometer

Unsurprisingly, there are always many proposed standards. Micrometer is a standardized metrics client library project from Pivotal, used mostly in JVM-based software. It takes a different approach from OpenMetrics, working as an abstraction layer between applications and dozens of monitoring systems rather than as an underlying standard. Micrometer can now also expose metrics in the Prometheus text format for collection.

Open source metrics data collection

Graphite

Graphite collects, stores, and visualizes metrics data using a handful of different open source components. It’s older than Prometheus but still popular, with a well-established ecosystem around it. Graphite uses the “push” model: applications send metrics data to a Graphite “Carbon” daemon, which stores the data in the “Whisper” database. You can then visualize the data in either the Graphite web application or Grafana, with “Graphite-Web” acting as the query service.
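As a rough illustration of the push model, the sketch below formats a data point in Carbon's plaintext protocol, which is one line per sample in the form "path value timestamp". The metric path and the host name in the comment are hypothetical. Port 2003 is Carbon's default plaintext port, but your deployment may differ.

```python
import socket
import time
from typing import List, Optional


def carbon_line(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Format one sample as '<path> <value> <unix timestamp>\\n'."""
    ts = int(time.time()) if timestamp is None else timestamp
    return f"{path} {value} {ts}\n"


def send_to_carbon(lines: List[str], host: str, port: int = 2003) -> None:
    """Push already formatted lines to a Carbon daemon over TCP."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# A fixed timestamp keeps the example output reproducible.
line = carbon_line("servers.web01.requests.count", 42, timestamp=1626912000)
print(line, end="")

# With a reachable Carbon daemon (hypothetical host name), you could push it:
# send_to_carbon([line], host="graphite.example.internal")
```

Note that the application decides when to send, which is the defining trait of a push-based system.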

StatsD

StatsD aims for simplicity and acts as a network daemon that listens for statistics, like counters and timers, sent over UDP or TCP, and sends aggregates to one or more pluggable backend services. StatsD is another example of a push-based collector, listening for compatible messages sent from client applications. Once the data is aggregated (for example, by the StatsD daemon or Statsite), you can publish it to a Graphite “Carbon” processor or another compatible backend.
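The StatsD wire format is simple enough to sketch by hand: a counter is sent as "name:value|c" and a timer as "name:value|ms". In the example below, a local UDP socket stands in for the StatsD daemon so the snippet runs without one installed. The metric names are made up, and a real daemon conventionally listens on port 8125.

```python
import socket


def statsd_counter(name: str, value: int = 1) -> bytes:
    # Counter: increment 'name' by 'value'.
    return f"{name}:{value}|c".encode("ascii")


def statsd_timer(name: str, millis: float) -> bytes:
    # Timer: a duration in milliseconds.
    return f"{name}:{millis}|ms".encode("ascii")


# Stand-in for a StatsD daemon: a UDP socket on an OS-assigned local port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2)
daemon_addr = receiver.getsockname()

# The client fires datagrams and does not wait for any reply.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(statsd_counter("checkout.orders"), daemon_addr)
sender.sendto(statsd_timer("checkout.latency", 320), daemon_addr)

received = [receiver.recvfrom(1024)[0] for _ in range(2)]
sender.close()
receiver.close()
print(received)
```

Because UDP is fire-and-forget, a client like this adds very little overhead to the application, at the cost of possibly losing samples.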

Prometheus

Prometheus (a CNCF project) is more than a metrics format. It also handles the instrumentation of applications with client libraries, alongside storing, querying, and alerting on metrics data. It helps visualize the data it stores with a basic built-in visualization web application and by integrating with Grafana. It popularized the “pull model” in place of the “push model” more commonly used by the metrics tools that came before it. With this model, Prometheus pulls metrics from sources over HTTP instead of sources pushing to Prometheus. Prometheus also represents metrics differently from Graphite and StatsD, which generally represent a metric as a name, value, and timestamp. Prometheus uses a more complex but more flexible format that adds “labels” carrying extra metadata about a metric.
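The pull model can be sketched with nothing but the Python standard library: the application exposes a /metrics endpoint over HTTP, and a collector fetches it on its own schedule. In this sketch, a single urllib request plays the role of one Prometheus scrape. The metric name and labels are invented, and a real application would normally use an official client library instead of a hand-written handler.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metrics; the labels distinguish two series of one metric.
METRICS_BODY = (
    "# TYPE jobs_processed_total counter\n"
    'jobs_processed_total{queue="email"} 12\n'
    'jobs_processed_total{queue="billing"} 7\n'
)


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = METRICS_BODY.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# The monitored application: serves metrics on an OS-assigned local port.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The collector: one scrape over HTTP, bypassing any configured proxy.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
url = f"http://127.0.0.1:{server.server_port}/metrics"
with opener.open(url, timeout=5) as response:
    scraped = response.read().decode("utf-8")

server.shutdown()
server.server_close()
print(scraped)
```

Here the application never needs to know where the collector lives. It only has to answer requests, which is the main operational difference from the push examples.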

Telegraf

Telegraf is a server agent for collecting, processing, aggregating, and writing metrics. It’s plugin-driven and has four distinct plugin types, one for each of the four tasks it handles. It’s another example of a push-based collector, but it receives metrics from any of the collection plugins you enable.

Open source metrics visualization

Grafana

Grafana began as a fork of Kibana (which is no longer open source and instead uses a source-available license) and has grown into a popular open source project for visualizing metrics and building dashboards. Most recently, Grafana Labs changed the Grafana license from Apache v2 to the less permissive, copyleft AGPLv3 license.

Chronograf

Chronograf (no relation to Chronosphere!) is a user interface specifically designed to visualize InfluxDB time series. Chronograf is the primary way that Influx users create dashboards, alerts, and automation rules for their metrics.

Open source time series databases for long term metrics storage

When they need to store metrics data for more than a few weeks, most companies look to a time series database for long-term, reliable, and affordable storage. Metrics databases are typically optimized for short-term storage, and a handful of projects aim to fill that gap by offering scalable and performant long-term storage and querying.

Many of the projects work in similar ways, offering Prometheus remote storage compatibility, high availability, data optimized for long-term storage, a choice of storage backends, and the ability to query the same data from any instance in a cluster. The differences lie in the internal implementation details and in how you fit the solution into your existing architecture.

M3 is an open-source project originally created by Uber that focuses on high performance and compatibility, supporting a wide variety of ingestion formats by default. Apache v2 license.

Cortex is an open-source CNCF incubation project originally created by Weaveworks and used by many different businesses and sectors as well as the basis of a handful of SaaS products. Apache v2 license.

Thanos is an open-source CNCF incubated project originally created by Improbable and used by many different businesses and sectors. Apache v2 license.

VictoriaMetrics is a mostly open source database. It is compatible with Prometheus and other ingestion sources but uses many custom components, such as “MetricsQL,” and offers different features in its community and enterprise versions. Apache v2 license.

InfluxDB is an open source time series platform that many use for metrics data. Coupled with Telegraf, Influx is a good choice for long-term storage. It uses an open core model, meaning that you have to pay for certain features, and it has its own query language, called “InfluxQL.” Community edition MIT license, enterprise license for cluster edition.

Compatibility equals flexibility

The open source monitoring ecosystem is rapidly converging on Prometheus compatibility, but this doesn’t mean there is a lack of options; quite the opposite. Think of Prometheus as the core or starting point of your monitoring setup, and build upon it as your needs grow. If you ensure that whatever you use is fully Prometheus compatible, you have nothing to lose and only things to gain.
