Prometheus is the open source metrics monitoring platform recommended by the Cloud Native Computing Foundation (CNCF). Designed to be easy to get started with, it ships as a single binary that handles ingestion, storage, and querying. Prometheus offers many widely adopted features, including its text exposition format, an efficient metric store, and a native query language called the Prometheus Query Language (PromQL). For metrics stored within Prometheus, PromQL is the main way to query and retrieve the results you are looking for. Learn more in our “An Introduction to PromQL” blog.
Prometheus has four primary metric types that can be queried via PromQL: counters, gauges, histograms, and summaries. These metric types cover the requirements of most use cases, and are supported in Prometheus’ official client libraries: Go, Java, Ruby, and Python. Note: each client library has its own documentation on how to use each metric type with its API.
The remainder of this blog provides an overview of the four primary metric types, including when and how to use them.
The four metric types
Counters are a fundamental way of tracking how often an event occurs within an application or service. They are used to track and measure metrics with continually (i.e. monotonically) increasing values, which get exposed as time series. An example of a counter metric is http_requests_total, which reports the running total of HTTP requests to an endpoint on an application or service. The rate() function is applied to counters at query time to calculate the per-second rate of requests over a given time window.
Counters are running or cumulative counts: the metric client library keeps an ever-increasing total sum of the number of events over the lifetime of the application. These events can be periodically measured by having Prometheus scrape the metrics endpoint exposed by the client library.
Running counts are highly reliable in that they allow for interpolation across any missed sample collections, resulting in close approximations of the total sum of values at a point in time. However, if you want to aggregate a running sum of many counts, you first need to apply the rate() function to convert each count into a per-second rate of change, and then combine those rates with sum().
The below graph shows an example of a running or cumulative counter metric with the rate() function applied to a single count.
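To make counter semantics concrete, here is a minimal pure-Python sketch (not the actual Prometheus client library API — the class name and methods are illustrative). It shows the two defining properties discussed above: a counter only ever increases, and rate() is conceptually the increase between two scraped samples divided by the elapsed time.

```python
class Counter:
    """Minimal sketch of a Prometheus-style counter: its value can only go up."""

    def __init__(self):
        self._value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self._value += amount

    @property
    def value(self):
        return self._value


# Simulate two scrapes of http_requests_total, 15 seconds apart.
http_requests_total = Counter()
http_requests_total.inc(120)                      # cumulative total at first scrape
sample_1 = (0.0, http_requests_total.value)       # (timestamp, value)
http_requests_total.inc(30)                       # 30 more requests arrive
sample_2 = (15.0, http_requests_total.value)

# rate() is conceptually the increase between samples over the elapsed time.
per_second = (sample_2[1] - sample_1[1]) / (sample_2[0] - sample_1[0])
print(per_second)  # 2.0 requests/second
```

In a real deployment the client library exposes the cumulative value on a /metrics endpoint and Prometheus computes rate() server-side at query time; the arithmetic is the same.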
Gauges are used to periodically take measurements or snapshots of a metric at a single point in time. A gauge is similar to a counter; however, its value can arbitrarily increase or decrease over time (e.g. CPU utilization or temperature).
Gauges are useful when you want to query a metric that can go up or down, but don’t need to know the rate at which it is changing. Note: the rate() function does not work with gauges, as rates can only be applied to metrics that continually increase (i.e. counters).
The below graph shows an example of a gauge metric measuring CPU utilization as a percent over time.
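Continuing the illustrative sketch from above (again, not the real client library API), a gauge differs from a counter only in that its value may be set, raised, or lowered freely; each scrape simply reports the latest snapshot.

```python
class Gauge:
    """Minimal sketch of a Prometheus-style gauge: its value can move up or down."""

    def __init__(self):
        self._value = 0.0

    def set(self, value):
        self._value = value

    def inc(self, amount=1.0):
        self._value += amount

    def dec(self, amount=1.0):
        self._value -= amount

    @property
    def value(self):
        return self._value


# Track CPU utilization as a percentage; unlike a counter, it may decrease.
cpu_utilization = Gauge()
cpu_utilization.set(42.5)      # snapshot at one scrape
cpu_utilization.set(37.0)      # later snapshot: lower is perfectly valid
print(cpu_utilization.value)   # 37.0
```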
Histograms sample observations by their frequency or count, placing the observed values in pre-defined buckets. If you don’t specify buckets, the Prometheus client library will use a set of default buckets (e.g. the Go client library uses .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10). These buckets are used to track the distribution of an attribute over a number of events (e.g. event latency). Note: the default buckets can be overridden if more or different values are needed, but it’s important to note the potential increase in costs and/or cardinality when doing so, as each bucket has a corresponding unique time series.
Overall, histograms are known to be highly performant as they only require a count per bucket, and can be accurately aggregated across time series and instances (provided they have the same buckets configured). This means that you can accurately aggregate histograms across multiple instances or regions without having to emit additional time series for aggregate views (unlike computed percentile values with summaries).
There are a couple of downsides to histograms, the main one being that you need to pre-define the boundary values for your buckets. Because code modifications are needed to change the buckets, you need to think about the expected latency ranges ahead of time and configure the buckets accordingly. Additionally, if you want to read your histograms as percentiles or quantiles to better understand the distribution, you need to apply the histogram_quantile() function to estimate the requested quantile.
The below graph measures the event count of a histogram at various latency buckets (le of 0.002, 0.004, and 0.008 seconds) over time. (Note: in ascending order, each bucket has a cumulative count of the events that fall into that bucket plus all preceding buckets.)
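The cumulative-bucket behavior described in the note above can be sketched in a few lines of pure Python (illustrative only, not the client library API). Each observation lands in the smallest bucket whose upper bound (le) is greater than or equal to the value, and the exposed counts are cumulative, so the final +Inf bucket always equals the total observation count.

```python
import bisect


class Histogram:
    """Sketch of a Prometheus-style histogram with cumulative ("le") buckets."""

    def __init__(self, buckets):
        # An implicit +Inf bucket catches everything above the largest bound.
        self.upper_bounds = sorted(buckets) + [float("inf")]
        self._per_bucket = [0] * len(self.upper_bounds)
        self.count = 0
        self.total = 0.0

    def observe(self, value):
        # bisect_left finds the first bound >= value, matching "le" semantics.
        i = bisect.bisect_left(self.upper_bounds, value)
        self._per_bucket[i] += 1
        self.count += 1
        self.total += value

    def cumulative(self):
        """Expose counts the way Prometheus does: each bucket includes all smaller ones."""
        out, running = [], 0
        for ub, c in zip(self.upper_bounds, self._per_bucket):
            running += c
            out.append((ub, running))
        return out


latency = Histogram(buckets=[0.002, 0.004, 0.008])
for v in [0.001, 0.003, 0.003, 0.005, 0.009]:
    latency.observe(v)
print(latency.cumulative())
# [(0.002, 1), (0.004, 3), (0.008, 4), (inf, 5)]
```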
This second graph shows the histogram as a quantile by applying the histogram_quantile() function to measure event latency at various quantiles (P50, P75, P99).
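The estimation that histogram_quantile() performs can be sketched as follows: find the bucket that contains the target rank, then linearly interpolate within it. This is a simplified illustration of the idea, not Prometheus’ actual implementation, and it assumes the cumulative (upper_bound, count) pairs produced by the histogram sketch above.

```python
def histogram_quantile(q, cumulative_buckets):
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) pairs,
    by linear interpolation inside the bucket containing the target rank."""
    total = cumulative_buckets[-1][1]
    rank = q * total
    lower_bound, prev_count = 0.0, 0
    for upper_bound, count in cumulative_buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Quantile falls in the +Inf bucket: fall back to the largest
                # finite bound, since there is no upper edge to interpolate to.
                return lower_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, prev_count = upper_bound, count


# Cumulative buckets matching the earlier example: 5 events total.
buckets = [(0.002, 1), (0.004, 3), (0.008, 4), (float("inf"), 5)]
p50 = histogram_quantile(0.50, buckets)
print(p50)  # ~0.0035: rank 2.5 falls 75% of the way through the 0.002-0.004 bucket
```

Because the true observed values inside a bucket are unknown, the result is an estimate; this is why bucket boundaries should bracket the latencies you care about.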
Summaries are similar to histograms in that they also track distributions of an attribute over a number of events, but they are different in that they expose quantile values directly (i.e. on the client side at collection time vs. on the Prometheus monitoring service at query time). They are most commonly used for monitoring latencies (e.g. P50, P90, P99), and are best for use cases where an accurate latency value or sample is desired without configuration of histogram buckets.
In general, summaries are not recommended for use cases where a histogram can be used instead. This is because quantiles cannot be aggregated, and it can be difficult to deduce what timeframe the quantiles cover. Note: this is defined by each client library independently (e.g. Prometheus Go client library uses 10 minutes by default).
Once client side quantiles are calculated, they cannot be merged with a quantile value from another instance. This means that summaries cannot be aggregated with any level of accuracy across time series. For example, the average of two P95 values does not equal the P95 for the combined set of values.
In the below graph, event latency is measured by estimating pre-defined quantiles (P50, P75, P99) from a set of client side observations.
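The non-mergeability problem described above is easy to demonstrate with a small sketch. The quantile helper below is an illustrative nearest-rank calculation over raw observations (standing in for what a client library computes at collection time), not the real client library API; the instance names and latency values are made up.

```python
import math


def quantile(values, q):
    """Nearest-rank quantile over raw observations (computed client-side)."""
    ordered = sorted(values)
    idx = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[idx]


# Latencies (seconds) observed by two separate instances of a service.
instance_a = [0.010, 0.012, 0.015, 0.020, 0.500]   # one slow outlier
instance_b = [0.011, 0.013, 0.014, 0.016, 0.018]

p95_a = quantile(instance_a, 0.95)                  # per-instance P95
p95_b = quantile(instance_b, 0.95)
averaged = (p95_a + p95_b) / 2                      # naive "merge" of quantiles
combined_p95 = quantile(instance_a + instance_b, 0.95)  # true P95 of all data

# Averaging the two P95 values does not recover the combined P95.
print(averaged, combined_p95)
```

With a histogram, Prometheus would instead sum the bucket counts from both instances and estimate the quantile at query time, which is why histograms aggregate accurately and summaries do not.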
Putting it all together
The above overview provides an introduction to the primary Prometheus metric types, including some recommendations on when and how to use them. To learn more about how to implement each type, refer to the documentation for the various Prometheus client libraries for Go, Java, Ruby, and Python. Prometheus also provides more in-depth documentation on implementing these metric types.
To recap, here is a quick summary of each Prometheus metric type that we discussed:
- Counters are used for tracking continually increasing counts of events, and are queried using the rate() function to measure the per-second rate at which events occur over a given time window.
- Gauges are used to provide the current state of a metric that can arbitrarily increase or decrease over time, such as CPU utilization.
- Histograms are used for measuring the distribution of observations and putting them into pre-defined buckets. They are highly performant, and values can be accurately aggregated across both windows of time and numerous time series. Quantile and percentile calculations are done on the server side at query time.
- Summaries are used for monitoring latencies, and are best for use cases where an accurate latency value is desired without configuration of histogram buckets. They cannot accurately perform aggregations or averages across quantiles, and can be costly in terms of required resources. Calculations are done on the application or service client side at metric collection time.
At Chronosphere, we provide a Prometheus-native, fully PromQL-compatible cloud monitoring solution for metrics, along with optimized query and graphing functionality that works with the four primary metric types. If you are interested in learning more, please reach out to email@example.com or request a demo.