Top 3 queries to add to your PromQL cheat sheet

This is a guest article written by Prometheus co-founder Julius Volz of PromLabs in partnership with Chronosphere. Prometheus is an open source project hosted by the Cloud Native Computing Foundation under open governance. PromLabs is an independent company created by Julius Volz with a focus on Prometheus training and other Prometheus-related services.

Julius Volz | Co-Founder, Prometheus

The Prometheus query language (PromQL) is a key feature of the Prometheus monitoring system: It powers all major monitoring use cases, from dashboarding and ad-hoc querying to alerting and automation. PromQL allows you to select, aggregate, correlate, and otherwise transform your time series data in flexible ways to get the exact answer you want.

However, PromQL is also a complex language that works differently from most other observability vendors’ query languages that you may be used to. That can make it daunting when you are first approaching the language, especially if you are transitioning from a different platform to Prometheus.

Fortunately, there are a few common Prometheus query patterns that you will see again and again in PromQL, and that go a long way toward using the language effectively on a day-to-day basis.

If you are monitoring any kind of request-serving system, you will usually want to measure and calculate the system’s key health indicators, also called service level indicators (SLIs). The most common SLIs to track for these systems are:

  1. request rates,
  2. error rate percentages, and
  3. service latency percentiles.

In this blog post, I will help make things less daunting for those new to Prometheus: I’ll dive in and explain three example PromQL queries that allow you to compute these service indicators. This should give you a good first impression of the PromQL query language, as well as help you get started on the right foot with monitoring your services.

1. Aggregate request rates

One of the key health indicators you want to measure about a service is how many requests it currently serves. Prometheus usually measures request counts using cumulative counter metrics, which continually go up for every handled request.

Instead of displaying the raw cumulative counter metric, you will want to figure out a current request rate based on how fast the counter goes up per second in a given window of time. To achieve that, you apply the PromQL rate() function to a set of counter metrics:

rate(api_requests_total[5m])

The rate() function calculates the per-second increase for each input counter time series, averaged over the time range you provide. In this case, the averaging window is five minutes, since the “[5m]” part selects a five-minute range of data from the set of counter time series with the metric name “api_requests_total”.

If you have 1000 counter series with the metric name “api_requests_total” that partition your API requests by various label dimensions, you will now get 1000 output rates, one for each input series. The label dimensions that partition the metric name into many time series can be aspects like the following (see the sample series after this list):

  • The HTTP status code
  • The HTTP method
  • The HTTP path
  • The service instance that served the request
  • The job (service) that the request happened in
  • Any other feature related to the request
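
For illustration, the stored time series behind such a metric might look like the hypothetical examples below (the label names and values here are made up, purely to show how one metric name fans out into many series):

api_requests_total{job="api", instance="10.1.2.3:8080", method="GET", path="/users", status="200"}  10452
api_requests_total{job="api", instance="10.1.2.3:8080", method="GET", path="/users", status="500"}  23
api_requests_total{job="api", instance="10.1.2.4:8080", method="POST", path="/orders", status="201"}  877

Running rate(api_requests_total[5m]) over this data returns one per-second rate for each of these label combinations.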

This level of dimensional insight in Prometheus is great, but you will often want to show an overall request rate for your service, with some dimensions aggregated away. This is where the sum() aggregator comes in: It sums the values of many time series into fewer series, but still preserves the dimensions that you want to see in the result.

For example, if you want to show a total request rate aggregated over all service instances and status codes, you could query for:

sum without(instance, status)(rate(api_requests_total[5m]))

The “without()” clause belongs to the sum() aggregator and allows you to list the underlying labels to remove in the aggregation. You can also turn this around and use a “by()” clause to list the exact labels you want to keep. Then you would end up with a query like this, which mentions the opposite set of labels to achieve a similar result:

sum by(method, path, job)(rate(api_requests_total[5m]))
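
Assuming the hypothetical label set from the sample series above, both forms of the query would return one rate per remaining label combination, for example:

{method="GET", path="/users", job="api"}  3.2
{method="POST", path="/orders", job="api"}  0.7

The values here are illustrative; the point is that the “instance” and “status” labels have been aggregated away, while “method”, “path”, and “job” survive.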

You will see this pattern of a sum() aggregator wrapping a rate() function every day in PromQL, and you can easily use this pattern to compute your aggregated request rates.

2. Measure error rate ratios and percentages

You will also want to track the percentage of errors that your service encounters, relative to the total number of requests. You can achieve this in PromQL by first selecting both the error rates and the total rates, and then dividing those sets of rates by each other.

For example, you could count any requests that had a response status code starting with a “5” (indicating a server-side error) as an error you want to track. You could then query for the error rate ratio like this:

sum without(status)(rate(api_requests_total{status=~"5.."}[5m])) / sum without(status)(rate(api_requests_total[5m]))

There are a lot of things going on in this query already. Here’s a quick explainer:

  • First, the rate() function computes the individual rates over all failed requests, as well as over all requests (whether successful or not).
  • Second, you use the sum() aggregator to aggregate the “status” label away, so you end up with just a total rate for each combination of method, path, and instance. This leads to compatible labels between the sets of series that you want to divide by each other.
  • Third, you use the “/” division operator to divide both sets of time series by each other to generate a set of correlated ratios.

By default, binary operators such as “/” automatically match up time series between their two operands when the series have identical label sets. So in this case, you would divide the error rate for each dimensional combination of method, path, and instance, by the equivalent total rate. The output would be a list of time series that represents each individual error rate ratio.
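
As a small sketch of this matching, using the same hypothetical labels as before, the two sides of the division and the resulting ratio might look like this:

error rate:  {method="GET", path="/users", instance="10.1.2.3:8080", job="api"}  0.2
total rate:  {method="GET", path="/users", instance="10.1.2.3:8080", job="api"}  4.0
ratio:       {method="GET", path="/users", instance="10.1.2.3:8080", job="api"}  0.05

Each output sample is simply the error rate divided by the total rate for the series with the identical label set, so this hypothetical endpoint would show a 5% error rate.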

If you want to get from a ratio (between 0 and 1) to a percentage (between 0 and 100), you can multiply the entire query by 100:

sum without(status)(rate(api_requests_total{status=~"5.."}[5m])) / sum without(status)(rate(api_requests_total[5m])) * 100

This kind of error rate percentage calculation is very common for request-serving systems, and you can use it for your own services as well.

3. Calculate latency percentiles from histograms

A third common aspect you will want to track for your services is how fast they respond to user requests. Because you may have some slow requests, some fast ones, and all kinds of latencies in between, you will usually not just track an average latency number, but an entire distribution. In Prometheus, you can use a histogram metric to track this information.

A histogram counts all your observed request latencies into a set of ranged buckets (exposed as one counter series per bucket) to give you a discrete idea of how many requests your service has handled for each latency range.
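
To make this concrete, a latency histogram might expose cumulative bucket counters like the hypothetical ones below (the bucket boundaries are made up for this example):

api_request_duration_seconds_bucket{le="0.1"}   812
api_request_duration_seconds_bucket{le="0.5"}   954
api_request_duration_seconds_bucket{le="1"}     987
api_request_duration_seconds_bucket{le="+Inf"}  1000

Each bucket counts all observations up to its upper boundary, so the counts are cumulative across buckets: of 1000 observed requests, 812 completed within 100ms and 954 within 500ms.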

For SLIs, you will often want to compute percentiles from histograms. For example, computing the 90th percentile latency answers the vital question: “What is the latency within which 90% of my requests complete?”

Without diving too deeply into the inner workings of Prometheus histograms, let’s explore how you can approximate a percentile from a histogram in PromQL using the histogram_quantile() function. Quantiles are just percentiles, with the only difference being that you designate them on a scale from 0 to 1 instead of 0 to 100. So the 0.9 quantile is equivalent to the 90th percentile.

In its simplest form, you can calculate an individual 90th percentile latency for each label combination that the histogram tracks like this, with the first parameter indicating the quantile you want to compute (0.9) and the second parameter being the histogram:

histogram_quantile(0.9, rate(api_request_duration_seconds_bucket[5m]))

The rate() function around the input histogram is critical here: Histogram buckets are cumulative counters that start at zero when the service first starts and then just increase from there over time. Passing a raw histogram into histogram_quantile() would give you a percentile computed over the entire uptime of the service, rather than over a shorter moving window that shows you the most recent latency behavior. Applying the rate() function to the individual histogram counter time series yields a derived histogram that only represents the relative bucket increments over the last few minutes.

The query above returns one 90th percentile latency value in seconds for each label combination that the histogram tracks (e.g. method, path, and status code). Having this level of dimensional insight is great, but you will often want to see an aggregate view of your system’s latency behavior. For example, you may want to know the overall 90th percentile latency for each combination of HTTP path and method, but not care about individual service instances or status codes.

Luckily, Prometheus histograms are structured so you can aggregate multiple subdimensions together in a statistically valid way with the sum() aggregator. You can then feed the derived, aggregated histogram into the histogram_quantile() function to get the 90th percentile latency for each method-and-path combination:

histogram_quantile(0.9, sum by(method, path, le)(rate(api_request_duration_seconds_bucket[5m])))

Besides preserving the “method” and “path” labels in the sum() aggregator above, you also have to preserve the special “le” label that indicates the latency range for each histogram bucket time series. The “le” label name stands for “less than or equal” and its value defines the upper boundary of each bucket. For example, a bucket time series with an “le” label value of “0.5” would contain the number of requests handled so far that have completed within 500ms.
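
With the hypothetical labels from earlier, the aggregated query would return one 90th percentile latency value (in seconds) per method-and-path combination, for example:

{method="GET", path="/users"}  0.087
{method="POST", path="/orders"}  0.243

The numbers are illustrative; what matters is that the “instance” and “status” labels have been summed away and the “le” label has been consumed by histogram_quantile(), leaving only the dimensions you chose to keep.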

Don’t fret if you don’t fully understand how and why the query above works yet! It takes a while to fully make sense of histograms and their operations. The good news is that you can always query them using the exact same pattern as above: a histogram_quantile() call wrapping a sum() aggregation, which in turn wraps a rate() function around an input histogram. You can just copy this pattern (a generic template follows the list below) and replace a few key parameters:

  • The input histogram name (here: “api_request_duration_seconds_bucket”),
  • The quantile to calculate (here: “0.9”),
  • The time range you want to average (or smooth) over (here: “5m”),
  • The dimensions you want to split the result out by (here: “method” and “path”).
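
Putting that together, a generic template of the pattern looks like this (the angle-bracket placeholders are not PromQL syntax, just markers for the values you substitute):

histogram_quantile(<quantile>, sum by(<dimensions to keep>, le)(rate(<histogram>_bucket[<time range>])))

Filling in 0.9 as the quantile, “method” and “path” as the dimensions, “api_request_duration_seconds” as the histogram, and “5m” as the time range reproduces the query shown above.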

This pattern is so standard that both Prometheus and Grafana will auto-complete it as a query snippet for you when you start typing “histogram_quantile” into your PromQL text input. This will help you get from raw histograms to aggregated latency percentiles in no time.

Get started with PromQL queries

The three query patterns covered in this PromQL tutorial already handle a large part of your daily querying needs, especially when it comes to querying the most important system and service health indicators. Of course, PromQL and Prometheus come with a lot more features and functionality for building flexible and detailed monitoring, all using open source code.

If you would like more Prometheus tutorials to learn PromQL in more depth, take a look at PromLabs’ self-paced online courses. Chronosphere customers enjoy free access to these courses via an integrated training platform. Talk to your Chronosphere representative to learn how to get started with these courses today!

Besides helping you learn Prometheus, Chronosphere’s platform integrates other great tools that make your Prometheus-based monitoring journey a breeze: The Chronosphere Query Builder will help you build and understand your PromQL queries in a visual way, while the Query Accelerator ensures that your dashboards are blazingly fast, 24/7.
