Workshop: Using advanced Prometheus queries

Expand your PromQL query toolbox and learn more advanced queries for visualizing collected metrics data.

On: Oct 5, 2023

Are you looking to get away from proprietary instrumentation?

Are you interested in open source observability but lack the knowledge to just dive right in?

This workshop is for you, designed to expand your knowledge and understanding of open source observability tooling that is available to you today.

Dive right into a free, online, self-paced, hands-on workshop introducing you to Prometheus. Prometheus is an open source systems monitoring and alerting toolkit that enables you to hit the ground running with discovering, collecting, and querying your observability data today. Over the course of this workshop you will learn what Prometheus is and what it is not, install it, start collecting metrics, and learn everything you need to know to become effective at running Prometheus in your observability stack.

Previously, I shared an introduction to Prometheus, installing Prometheus, an introduction to the query language, and exploring basic queries as free online workshop labs. In this article you’ll continue your journey by using advanced Prometheus queries with PromQL.

Your learning path dives deeper into using advanced PromQL queries. Note this article is only a short summary, so please see the complete lab found online here to work through it in its entirety yourself:

The following is a short overview of what is in this specific lab of the workshop. Each lab starts with a goal; in this case it is fairly simple:

This lab takes you deeper into PromQL expanding your query toolbox with more advanced queries for visualizing collected metrics data.

The lab starts with a review of how you’ve learned to build and execute basic PromQL queries so far. Up to now you’ve done this using the default Prometheus console expression browser and graphs.

For this lab you’ll be diving deeper into PromQL. To broaden your knowledge of the tooling available, you’ll install, configure, and query using an open source query tool called PromLens. This is one of the best assistants you can find to help you build queries and understand what you are querying while seeing the results directly.

Installing PromLens

Your first task is to install PromLens on your machine. It is a standalone tool for learning PromQL and displaying insights into the queries you are running.

To test out your new installation, you dive right into the concept of nested queries. A PromQL expression is often not a single query but a set of nested expressions, each one evaluated and then used as an argument or operand by the expression above it in the nested structure. You run examples of nested queries and explore their results using the teaching aspects of PromLens in the explainer tab.

When this query is entered:

rate(demo_api_request_duration_seconds_count{job="services"}[5m])

PromLens then shows you an explanation of each part of this query in the explainer tab.

These explanations are extremely valuable when you are first starting out and trying to master a complex functional language like PromQL.
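To make the nesting in the query above concrete, here is a minimal Python sketch of what the outer rate() does with the range vector produced by the inner selector. The function name simple_rate and the sample values are hypothetical, and the logic is a simplification: it accounts for counter resets but skips the extrapolation to the window boundaries that Prometheus itself performs.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Approximate per-second increase of a counter across one range window.

    Simplified on purpose: handles counter resets, but does not extrapolate
    to the window boundaries the way Prometheus does.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples in the range")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # A drop means the counter restarted from zero,
            # so the whole new value counts as increase.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical samples, scraped every 60 seconds inside a 5m window.
# The counter resets between the third and fourth sample.
window = [(0.0, 100.0), (60.0, 130.0), (120.0, 160.0), (180.0, 10.0), (240.0, 40.0)]
print(simple_rate(window))
```

The inner selector with [5m] only gathers the samples; the per-second value comes from the function wrapped around it, which is the layering the PromLens explainer walks you through.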

Language theory

Before diving in further, you explore some of the language theory and definitions that are crucial to learning to use PromQL effectively. There are two concepts of type when talking about querying Prometheus, and it’s crucial you understand the difference:

metric type: as reported by a scraped target: counter, gauge, histogram, summary, or untyped.
expression type: data type of a PromQL expression: string, scalar, instant vector, or range vector.

PromQL has no concept of metric types. It’s only concerned with expression result types. Each PromQL expression has a type, and each function, operator, or other type of operation requires its arguments to be of a certain expression type.

Beyond the expression types, there are also 10 different node types, which are the kinds of queries or expressions you can write. Here is the list with details about each one:

number literals: 6.45
string literals: "hello o11y" occur infrequently, used as parameter values to functions.
instant vector selectors: some_metric{job="services"} were previously explained.
range vector selectors: some_metric{job="services"}[15m] were previously explained.
aggregation: sum by(job) (some_metric) allows aggregating over multiple series, always yields an instant vector.
unary operators: -some_metric negates any scalar or instant vector values, returns the same type it was applied to.
binary operators: some_metric_1 + some_metric_2 returns a scalar if both operands are scalars, otherwise an instant vector.
function calls: rate(some_metric[15m]) takes input parameters of varying types, returns varying types.
sub-queries: (expression)[1d:] takes an instant vector expression as input, returns a range vector.
parentheses expressions: (42) may return string, scalar, instant vector, or range vector, depending on usage.
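The type rules in this list can be written down as a small checker. The following Python sketch is only an illustration of those rules, not how Prometheus implements them; the names ValueType, binary_result, aggregation_result, and subquery_result are hypothetical.

```python
from enum import Enum


class ValueType(Enum):
    STRING = "string"
    SCALAR = "scalar"
    INSTANT_VECTOR = "instant vector"
    RANGE_VECTOR = "range vector"


def binary_result(lhs: ValueType, rhs: ValueType) -> ValueType:
    """Binary operators: a scalar only if both operands are scalars."""
    allowed = {ValueType.SCALAR, ValueType.INSTANT_VECTOR}
    if lhs not in allowed or rhs not in allowed:
        raise TypeError("binary operators take scalars or instant vectors")
    if lhs is ValueType.SCALAR and rhs is ValueType.SCALAR:
        return ValueType.SCALAR
    return ValueType.INSTANT_VECTOR


def aggregation_result(operand: ValueType) -> ValueType:
    """Aggregations such as sum by(job) (...) always yield an instant vector."""
    if operand is not ValueType.INSTANT_VECTOR:
        raise TypeError("aggregations take an instant vector")
    return ValueType.INSTANT_VECTOR


def subquery_result(operand: ValueType) -> ValueType:
    """Sub-queries such as (expression)[1d:] turn an instant vector into a range vector."""
    if operand is not ValueType.INSTANT_VECTOR:
        raise TypeError("sub-queries take an instant vector expression")
    return ValueType.RANGE_VECTOR


# Type of: sum by(job) (some_metric) + 2
summed = aggregation_result(ValueType.INSTANT_VECTOR)
print(binary_result(summed, ValueType.SCALAR).value)
```

Thinking about a query this way explains most type errors you run into: for example, passing a range vector directly to a binary operator fails because that operator only accepts scalars and instant vectors.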

Feels like we are entering the realm of mathematics, and you might even remember some of this theory from your computer science courses at university, no? Don’t worry, only the short foundational theory is covered before you jump right back into the hands-on application of it all.

Advanced topics

You jump right into more advanced topics: histograms and quantiles, calculating latency, aggregating away extra metric dimensions (cardinality problems), applying filters, creating queries with thresholds for alerting rules, filtering with time series data, and filtering with booleans. You explore the set operators available to you (AND, OR, UNLESS), explore and manipulate metrics with timestamps, and set up detection queries to discover slow batch jobs. Finally, you set up a second services demo instance to explore how to query for running instance health in your infrastructure, and learn how to smooth out the spiky graphs you generate with complex queries.
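As a rough illustration of the set operators mentioned above, the following Python sketch models an instant vector as a dictionary keyed by label set. The helper names and the sample vectors are hypothetical, and the sketch assumes the default matching behavior, where series match when their label sets are identical.

```python
from typing import Dict, FrozenSet, Tuple

Labels = FrozenSet[Tuple[str, str]]
Vector = Dict[Labels, float]  # an instant vector: one value per label set


def labels(**kv: str) -> Labels:
    """Build a hashable label set from keyword arguments."""
    return frozenset(kv.items())


def vec_and(lhs: Vector, rhs: Vector) -> Vector:
    """AND: keep left-hand series that have a matching label set on the right."""
    return {k: v for k, v in lhs.items() if k in rhs}


def vec_or(lhs: Vector, rhs: Vector) -> Vector:
    """OR: all left-hand series, plus right-hand series with no match on the left."""
    result = dict(lhs)
    for k, v in rhs.items():
        result.setdefault(k, v)
    return result


def vec_unless(lhs: Vector, rhs: Vector) -> Vector:
    """UNLESS: keep left-hand series that have no matching label set on the right."""
    return {k: v for k, v in lhs.items() if k not in rhs}


# Hypothetical vectors: instances that are up, and instances with slow jobs.
up = {labels(instance="a"): 1.0, labels(instance="b"): 1.0}
slow = {labels(instance="b"): 7.5, labels(instance="c"): 9.0}

print(sorted(dict(k)["instance"] for k in vec_and(up, slow)))
print(sorted(dict(k)["instance"] for k in vec_or(up, slow)))
print(sorted(dict(k)["instance"] for k in vec_unless(up, slow)))
```

Note that in every case the values come from the side a series was taken from; the other operand only decides which series survive.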

Smoothing out a spiky graph was pretty fun to see, so let’s slow down here and share the query that generates one:

go_goroutines{job="services"}

This does indeed produce something pretty ugly.

To make this graph more useful you smooth it out using averages over time as follows:

avg_over_time(go_goroutines{job="services"}[10m])

This smooths the graph into something you can make sense of.
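To see why averaging over time smooths the graph, here is a minimal Python sketch of the idea behind avg_over_time(). The sample values are made up, and the window boundary handling (start excluded, end included) is an assumption made for this sketch rather than a statement about every Prometheus version.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, gauge value)


def avg_over_time(samples: List[Sample], eval_time: float, window_seconds: float) -> float:
    """Average all sample values that fall inside the window ending at eval_time."""
    in_window = [
        value
        for timestamp, value in samples
        if eval_time - window_seconds < timestamp <= eval_time
    ]
    if not in_window:
        raise ValueError("no samples in the window")
    return sum(in_window) / len(in_window)


# A hypothetical spiky gauge: alternates between 10 and 50 on every 60 s scrape.
spiky = [(60.0 * i, 10.0 if i % 2 == 0 else 50.0) for i in range(11)]

# Evaluating the raw gauge jumps between 10 and 50 from one point to the next,
# while the 10-minute average at t=600 settles in the middle.
print(avg_over_time(spiky, eval_time=600.0, window_seconds=600.0))  # prints 30.0
```

Each point on the smoothed graph is the mean of the last 10 minutes of samples, so a single spike moves the line only slightly. The trade-off is that a longer window hides short spikes you might care about.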

There is more to learn in this lab than fits into an article, so make sure to take your time and run through it yourself. Soon you’ll be running advanced queries to solve all kinds of observability issues!