Chris Ward: What’s been your involvement with PromQL and Prometheus?
Julius Volz: I cofounded Prometheus together with Matt Proud at SoundCloud, originally coming from Google in 2012. Back then I was responsible for coming up with this query language, which admittedly is similar to the query language of the Borgmon monitoring system at Google where I had just come from, but still has plenty of its own functions and quirks. So I guess I can say I’m the original creator of the Prometheus Query Language, also known as PromQL.
Chris Ward: That’s interesting because of course Borg also became Kubernetes which doesn’t have a query language.
Julius Volz: Yes, Borg also didn’t have a query language, but it forms a nice square of relationships, in that Borg inspired Kubernetes, and Borgmon was used for monitoring stuff on Borg, and that inspired Prometheus, and Prometheus now works very well together with Kubernetes and vice versa. They also both have these Greek origin themes and so it all fits very well together.
Chris Ward: You mentioned this query language PromQL. Briefly explain what is PromQL and, possibly most importantly, why would I want to use PromQL with Prometheus? What am I trying to do with PromQL in relation to Prometheus?
Julius Volz: PromQL is the Prometheus query language. The idea in Prometheus is that you collect a lot of data about your systems in the form of time series before you do anything else. Then, to do anything useful with that data you use the PromQL language to slice and dice your data, to select the right data you want, to aggregate over it, to do rate computations, to even do advanced computations between whole sets of time series like arithmetic between whole groups of numbers that match together — that’s a big innovation in PromQL. And to then give you answers about what is happening in your infrastructure and your services if something is going wrong. Things like, how many requests per second are you getting, what’s the temperature in a given room, et cetera. It’s really a functional language that is highly optimized toward doing time series computations and systems monitoring operations.
Chris Ward: What were some of the design decisions behind PromQL? A lot of people are familiar with the concept of a query language like SQL in traditional databases. What was the design decision to create something new instead of leveraging something pre-existing?
Julius Volz: On a very practical level we were creating Prometheus being inspired by Borgmon. We had worked with Borgmon’s query language, and, admittedly, PromQL is quite similar to what Borgmon did. So on one hand it was not having to invent a full new query language from scratch because we already knew one.
On the other hand, PromQL and the Borgmon query language have the advantage of being very specialized towards time series computations, especially surrounding a concept called vector matching, where you can do binary operations, filter operations, and other types of operations between two entire sets of time series that have corresponding identities. This means that an operator automatically matches up the series with identical label sets on both of its sides, performs the operation between each matched pair, and gives you the results. (Listen at around 4:20 for Julius’s detailed example.)
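As a rough illustration of the vector matching idea described above, here is a minimal Python sketch. It is not Prometheus code: the `labels` and `vector_divide` helpers and the sample values are hypothetical, and it covers only the simplest one-to-one case where label sets must be identical on both sides.

```python
# Minimal sketch of one-to-one vector matching (illustrative only).
# Each "instant vector" is a dict that maps a label set to a sample value.

def labels(**kv):
    """Build a hashable label set, e.g. labels(job="api", instance="a")."""
    return frozenset(kv.items())

def vector_divide(left, right):
    """Divide samples whose label sets are identical on both sides.

    Series that have no partner on the other side are dropped.
    Assumes non-zero divisors to keep the sketch short.
    """
    return {key: left[key] / right[key] for key in left if key in right}

# Hypothetical per-instance error and request rates.
errors = {
    labels(job="api", instance="a"): 5.0,
    labels(job="api", instance="b"): 1.0,
    labels(job="api", instance="c"): 2.0,  # no matching series on the right
}
requests = {
    labels(job="api", instance="a"): 100.0,
    labels(job="api", instance="b"): 50.0,
}

# One result per matched label set: the error ratio of each instance.
error_ratios = vector_divide(errors, requests)
for key, ratio in sorted(error_ratios.items(), key=lambda kv: sorted(kv[0])):
    print(dict(key), ratio)
```

In PromQL itself the whole operation would be a single binary expression between two selectors; the dictionary lookup here only stands in for the matching step.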
Chris Ward: What are the most common queries, calculations, et cetera that people might want to do in PromQL?
Julius Volz: With pretty standard systems monitoring you have, for example, counter metrics that tell you how many requests a given service is handling. And then you have a rate function to calculate the rate of increase of such a counter. Whether that’s requests or some other kind of event that is happening in the system, it tells you the per-second rate of a counter. And then you may want to add a threshold to that to tell you if certain rates are too high, such as error rates. Or you may have histograms that tell you about the latency distribution of a service when it’s handling requests. And then you have a function that goes together with that kind of histogram metric that tells you approximately what a 90th or 95th or 99th percentile would be for those request latencies.
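To make the counter rate and histogram percentile ideas above concrete, here is a simplified Python sketch. The helper names and sample data are hypothetical, and the logic only approximates what PromQL's `rate()` and `histogram_quantile()` functions do: for instance, the real `rate()` also extrapolates to the edges of the time window, which this sketch skips.

```python
import math

def simple_rate(samples):
    """Per-second increase of a counter over (timestamp_seconds, value) samples."""
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value is treated as a counter reset (restart from zero).
        increase += cur - prev if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])

def simple_quantile(q, buckets):
    """Estimate quantile q (0 < q <= 1) from cumulative (upper_bound, count) buckets.

    Assumes a non-empty histogram and interpolates linearly inside the
    bucket that contains the target rank.
    """
    buckets = sorted(buckets)
    rank = q * buckets[-1][1]
    lower, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                return lower  # fall back to the highest finite bound
            return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)
        lower, prev_count = upper, count

# Hypothetical request counter scraped every 15 s, with one reset at t=45.
counter = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
requests_per_second = simple_rate(counter)
too_high = requests_per_second > 1.0  # a threshold check, as in an alert rule

# Hypothetical latency histogram: cumulative counts per upper bound in seconds.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p95 = simple_quantile(0.95, latency_buckets)

print(requests_per_second, too_high, p95)
```

The percentile is only an estimate because the histogram records counts per bucket, not individual latencies, which matches the "approximately" in the answer above.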
You may have a lot of these queries that just compute ratios or sums between things like the disk usage example that I just mentioned. You have predictive queries that would predict if the disk is going to run full in one day based on the last four hours of usage.
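The predictive disk query mentioned above can be sketched as a least-squares line fit followed by an extrapolation. This is a minimal Python illustration in the spirit of PromQL's `predict_linear()` function, not its actual implementation; the function name `predict_linear_sketch` and the free-space samples are hypothetical.

```python
def predict_linear_sketch(samples, seconds_ahead):
    """Fit a straight line to (timestamp_seconds, value) samples and
    extrapolate it seconds_ahead past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return intercept + slope * (samples[-1][0] + seconds_ahead)

# Hypothetical free disk space in GB, sampled hourly over the last four hours.
free_gb = [(0, 100.0), (3600, 90.0), (7200, 80.0), (10800, 70.0), (14400, 60.0)]

# Predicted free space one day (86400 s) from the last sample.
predicted = predict_linear_sketch(free_gb, 86400)

# A value at or below zero means the disk is on track to fill up within a day.
will_fill = predicted <= 0
print(predicted, will_fill)
```

With these made-up numbers the disk loses 10 GB per hour, so the fitted line crosses zero well before the one-day horizon and the check fires.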
Basically any kind of constructs that would help you monitor the request rates, error rates, latency distributions, utilizations, and saturations of your services and systems.