In this episode of the Future of Observability video series, Julius Volz, co-creator and maintainer of PromQL and Prometheus, shares his insights with Chris Ward on where PromQL is headed and why he’s trying to make it harder to build PromQL queries that fail silently.
On: Nov 23, 2022
Chris Ward: What’s the roadmap for the next year or so for PromQL?
Julius Volz: So I would say PromQL is relatively stable by now, though we have been more open to adding certain features than in the past. More statistical functions might come in the future, but we’re unlikely to see a very radical change, at least before Prometheus 3.0 comes out, whenever that might be. Sometime in future years that would be a big change, but without introducing breaking changes, and while we still have Prometheus 2.0, it’s not going to change too radically. The one thing that may be quite an interesting and potentially radical change without actually breaking anything is support for high-definition histograms, which Björn Rabenstein and others are defining and currently working on.
That is not just a change in the language but also a change in the type of histogram metrics that Prometheus can store and process, all the way down to the storage layer. That could, with some luck, happen this year. That would be really cool because, at a very high level, it is going to give people cheaper histograms, and cheaper means you can store more data at higher resolutions. So in the end you get better insight into, for example, the request latency distribution of your service at a much lower cost.
Chris: In addition to the core team, do some of the commercial vendors that use PromQL in their hosted products and own products contribute ideas and language suggestions as well? Or is it generally a matter of getting most things people need?
Julius: For sure, people do contribute things. I would say there’s probably one project, which I won’t name right now, that was unhappy that we didn’t add certain features, and so they went off and created their own Prometheus-style software, which is totally fair. But the Prometheus team has been a bit on the conservative side, especially historically and not so much anymore, about completely changing some principles behind the language and adding new functions where we weren’t sure whether they would behave the way we expect in the long run.
So there’s a whole discussion about our rate implementation versus one that is called xrate, and there are subtle differences between them. The xrate probably also has its benefits, but you would have to prove that it doesn’t introduce additional problems. So for sure people have contributed things, but I would say the language hasn’t changed in a major way because of that. Most people just went and implemented what we had, or at least tried to.
Chris: Final wrap-up question. Two, three, four years in the future what do you feel will be the state of Prometheus and PromQL and what would you like it to be?
Julius: Ideally, if it’s really that long into the future, it would be great for there to be a Prometheus 3.0 at some point that irons out some of the kinks of the current language design. It would be great to have first-class metric typing in the time series database and the query language. Right now everything is just a flat time series: some functions expect a counter, but they can’t really check whether it’s a counter, or they expect a histogram but can’t really check that. They will not give you an error; they will just give you a wrong result. Having more of that kind of support would be really great for users.
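To make the point about untyped series concrete, here is a minimal Python sketch. It is not Prometheus code: `naive_increase` is a hypothetical, heavily simplified stand-in for a counter function that treats every drop in value as a counter reset, and the sample values are invented for illustration.

```python
def naive_increase(samples):
    """Sum the increase across samples, assuming they come from a counter.

    Simplified assumption: any drop between two samples is treated as a
    counter reset to zero, so the new value counts entirely as increase.
    Nothing here can verify that the input really is a counter.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            total += cur  # assumed reset: counted from zero up to cur
    return total


# Hypothetical samples: a real counter only ever goes up (barring resets).
counter_samples = [100.0, 110.0, 125.0, 130.0]
# Hypothetical samples: a gauge (say, queue length) moves up and down.
gauge_samples = [50.0, 20.0, 45.0, 10.0]

counter_increase = naive_increase(counter_samples)  # 30.0, meaningful
gauge_increase = naive_increase(gauge_samples)      # 55.0, no error raised
```

The gauge actually fell by 40 over the window, yet the function reports an increase of 55 without complaint. With first-class metric types, a call like this could be rejected instead.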
Julius: The other thing is, again, high-definition histograms. I hope that happens sooner than (the Prometheus 3.0 release). And then making it harder to build queries that fail silently. Currently there are a lot of ways, especially around binary operators, filter expressions, and even metric selectors, to create expressions that just yield an empty output, because of Prometheus’s schema-less data model. And an empty output in the alerting use case means everything is fine, but maybe not everything is actually fine just because your expression didn’t produce an output. It could also be that you wrote a metric name wrong so it didn’t select anything, or that you didn’t get the binary operator matching quite right between two different sets of time series. That also just gives you an empty output rather than some kind of warning or error.
It’s a nice, super-flexible language that doesn’t have super-strict error checking and expectations of the schema, because data can look like whatever you want. On the other hand, it does allow for a lot of cases where people write queries that don’t behave the way they want, and in the worst case you get silently failing alerts. It would be great if someone has ideas for how to improve that in a new version of the language at some point.
And then in Prometheus in general, I would say there’s just a lot more integration happening around it now, even in tools like Chronosphere, Grafana and others, where people are marrying the Prometheus data model and query language with other signal types. So you have traces, logs, et cetera, and you can jump between them even if they come from different systems. That makes it easier to go from a histogram showing that a lot of slow events are happening to a specific slow event, with a trace telling you why it was slow.
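The silent-failure case for alerts can be sketched in a few lines of Python. This is a toy model and not how Prometheus is implemented: the in-memory `SERIES` list, the metric name, and the functions `select`, `alert_fires`, and `alert_fires_checked` are all hypothetical names chosen for illustration.

```python
# Toy schema-less store: a series is just a name plus labels and a value.
SERIES = [
    {"name": "http_request_errors_total", "labels": {"job": "api"}, "value": 42.0},
]


def select(name):
    """Return every series with this metric name; unknown names are not an error."""
    return [s for s in SERIES if s["name"] == name]


def alert_fires(name, threshold):
    """Filter-style alert: any series left in the output means the alert fires."""
    return [s for s in select(name) if s["value"] > threshold]


def alert_fires_checked(name, threshold):
    """Same alert, but an empty selection is reported instead of read as healthy."""
    matched = select(name)
    if not matched:
        raise LookupError(f"selector {name!r} matched no series")
    return [s for s in matched if s["value"] > threshold]


firing = alert_fires("http_request_errors_total", 10)    # one series: alert fires
silent = alert_fires("http_requests_errors_total", 10)   # typo: empty, looks healthy
```

The misspelled name produces the same empty result as a healthy system, which is the ambiguity described above. The checked variant is one possible direction under these assumptions. In PromQL today, a separate alert using the `absent()` function is a common way to catch a selector that matches nothing.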
Keep an eye on this space for new videos and conversations about the future of observability. Also, make sure to subscribe to the Chronosphere YouTube channel so you don’t miss any future videos.