Chris Ward: Prometheus and PromQL are open source projects, but they are also becoming standards that other vendors and projects work toward being compatible with. So first we should probably explain: what does being PromQL-compatible mean?
Julius: PromQL doesn’t have an official spec document where every feature is described in detail. Basically the code is the spec, and the reason all of this came up is that PromQL’s popularity led other vendors and projects to say, hey, we also support a PromQL-compatible interface now. So you have Grafana Cloud, Google Cloud, AWS, Chronosphere, and even open source projects like Thanos and Cortex. All of these implement PromQL as part of their system, and sometimes it doesn’t actually work exactly the same way as in Prometheus. Ideally it does, and that’s when we would say it is actually fully compatible. But then there are some vendors, mostly ones that I haven’t mentioned, who say they’re compatible but often return wildly different results for a given PromQL query, or differ in which functions they actually support.
Testing the Spec
Julius: In 2020, I started testing these different vendors and projects. Since there was no official spec, the approach to testing compatibility was to set up a reference Prometheus server, which is just a normal, vanilla Prometheus server, to show how PromQL is actually supposed to behave. We would have that server send the data it collects to a third-party system using the remote-write interface or some other compatible interface, and then have a tester tool run a whole set of comparison queries against both endpoints.
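As a rough illustration of the comparison step described above, here is a minimal Python sketch. It is not the actual tester tool: the function names (`instant_query`, `results_match`, `run_comparison`) and the tolerance value are assumptions made for this example, and it only handles instant-vector results. The one real interface it relies on is the Prometheus HTTP API's `/api/v1/query` endpoint.

```python
import json
import math
import urllib.parse
import urllib.request


def instant_query(base_url, promql, timestamp):
    """Run an instant query through the Prometheus HTTP API (/api/v1/query)."""
    params = urllib.parse.urlencode({"query": promql, "time": timestamp})
    url = f"{base_url}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(body.get("error", "query failed"))
    return body["data"]["result"]


def normalize(result):
    """Index an instant-vector result by label set so series order is ignored."""
    return {
        frozenset(sample["metric"].items()): float(sample["value"][1])
        for sample in result
    }


def results_match(reference, candidate, rel_tol=1e-5):
    """True if both results contain the same series with (nearly) equal values.

    rel_tol is an assumed tolerance for floating-point differences.
    """
    ref, cand = normalize(reference), normalize(candidate)
    if ref.keys() != cand.keys():
        return False
    return all(
        math.isclose(ref[key], cand[key], rel_tol=rel_tol)
        or (math.isnan(ref[key]) and math.isnan(cand[key]))
        for key in ref
    )


def run_comparison(queries, query_reference, query_candidate):
    """Run each query against both systems; return (passed_count, failed_queries)."""
    failures = []
    for promql in queries:
        expected = query_reference(promql)
        try:
            actual = query_candidate(promql)
        except Exception:
            # An error from the system under test (for example an
            # unsupported function) counts as a failed test case.
            failures.append(promql)
            continue
        if not results_match(expected, actual):
            failures.append(promql)
    return len(queries) - len(failures), failures
```

In use, `query_reference` and `query_candidate` would each wrap `instant_query` with the base URL of the reference Prometheus server and of the system under test (both URLs are placeholders you would supply), evaluated at the same timestamp so the two results are comparable. The share of passing queries is then a simple compatibility score.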
Want to hear a deeper explanation from Julius about Prometheus compatibility testing? Zoom to 2:30.
Showcasing Compatibility
Julius: I wrote a series of blog posts about testing the various vendors, which explained these differences while also trying to factually and neutrally report as much as possible, in an effort to bring transparency into the space. The importance of this exercise was that a lot of vendors—especially hosted service providers but also products you deploy yourself—see a lot of marketing value in being able to say, “We support PromQL,” as it’s becoming the lingua franca of time series database (TSDB) monitoring. And they have an incentive to claim compatibility, even if it’s not true.
We wanted to avoid ecosystem fragmentation and user confusion. If a team uses one service and it behaves differently from another, they cannot switch easily. And if a service says it supports PromQL, it should support it in the same way the upstream Prometheus server does. If it does not, in the worst case your alert will not fire: your alerting expression would just silently fail and you might miss an outage. In a slightly more benign case, maybe some curves on a graph would be skewed. It’s really important to bring transparency into that space and make it clear to users, when they’re choosing a solution, whether it’s really PromQL-compatible or whether that’s marketing-speak.
Chris Ward: And are different vendors and projects starting to use your PromQL compatibility testing as something to strive toward?
Julius: Yes, a number of vendors have made improvements. Some only had to fix one or two little issues in their service and now they’re 100 percent compatible, whereas before they were 99.x percent compatible. Other vendors are still at around 70 percent of test cases passing, but have improved from a much lower number.
Stay tuned to this space to read recaps of videos on new topics ranging from PromQL to high cardinality to the rise of cloud native.